Diffusion on Latents

Key Insight

This project assembles the two halves of modern video generation: the trained 3D VAE goes in front of a small video diffusion model, so denoising runs entirely on compressed latent video instead of raw pixels. This is the same latent-diffusion move that made Stable Diffusion practical for images. Because the latent tensor is roughly 100× smaller than the pixel tensor, each training step is far cheaper, and much longer clips fit in memory than pixel-space diffusion allows on the same hardware. Comparing the two side by side makes the headline result of Phase 5 concrete: latent diffusion trains faster and reaches higher quality at the same compute, because the model spends its capacity on motion and structure rather than on modeling the pixel-level texture that the VAE decoder already reconstructs.
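
To make the "roughly 100× smaller" figure concrete, here is a minimal arithmetic sketch. The compression factors are assumptions chosen for illustration (4× temporal and 8× spatial downsampling, 8 latent channels) and are not taken from this project's VAE. The helper names `latent_shape` and `compression_ratio` are hypothetical.

```python
from math import prod


def latent_shape(pixel_shape, t_down=4, s_down=8, latent_channels=8):
    """Latent (C, T, H, W) for a pixel clip of shape (C, T, H, W).

    t_down, s_down and latent_channels are assumed values for a
    hypothetical 3D VAE, not this project's actual configuration.
    """
    _, t, h, w = pixel_shape
    return (latent_channels, t // t_down, h // s_down, w // s_down)


def compression_ratio(pixel_shape, **vae_kwargs):
    """How many times fewer elements the latent holds than the pixel clip."""
    return prod(pixel_shape) / prod(latent_shape(pixel_shape, **vae_kwargs))


# A 16-frame RGB clip at 256x256 resolution.
pixel = (3, 16, 256, 256)
latent = latent_shape(pixel)
ratio = compression_ratio(pixel)
print(latent, ratio)
```

Under these assumed factors the ratio works out to 96×, which is in line with the rough figure above. A VAE with more latent channels or milder downsampling would give a smaller ratio, so the exact number depends on the VAE actually trained.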
