Train a Latent DDPM

Key Insight

This project is a miniature of how Stable Diffusion works: instead of running DDPM directly on pixels, you first compress each 32×32 CIFAR-10 image with a VAE that downsamples 8× per side, giving a tiny 4×4 latent, and then train a small U-Net to denoise on that latent grid. This is the core idea of latent diffusion (LDM). Because the latent holds roughly 48× fewer numbers than the pixels, every training and sampling step is much cheaper, and you decode back to pixels only once, at the very end. Comparing it to a pixel-space DDPM on the same data makes the central lesson concrete: most of pixel-space diffusion's cost is spent on pixel detail the VAE can reconstruct anyway.
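The 48× figure can be checked with a few lines of arithmetic. The sketch below assumes a latent with 4 channels, which the text does not state but which is the channel count that makes a 4×4 latent come out to 48× smaller than a 32×32×3 image. The helper names `num_values` and `latent_shape` are hypothetical and exist only for this illustration.

```python
def num_values(height, width, channels):
    """Number of scalar values in a (channels, height, width) tensor."""
    return height * width * channels


def latent_shape(height, width, downsample, latent_channels):
    """Shape of the latent after a VAE that shrinks each side by `downsample`."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return height // downsample, width // downsample, latent_channels


# CIFAR-10 images are 32x32 RGB.
pixel_count = num_values(32, 32, 3)

# Assumption: 4 latent channels (not stated in the text, inferred from the 48x figure).
lat_h, lat_w, lat_c = latent_shape(32, 32, downsample=8, latent_channels=4)
latent_count = num_values(lat_h, lat_w, lat_c)

ratio = pixel_count / latent_count
print(f"pixels: {pixel_count}, latent: {latent_count}, ratio: {ratio:.0f}x")
```

Under that assumption the image holds 3072 values and the latent holds 64. A different channel count would change the ratio, so treat the 48× as specific to this configuration.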