Train a VAE for Diffusion

Key Insight

A latent diffusion model can only be as good as the VAE whose latent space it generates in, so this project trains that compressor properly before any diffusion happens. It uses CelebA faces, where it is easy to judge whether reconstructions look real. The recipe is the one Stable Diffusion's VAE descends from: a perceptual loss (LPIPS) for sharp textures, an adversarial loss from a discriminator so that fine detail looks real instead of blurry, and a light KL penalty to keep the latent space smooth enough to diffuse in. The point is to verify first that the VAE is a faithful compressor: one that loses detail in reconstruction silently caps the quality of any diffusion model later trained on its latents.
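As a rough illustration of how the three terms might be combined on the autoencoder side, here is a minimal sketch in plain Python. It is not the project's actual training code. The function names, the extra L1 pixel term, the non-saturating `-D(x_hat)` form of the adversarial term, and all weight values are assumptions chosen for illustration. The perceptual distance and the discriminator logit are passed in as plain numbers, since computing them requires trained networks.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def vae_generator_loss(
    recon_l1: float,          # mean |x - x_hat| (assumed pixel term, not named in the text)
    perceptual: float,        # LPIPS distance between x and x_hat, computed elsewhere
    disc_logit_fake: float,   # discriminator score on the reconstruction x_hat
    mu: Sequence[float],
    logvar: Sequence[float],
    perceptual_weight: float = 1.0,  # hypothetical weight
    adv_weight: float = 0.5,         # hypothetical weight
    kl_weight: float = 1e-6,         # "light" KL penalty; value is illustrative
) -> float:
    """Weighted sum of the reconstruction, perceptual, adversarial, and KL terms."""
    # The autoencoder is rewarded when the discriminator scores x_hat as real.
    adversarial = -disc_logit_fake
    kl = kl_to_standard_normal(mu, logvar)
    return (
        recon_l1
        + perceptual_weight * perceptual
        + adv_weight * adversarial
        + kl_weight * kl
    )


if __name__ == "__main__":
    loss = vae_generator_loss(
        recon_l1=0.1, perceptual=0.2, disc_logit_fake=0.4,
        mu=[0.3, -0.1], logvar=[0.0, -0.2],
    )
    print(f"generator-side loss: {loss:.6f}")
```

The KL weight is kept very small here to reflect the "light" penalty described above: it nudges the latents toward a smooth distribution without forcing the encoder to discard detail that the reconstruction terms are trying to preserve.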