Inflate SD to a Video Model

Key Insight

Temporal inflation is the foundational move behind much of modern video diffusion. You start from a pretrained Stable Diffusion 1.5 U-Net, a model that only knows how to denoise a single still image, and extend it along the time dimension by inserting two kinds of new layers: a temporal convolution that slides a small filter along the frame axis, and temporal attention that lets each spatial position attend to that same position in every other frame. Both are initialized as a pass-through, so before any training the inflated model denoises each frame exactly as the original image generator would. Fine-tuning on a small video dataset then teaches only the new layers how things move, while the spatial layers keep all their hard-won knowledge of appearance. Unlike the first-frame-conditioned Tiny I2V model, which inflates with temporal convolutions alone to animate one given image, here you build a full video diffusion model that generates an entire clip from noise. That makes this the cheapest honest way to reach a video generator that feels trained from scratch without paying for from-scratch training.
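The pass-through property is easy to check on a toy example. The sketch below is a minimal NumPy illustration, not the actual Stable Diffusion code: the function names `temporal_conv` and `temporal_attention`, the tensor layout `(frames, channels, height, width)`, the single attention head, and the zero-padding at clip boundaries are all assumptions made for brevity. It shows one common way to get a pass-through start, namely wrapping each new layer in a residual connection and zero-initializing its last linear map.

```python
import numpy as np


def temporal_conv(x, kernel):
    """Residual 1D convolution along the frame axis.

    x:      (frames, channels, height, width)
    kernel: (taps, channels_out, channels_in), with an odd number of taps
    """
    taps = kernel.shape[0]
    pad = taps // 2
    frames = x.shape[0]
    # Zero-pad the frame axis so the output keeps the same number of frames.
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for t in range(taps):
        # Mix the channels of the frame at temporal offset (t - pad).
        out += np.einsum("oc,fchw->fohw", kernel[t], xp[t:t + frames])
    return x + out  # residual: an all-zero kernel gives exact pass-through


def temporal_attention(x, wq, wk, wv, wo):
    """Residual single-head self-attention across frames, per spatial position."""
    frames, channels, height, width = x.shape
    # Treat every pixel as its own sequence of `frames` tokens: (H*W, F, C).
    tokens = x.transpose(2, 3, 0, 1).reshape(height * width, frames, channels)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(channels)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    attended = (weights @ v) @ wo  # wo is all zeros at initialization
    attended = attended.reshape(height, width, frames, channels)
    return x + attended.transpose(2, 3, 0, 1)  # back to (F, C, H, W)


rng = np.random.default_rng(0)
frames, channels, height, width = 8, 4, 5, 5
# Stand-in for the per-frame features produced by a spatial layer.
video = rng.normal(size=(frames, channels, height, width))

# Pass-through initialization: zero conv kernel, zero attention output projection.
kernel = np.zeros((3, channels, channels))
wq, wk, wv = (rng.normal(size=(channels, channels)) for _ in range(3))
wo = np.zeros((channels, channels))

inflated = temporal_attention(temporal_conv(video, kernel), wq, wk, wv, wo)
print("identity at init:", np.array_equal(inflated, video))
```

Once fine-tuning moves `kernel` and `wo` away from zero, each output frame starts to depend on its neighbours, which is where motion enters. In a real training setup the same idea would typically be paired with updating only the temporal parameters and leaving the pretrained spatial weights unchanged.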
