Video Generation: From Beginner to Advanced
A comprehensive guide to understanding and building video generation systems — from the fundamentals of treating video as a spatiotemporal signal, through latent video diffusion and Diffusion Transformers (DiT), to long-form generation, world models, and the frontier of real-time interactive video.
Video generation = image generation + time. That one sentence is both true and dangerously misleading. The "+ time" introduces problems that have no image-gen analog: temporal consistency, motion priors, enormous compute (a 5-second 720p clip is ~150 images), and the brutal scarcity of high-quality paired video-text data. This guide is about how the field solved (and is still solving) those problems.
Scope and boundaries
This guide owns the generative modeling of video — the moment you add a time axis to image generation and have to model motion, temporal consistency, and the compute explosion that comes with both. To keep the AI Learning Guides mutually exclusive and collectively exhaustive (MECE), it deliberately stops at a few borders and links forward to the guide that owns each one.
In scope — this guide owns these topics:
- Video as a spatiotemporal signal — shapes, frame rate, codecs, the cost model, and why latent compression is non-negotiable
- The time axis on top of diffusion — temporal layers, (2+1)D vs full spatiotemporal attention, temporal inflation of pretrained image models
- 3D / causal video VAEs and discrete video tokenizers — the compressors that make video diffusion tractable
- Video DiTs and Sora-class models — patchified latent video, 3D RoPE, flow matching applied to video
- Image-to-video, video-to-video, and video-specific control — first-frame/keyframe conditioning, camera and motion control, talking heads, video editing
- Long-form and consistent video — sliding-window, hierarchical, autoregressive, and streaming generation
- Generative world models and interactive/playable video — action-conditioned video as a simulator (Genie, GameNGen, driving/embodied world models), from the generation side
- Native audio-video joint generation, video-data engineering, and video evaluation — the parts that differ from the image recipe
Out of scope — deferred to the owning guide:
- Diffusion/VAE/GAN/flow fundamentals, U-Net and DiT mechanics, score/EDM theory, image tokenizers (VQ-VAE, FSQ, LFQ) → Image Generation. This guide assumes all of it and only adds the time axis. If DDPM, latent diffusion, classifier-free guidance, or flow matching feel fuzzy, fix that there first.
- CLIP/T5 text-encoder training, VLMs, any-to-any models, and video understanding → Multimodal Learning. We use a frozen text encoder and recaption with a VLM, but training those, and encoding video for joint reasoning, lives there. We own video synthesis; they own video understanding.
- Model-based RL, the Dreamer policy-learning loop, planning/MPC in a learned model, and training a policy "in the dream" → RL Phase 6 and RL Phase 10. We build the action-conditioned video generator; using it as an environment for control is theirs.
- Vision-Language-Action robot policies, sim-to-real, and embodied control → Robotics Phase 8 and Robotics Phase 9.
- Serving, batching, and inference-latency engineering for deployed video models → Inference Systems; kernel-level performance and quantization → AI Hardware. We discuss step-distillation and real-time generation as modeling problems and link out for the systems side.
- Tensor, autograd, mixed-precision, distributed-training, and training-loop fundamentals → PyTorch Deep Dive.
When this guide touches an out-of-scope topic, it does so only to the depth needed to make a video-generation modeling decision, and it links to the owning guide.
Table of Contents
- Phase 0: Prerequisites
- Phase 1: Foundations — Video as a Tensor
- Phase 2: Classical and Early Neural Video Generation
- Phase 3: Image-to-Video as a Stepping Stone
- Phase 4: Video Diffusion — The Modern Foundation
- Phase 5: Latent Video Diffusion and Video Tokenizers
- Phase 6: Diffusion Transformers (DiT) and Sora-Class Models
- Phase 7: Conditioning, Control, and Editing
- Phase 8: Long-Form and Consistent Video
- Phase 9: World Models and Interactive Video
- Phase 10: Training at Scale, Evaluation, and Frontier Topics
- Suggested Timeline
- Key Advice
- Common Pitfalls to Avoid
- Additional Resources
- Glossary
Phase 0: Prerequisites
Video generation is one of the most demanding topics in modern ML. The prerequisites are non-negotiable — and unusually, they are almost entirely owned by other guides in this collection. This guide adds the time axis; it assumes the rest.
Concepts to Know
The single most important prerequisite is image diffusion. Work through Image Generation Phases 5–8 before starting here; nearly everything below is "that, with a time axis." Specifically you should be fluent in:
- Diffusion models (Image Gen Phase 5): forward/reverse process, DDPM, DDIM, classifier-free guidance, noise schedules
- Score/EDM and flow matching (Image Gen Phase 6 and Phase 8): the σ-parameterization, rectified flow — most 2024+ video models train this way
- Latent diffusion (Image Gen Phase 7): VAE encoder/decoder, training a diffusion model in latent space (i.e., Stable Diffusion)
- DiT and image tokenizers (Image Gen Phase 3 and Phase 8): patchification, AdaLN-Zero, VQ-VAE / FSQ / LFQ
- U-Net architecture: down/up blocks, skip connections, attention blocks
- Transformers and ViT: self-attention, cross-attention, positional embeddings, patchification, 1D-sequence treatment of images
- Text conditioning (a frozen CLIP/T5 encoder here; training one is Multimodal Learning's job): cross-attention for text→image
- PyTorch fluency (PyTorch Deep Dive): mixed precision, distributed training (DDP/FSDP), memory profiling
- Optical flow (helpful): what it is and why it shows up everywhere in video
The One Equation Everything Comes Back To
A video is a tensor of shape (T, H, W, C) — frames × height × width × channels.
Modern video generation models a distribution over this tensor:
p(x_video | text, image, audio, ...)
The dominant approach today: tokenize the video into a (T', H', W') latent
grid with a 3D VAE, then either
(a) run diffusion in that latent space (Sora, Veo, MovieGen), or
(b) autoregress next-token in that latent space (CogVideo, Phenaki),
(c) or a hybrid.
The single hardest problem isn't the model — it's getting (T, H, W) all
big enough to be useful without compute exploding cubically.
Resources
- Image Generation guide — the hard prerequisite; do Phases 5–8 first
- Lilian Weng — What are Diffusion Models? — the canonical primer
- Sora Technical Report (OpenAI, 2024) — read once now, again at the end of Phase 6
- Stable Video Diffusion paper — practical entry point
Phase 1: Foundations — Video as a Tensor
Before models, understand the data. Video has properties that images don't, and they shape every architectural decision later.
Concepts to Learn
- Video shape conventions:
(B, T, C, H, W)(PyTorch) vs(B, C, T, H, W)(3D conv-friendly) — both common, easy to confuse - Frame rate (fps) — 24, 25, 30, 60; the same motion at different fps looks very different to a model
- Video codecs: H.264, H.265/HEVC, AV1, VP9 — most public video is heavily compressed; this matters
- Color spaces: YUV420 (native to most codecs) vs RGB (what your model wants)
- Containers vs codecs:
.mp4,.mov,.webmare containers; H.264, AV1 are the codecs inside them - Temporal redundancy: adjacent frames are nearly identical — both a problem (waste) and an opportunity (compression)
- Motion as a signal: optical flow, motion vectors (already inside the codec), scene cuts
- Data loading is brutal: a 1-minute 1080p clip is gigabytes uncompressed; decode-on-the-fly is mandatory
The Cost of a Single Clip
Resolution × fps × duration → raw tensor size
256×256, 8 fps, 2 sec → 16 frames × 256 × 256 × 3 = 3.1 MB (one clip!)
512×512, 24 fps, 5 sec → 120 frames × 512 × 512 × 3 = 94 MB
720p, 24 fps, 5 sec → 120 × 1280 × 720 × 3 = 333 MB
1080p, 24 fps, 10 sec → 240 × 1920 × 1080 × 3 = 1.5 GB
(All in fp32; halve for fp16 / bf16.)
→ A batch size of 8 at 1080p × 10s is 12 GB just for inputs.
This is why every video model uses a latent VAE.
Projects
| Project | Description | Difficulty |
|---|---|---|
| Video loader benchmark | Compare torchvision.io, decord, pyav, and ffmpeg-python on a folder of .mp4s; report decode time per clip | ⭐⭐ |
| Frame extractor | Sample N frames evenly from a clip; sample N frames at uniform fps; observe the difference for fast vs slow scenes | ⭐⭐ |
| Optical flow visualizer | Compute dense optical flow (RAFT, Farnebäck) between adjacent frames; color-visualize | ⭐⭐ |
| Scene-cut detector | Detect scene boundaries via histogram or feature distance; split a movie into clips | ⭐⭐ |
| Storage study | Take 100 clips, store as raw .npy, H.264 .mp4, and AV1 .webm; compare disk and decode speed | ⭐⭐ |
Sample Code: Loading a Video Clip with decord
import decord
import torch
from decord import VideoReader
decord.bridge.set_bridge("torch") # decode directly to torch tensors
vr = VideoReader("input.mp4", num_threads=2)
fps = vr.get_avg_fps()
total = len(vr)
# Sample 16 frames uniformly across the clip:
indices = torch.linspace(0, total - 1, 16).long().tolist()
frames = vr.get_batch(indices) # (16, H, W, 3), uint8
# Convert to (T, C, H, W) float in [-1, 1] for model input:
frames = frames.permute(0, 3, 1, 2).float() / 127.5 - 1.0
Key Insight
Every operational decision in video generation — frame rate, clip length, resolution, batch size — is a compute-vs-quality trade-off, and they all multiply. Doubling resolution = 4× compute. Doubling frame count = 2× compute. Doubling batch size = 2× compute. Doubling all three = 16×. This is why the field obsesses over latent compression and why nearly every published video model lists its exact (T, H, W) operating point as a design parameter, not an afterthought.
Resources
decord— the standard fast video loaderpyav— Python bindings to ffmpeg- RAFT optical flow paper
- FFmpeg documentation — you will need it
Phase 2: Classical and Early Neural Video Generation
The history matters — it's where you learn what doesn't work and why. Skim, don't memorize.
Concepts to Learn
- Frame interpolation — generating intermediate frames between two real ones (FILM, Super SloMo); a "video generation lite"
- Future frame prediction — given a few frames, predict the next ones (early benchmark: Moving MNIST)
- Video GANs:
- VGAN, TGAN — early attempts, low quality
- MoCoGAN — disentangled motion and content
- DVD-GAN — first plausible-quality short clips
- StyleGAN-V — applied StyleGAN's latent space to video
- Autoregressive pixel models: VideoPixelNetwork, slow but principled
- Recurrent approaches: ConvLSTM, PredRNN — used widely before transformers won
- The limits of these approaches: short, low-resolution, no text conditioning, mode collapse for GANs
Why These Mostly Stopped
Around 2022 the field made three near-simultaneous moves that made
older approaches obsolete:
1. Diffusion proved itself on images (DDPM → Imagen, Stable Diffusion).
2. Latent compression made it tractable for high resolution.
3. Text-image pretraining produced strong text conditioning for free.
Video inherited all three. GAN-based and pure-recurrent video generation
have not seriously competed with diffusion since ~2023.
Projects
| Project | Description | Difficulty |
|---|---|---|
| Moving MNIST predictor | Train a ConvLSTM to predict the next 10 frames given 10; classic baseline | ⭐⭐⭐ |
| FILM frame interpolation | Use a pretrained FILM to interpolate between two real frames; observe motion artifacts | ⭐⭐ |
| Tiny video GAN | Train a small video GAN on UCF-101 face crops — observe mode collapse firsthand | ⭐⭐⭐⭐ |
| Read MoCoGAN | Implement just the latent-disentanglement idea (content + motion latents) in a small VAE | ⭐⭐⭐ |
Key Insight
The pre-diffusion era of video generation is a graveyard of clever ideas that didn't scale. Most of them — disentangled motion latents, hierarchical generation, two-stream architectures — have since reappeared as components inside diffusion-based systems. The ideas were right; the training framework was wrong.
Resources
- Video Prediction Beyond Mean Square Error (Mathieu et al., 2015) — early classic
- MoCoGAN paper
- PredRNN paper
- FILM frame interpolation paper
Phase 3: Image-to-Video as a Stepping Stone
Before generating video from scratch, generate video from an image. This is the conceptually simplest version of the problem and the most practical to start training on.
Concepts to Learn
- The image-to-video (I2V) task — given one frame, produce a clip starting from it
- Conditioning on a still image: concatenate, cross-attend, or AdaLN modulation
- Motion buckets / motion scores — letting the user control "how much motion"
- Camera control — explicit camera trajectory as a side input (CameraCtrl, MotionCtrl)
- The two main outputs of an I2V model: short clips (2–5 sec) and animated stills (subtle motion, longer)
- Stable Video Diffusion (SVD) — the canonical open-weights I2V model; freezes a pretrained image latent diffusion model and adds temporal layers
- AnimateDiff — adds a "motion module" to any community Stable Diffusion checkpoint without retraining the base
Why I2V Is Easier Than T2V
Text-to-video (T2V): text → video (no anchor; must invent everything)
Image-to-video (I2V): image → video (first frame fixes appearance,
model only models motion)
Video-to-video (V2V): video → video (style transfer / restyling)
I2V's training signal is also cheaper: any video is automatically a
training example — first frame is the condition, the rest is the target.
No paired text needed.
Projects
| Project | Description | Difficulty |
|---|---|---|
| Run SVD inference | Generate 14-frame and 25-frame clips with Stable Video Diffusion from arbitrary images | ⭐⭐ |
| AnimateDiff tour | Plug AnimateDiff's motion module into a community SD 1.5 checkpoint; generate animated stills | ⭐⭐⭐ |
| Tiny I2V model | Add 3D temporal conv layers to a frozen SD 1.5 U-Net; fine-tune on 100k clips with the first frame as condition | ⭐⭐⭐⭐⭐ |
| Motion control | Train the above with a motion-score input; verify that low scores produce subtle motion | ⭐⭐⭐⭐ |
| Camera trajectory | Add Plücker-coordinate camera embeddings to an I2V model; verify pan/zoom controllability | ⭐⭐⭐⭐⭐ |
Sample Code: Inflating a 2D Conv to a (2+1)D Conv
import torch
import torch.nn as nn
class Conv2Plus1D(nn.Module):
"""Common pattern: factorize a 3D conv into spatial + temporal."""
def __init__(self, in_c, out_c, k_s=3, k_t=3):
super().__init__()
self.spatial = nn.Conv3d(in_c, out_c, kernel_size=(1, k_s, k_s),
padding=(0, k_s // 2, k_s // 2))
self.temporal = nn.Conv3d(out_c, out_c, kernel_size=(k_t, 1, 1),
padding=(k_t // 2, 0, 0))
def forward(self, x):
# x: (B, C, T, H, W)
return self.temporal(self.spatial(x))
# Init temporal as identity (zeros + identity in middle) so a 2D-pretrained
# model passes video through unchanged at the start of training:
def init_temporal_as_identity(conv):
nn.init.zeros_(conv.weight)
middle = conv.kernel_size[0] // 2
for c in range(min(conv.in_channels, conv.out_channels)):
conv.weight.data[c, c, middle, 0, 0] = 1.0
Key Insight
The dominant pattern across nearly all 2022–2024 video diffusion models is temporal inflation: take a pretrained image model, insert temporal layers initialized as identity, and fine-tune. This preserves the pretrained spatial knowledge while learning motion on top. AnimateDiff, ModelScope, Stable Video Diffusion, and Make-A-Video all use variants of this trick. The 2024–2026 frontier (Sora, Veo, Movie Gen) abandons it in favor of training spatiotemporal models from scratch — but the inflation pattern is still the right starting point for any custom model.
Resources
Phase 4: Video Diffusion — The Modern Foundation
This is where the field is. Master this phase deeply; the next two are refinements.
Concepts to Learn
- Pixel-space vs latent-space video diffusion — pixel space is impractical at any meaningful resolution; latent space is the default
- 3D U-Nets — the natural generalization of 2D U-Nets to (T, H, W)
- (2+1)D factorization — separate spatial and temporal layers; cheaper and easier to initialize from 2D pretrained weights
- Temporal attention — pure attention along the time axis at each spatial position; the modern default for high-quality models
- Spatiotemporal attention — joint attention over
(T × H × W); quadratic in sequence length and very expensive - Video-CFG: classifier-free guidance for video; balancing text alignment against temporal coherence
- Cascaded diffusion for video: low-res video → super-resolution → frame interpolation (Imagen Video, Make-A-Video used this; modern models do it less)
- Noise schedules for video — empirically need lower SNR (more noise) than images at the same resolution
- Joint image-video training — co-train on still images (treated as 1-frame video) to maintain image quality
The 3D U-Net Block
Input: (B, C, T, H, W)
┌─────────────────────────────────────────┐
│ Spatial conv (1×3×3) ──┐ │ inflated 2D conv
│ Spatial self-attention ──┤ │ shared with image weights
│ Cross-attention (text) ──┘ │
│ │
│ Temporal conv (3×1×1) ──┐ │
│ Temporal self-attention ──┤ │ new, initialized as identity
│ ──┘ │
└─────────────────────────────────────────┘
Modern variant: replace all "conv" with "transformer block" → DiT (Phase 6).
Projects
| Project | Description | Difficulty |
|---|---|---|
| Inflate SD to a video model | Take a Stable Diffusion 1.5 U-Net, inflate to 3D (insert temporal conv + temporal attention), train on a small video dataset | ⭐⭐⭐⭐⭐ |
| Joint image-video training | Co-train your inflated model on 90% images, 10% video; compare to video-only training on quality and motion | ⭐⭐⭐⭐ |
| Temporal CFG study | Vary CFG strength independently for text and for image conditioning; observe trade-offs | ⭐⭐⭐ |
| Cascaded super-resolution | Build a small "low-res video → high-res video" diffusion super-resolution model | ⭐⭐⭐⭐ |
| Compare attention patterns | (2+1)D vs full spatiotemporal vs windowed spatiotemporal; measure FLOPs and quality | ⭐⭐⭐⭐ |
Sample Code: A (2+1)D Transformer Block for Video
import torch
import torch.nn as nn
from einops import rearrange
class Video2Plus1DBlock(nn.Module):
def __init__(self, dim, n_heads):
super().__init__()
self.spatial_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
self.temporal_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
self.norm_s = nn.LayerNorm(dim)
self.norm_t = nn.LayerNorm(dim)
# Zero-init the temporal-attention output projection so the model
# behaves as a still-image model at initialization:
nn.init.zeros_(self.temporal_attn.out_proj.weight)
nn.init.zeros_(self.temporal_attn.out_proj.bias)
def forward(self, x):
# x: (B, T, S, D) where S = H*W spatial positions
B, T, S, D = x.shape
# Spatial: each frame independently attends within itself
h = rearrange(x, "b t s d -> (b t) s d")
h_norm = self.norm_s(h)
h = h + self.spatial_attn(h_norm, h_norm, h_norm, need_weights=False)[0]
# Temporal: each spatial position attends across time
h = rearrange(h, "(b t) s d -> (b s) t d", b=B, t=T)
h_norm = self.norm_t(h)
h = h + self.temporal_attn(h_norm, h_norm, h_norm, need_weights=False)[0]
return rearrange(h, "(b s) t d -> b t s d", b=B, s=S)
Key Insight
The single biggest design lever in video diffusion is what gets attention along the time axis. Pure (2+1)D — spatial attention then separate temporal attention — is cheap and works surprisingly well, which is why it dominated 2023–2024. Full spatiotemporal attention is much more expressive but quadratic in T×H×W, which gets prohibitive fast. Modern Sora-class models pay this cost using 3D latent compression to shrink T×H×W aggressively before attention runs. The trick is moving the expense from attention into the VAE.
Resources
- Video Diffusion Models (Ho et al., 2022) — first principled paper
- Imagen Video paper — cascaded approach, lots of useful detail
- Make-A-Video paper
- Align Your Latents (Blattmann et al., 2023) — the latent video diffusion paper
Phase 5: Latent Video Diffusion and Video Tokenizers
The single most important enabler of modern video generation. If you understand the VAE in image-gen, this is the natural extension — but the engineering is much harder.
Concepts to Learn
- Why latent space is non-negotiable — see Phase 1's storage table
- 2D VAEs for video — run a 2D image VAE per frame; works, but no temporal compression
- 3D VAEs for video — compress in time as well as space; the modern default. Typical compression ratios: 4× temporal, 8× spatial → 32–128× total
- Causal 3D VAEs — first frame encoded with itself only, later frames encoded with causal context. Lets the same model handle still images and video
- Reconstruction quality matters more than for images — temporal flicker in the VAE shows up directly as motion artifacts
- Discrete vs continuous latents:
- Continuous (VAE) for diffusion models
- Discrete (VQ-VAE, FSQ, LFQ) for autoregressive/transformer-style models — MagViT-v2 is the strongest open recipe
- Joint training with images — same caveat as the U-Net case; helps preserve still-image quality
- Two-stage training: train the VAE first, freeze it, then train the diffusion model in its latent space
Latent Compression in Numbers
Raw clip: 120 frames × 720p × 3 channels = 333 MB
After 3D VAE:
Spatial: 8×8 compression → 90×128 per frame
Temporal: 4× compression → 30 frames
Channels: e.g., 16 → (30, 90, 128, 16) ≈ 21 MB
That's 16× less data — and crucially, diffusion now runs over 30 latent
"frames" instead of 120. Memory and compute both drop dramatically.
Projects
| Project | Description | Difficulty |
|---|---|---|
| Frame-by-frame 2D VAE | Use Stable Diffusion's VAE on video frames independently; observe temporal flicker in reconstructions | ⭐⭐ |
| Train a small 3D VAE | (B, 3, T, H, W) → (B, C, T', H', W'); compress 4× in time, 8× in space; train on UCF-101 | ⭐⭐⭐⭐⭐ |
| Causal 3D VAE | Modify the above to causal in time so it handles single images correctly (T=1 → T'=1) | ⭐⭐⭐⭐ |
| MagViT-v2-style tokenizer | Train a discrete video tokenizer using FSQ or LFQ quantization; measure reconstruction FID | ⭐⭐⭐⭐⭐ |
| Diffusion on latents | Plug the 3D VAE in front of a small diffusion model from Phase 4; compare training speed and quality to pixel-space | ⭐⭐⭐⭐ |
Key Insight
The 3D VAE is the unsung hero of modern video generation. Sora's "patches" — its much-discussed innovation — are just patches in the latent space of a 3D VAE. The trick is that a powerful enough VAE compresses video by ~100× while preserving the information that matters for generation, so the diffusion model can train on what used to be a 100-GB clip as if it were a 1-GB clip. Every "Sora-class" model has spent serious effort on its VAE; the best ones have spent at least as much effort there as on the diffusion backbone.
Resources
- MagViT-v2 paper (Yu et al., 2023) — the strongest open recipe for video tokenization
- OpenSora's VAE — open implementation
- CogVideoX paper — recent strong open VAE + DiT
- VideoGPT paper — original discrete-token approach
Phase 6: Diffusion Transformers (DiT) and Sora-Class Models
The current frontier. As of 2026, the strongest video models are all DiT-based, trained on latent video tokens, with text conditioning via cross-attention or token concatenation.
Concepts to Learn
- DiT (Diffusion Transformer) — Peebles & Xie's paper that replaced the U-Net with a pure transformer for image diffusion; the foundation
- Patchification of latent video — take the 3D-VAE latents
(T', H', W', C), patchify into a 1D sequence of spatiotemporal tokens - AdaLN-Zero — the modulation scheme that DiT uses for conditioning; surprisingly robust
- 3D RoPE (Rotary Position Embedding) — extends 2D RoPE to time; the standard now
- Sora's "patches" design — patches at variable size, allowing flexible resolution and aspect ratio at inference
- Rectified Flow / Flow Matching — modern replacement for DDPM training that's better-behaved at scale (used by SD3, Flux, and most 2024+ video models)
- MMDiT (Multi-Modal DiT) — the SD3 architecture: text and image tokens share attention layers; extended to video in Movie Gen and similar
- Open-weights frontier (these move every few months; check before assuming a leader):
- Wan 2.1 / 2.2 (Alibaba) — among the strongest open releases; broad ecosystem of LoRAs and control adapters
- HunyuanVideo (Tencent) — large-scale open release with a strong VAE
- CogVideoX (THUDM) — Tsinghua's open DiT + 3D VAE; a clean reference implementation
- Mochi 1 (Genmo) — high-quality open with an aggressive VAE
- LTX-Video (Lightricks) — designed for near-real-time generation; good for latency experiments
- OpenSora (HPC-AI Tech) and Open-Sora-Plan (PKU) — full open Sora-style replicas, well-documented for learning
- Closed frontier (capabilities and names change fast):
- Sora / Sora 2 (OpenAI) — Sora 2 adds synchronized audio and stronger physical consistency
- Veo 2 / Veo 3 (Google DeepMind) — Veo 3 generates native synchronized audio (dialogue, SFX), a notable shift
- Movie Gen (Meta) — the most detailed open description of a frontier-scale recipe, including joint audio
- Kling 2.x, Hailuo / MiniMax, Runway Gen-4, Luma Dream Machine, Pika — commercial offerings
Sora-Style Architecture, Sketched
Text prompt ──► T5 / CLIP text encoder ──► text tokens
Video latent: (T'=30, H'=90, W'=128, C=16) from 3D VAE
│
▼ patchify (e.g., 2×2×2 patches)
Patches: 15 × 45 × 64 = 43,200 video tokens, each of dim P²×P×C×P → projected to D
│
▼ concat or cross-attend with text tokens
│
▼ MANY transformer blocks (e.g., 40–80 blocks, hidden dim 1500+)
│ each with: 3D-RoPE, self-attention over all video+text tokens,
│ AdaLN modulation from timestep & conditioning, MLP
│
▼ predict noise (or velocity, for rectified flow)
│
▼ DDPM/flow-matching loss on the (B, N_tokens, D) prediction
At inference:
start from Gaussian noise in latent space
denoise over ~30–50 steps (flow matching with few steps)
un-patchify, decode through 3D VAE → pixel video
Projects
| Project | Description | Difficulty |
|---|---|---|
| Implement DiT for video | Take a published DiT image implementation; extend to (T, H, W) patches and 3D RoPE; train on a small video dataset | ⭐⭐⭐⭐⭐ |
| Flow matching from scratch | Replace DDPM with rectified flow / flow matching in a small video DiT; compare convergence | ⭐⭐⭐⭐ |
| Read and reproduce OpenSora | Run inference on a pretrained OpenSora checkpoint; modify one component (e.g., the VAE), retrain | ⭐⭐⭐⭐⭐ |
| MMDiT for video | Implement the SD3-style joint text-video attention; verify text adherence improves | ⭐⭐⭐⭐⭐ |
| Variable resolution | Modify your DiT to handle arbitrary (T, H, W) at inference (Sora's claim); test on aspect ratios it didn't see at training | ⭐⭐⭐⭐⭐ |
Key Insight
The shift from U-Net to DiT in image gen took ~18 months to play out fully (~2022→2024). The same shift in video gen is happening now, faster, because the lesson has been learned. Any video model started in 2025 onward almost certainly uses a transformer backbone, latent input, and flow matching. If you're learning the field for the first time, you can largely skip U-Net-based video models — they're being deprecated in real time. Understand them historically; build on DiT.
Resources
- DiT paper (Peebles & Xie, 2022) — foundational
- Sora technical report
- Movie Gen technical report (Meta, 2024)
- CogVideoX paper
- HunyuanVideo paper
- Rectified Flow paper (Liu et al., 2022)
- Flow Matching for Generative Modeling (Lipman et al., 2022)
- OpenSora and Open-Sora-Plan
Phase 7: Conditioning, Control, and Editing
Generating a video is one thing; generating the video you want is another. This phase is about everything that wraps the core model.
Concepts to Learn
- Text conditioning — T5 vs CLIP vs both (Imagen/SD3-style "use two encoders"); long-prompt handling
- Image conditioning — first-frame conditioning (I2V), last-frame, both, keyframes
- Video-to-video — restyling, depth-conditioned, pose-conditioned (ControlNet-Video)
- Camera control — explicit camera pose embeddings (Plücker coordinates) or motion-bucket conditioning
- Motion control — bounding-box trajectories, sparse motion strokes, dense motion maps
- Identity preservation — keeping a specific character or object consistent (DreamBooth-Video, ID-Animator)
- Audio-conditioned video — talking-head models (SadTalker, EMO, V-Express, Hallo), sync to lip motion
- Video editing:
- Inversion-based editing — invert the video into latent noise, edit, denoise
- Token Merging for Video — runtime acceleration
- Rerender / TokenFlow — style transfer with temporal consistency
- Negative prompts for video — what unwanted artifacts you can subtract
A Taxonomy of Video Generation Tasks
INPUT → TASK EXAMPLES
───────────────────────── ──────────────────── ────────────────────────
text → T2V Sora, Veo, Kling
text + image → T+I2V (frame-locked) SVD-XT, Kling I2V
text + first+last frame → keyframe interpolation Frame Genie, Wan-FLF2V
image → I2V (motion only) SVD, animated stills
video + text → V2V restyle Rerender, TokenFlow
video + pose/depth → controlled V2V AnimateAnyone,
ControlNet-Video
audio + image → talking head EMO, Hallo, V-Express
text + camera trajectory → cinematic T2V MotionCtrl, CameraCtrl
text + object trajectory → trajectory-controlled Boximator, DragAnything
Projects
| Project | Description | Difficulty |
|---|---|---|
| Long-prompt handling | Train or fine-tune with T5-XXL prompts (up to 256 tokens); compare against CLIP-L conditioning on adherence | ⭐⭐⭐⭐ |
| ControlNet-Video | Adapt ControlNet to a video diffusion model; condition on depth maps across all frames | ⭐⭐⭐⭐ |
| Camera control | Add Plücker-coordinate camera embeddings; verify pan / zoom / orbit work | ⭐⭐⭐⭐ |
| Talking head | Run EMO or Hallo on a portrait + audio clip; fine-tune for a specific speaker | ⭐⭐⭐⭐ |
| Video inversion + edit | Invert a real clip into latent noise; replace an object via prompt edit | ⭐⭐⭐⭐⭐ |
| LoRA for video | Train a video LoRA on ~50 clips of a specific style or character | ⭐⭐⭐⭐ |
Key Insight
In image generation, ControlNet and its successors made the difference between "generate something cool" and "generate exactly what I want." Video is following the same trajectory but several years behind. The 2026 frontier in video isn't just bigger models — it's better control surfaces: camera trajectories, character consistency across cuts, dialogue lip sync, scene-level keyframe control. Whoever ships the "ControlNet moment" for video at the right level of abstraction defines the next generation of commercial tools.
Resources
- ControlNet paper — for the original idea
- AnimateAnyone paper — character-consistent animation
- MotionCtrl paper
- EMO paper — audio-driven talking heads
- TokenFlow paper — consistent video editing
- Boximator paper
Phase 8: Long-Form and Consistent Video
The hardest open problem in video generation. Today's best models produce 5–10 seconds of beautiful video and then fall apart. Closing the gap to minute-long, story-coherent generation is the active frontier.
Concepts to Learn
- Why long video is hard:
- Compute scales at least linearly with length, usually worse
- Drift: small errors compound; characters morph, scenes contradict themselves
- Memory: ~30s of latent tokens is already in the 100k–1M range — context window pain
- No long paired text-video data at scale
- Sliding-window approaches — generate overlapping clips, blend in latent or pixel space (FreeNoise, Gen-L-Video)
- Hierarchical generation:
- Generate keyframes first, then fill in between
- Storyboard / shot decomposition (think a director's storyboard, not raw video)
- Autoregressive video models — predict the next chunk of frames conditioned on the previous chunk; long but expensive
- Diffusion Forcing — assign each frame its own noise level so a model can denoise and roll out autoregressively at once; the bridge between full-sequence diffusion and next-frame autoregression
- Autoregressive distillation for streaming — distill a bidirectional diffusion teacher into a causal, few-step student that emits frames as it goes (CausVid, Self-Forcing); the current recipe for real-time/infinite-length generation
- Anchor frames / scene tokens — explicit memory of "this character looks like X"
- Streaming generation — emit frames as you generate them (StreamingT2V, CausVid, Self-Forcing)
- Multi-shot / multi-scene — VideoTetris, DreamFactory, MovieDreamer; combine LLM-planned shot lists with per-shot generation
Two Architectural Approaches to Length
A. Sliding window with overlap (post-hoc):
[clip 1: frames 0-15]
[clip 2: frames 8-23] ← 8 frames of overlap, blended
[clip 3: frames 16-31]
...
+ Cheap, works with any existing T2V model
- Long-range coherence is whatever the overlap can carry forward
B. Hierarchical (designed-in):
text → LLM "director" → shot list (S1, S2, S3, ...)
↓ each shot, per-shot:
keyframes → fill-in T2V model → 5-sec clip
↓ stitch shots
+ consistency model to harmonize identity across shots
+ Can in principle produce minutes of coherent story
- Three or four separate models; complex to train and orchestrate
Projects
| Project | Description | Difficulty |
|---|---|---|
| Sliding-window T2V | Take an open T2V model; generate 30 seconds by overlapping 5-sec clips; blend in latent space | ⭐⭐⭐⭐ |
| Keyframe interpolation | Generate 4 keyframes 5 sec apart, then use an I2V or interpolation model to fill in | ⭐⭐⭐⭐ |
| Character consistency | Use a reference-image encoder (IP-Adapter / character LoRA) across multiple shots; measure drift | ⭐⭐⭐⭐⭐ |
| LLM shot planner | Use a small LLM to expand "a knight rescues a princess" into a JSON shot list; generate each shot; evaluate coherence | ⭐⭐⭐⭐⭐ |
| Streaming T2V | Implement chunk-by-chunk generation with a cached KV state across chunks; measure latency vs quality | ⭐⭐⭐⭐⭐ |
Key Insight
Long-form video generation has the same shape as the long-context problem in LLMs three years ago — exciting demos, brittle outputs, no clear winning architecture, and a half-dozen credible bets. Sliding window, hierarchical planning, and autoregressive generation are not converging the way DiT converged for short video. Expect this to be the dominant frontier topic through 2026–2027.
Resources
- FreeNoise paper
- StreamingT2V paper
- Diffusion Forcing (Chen et al., 2024) — per-token noise levels for autoregressive rollout
- CausVid — Causal video distillation (2024) — fast autoregressive/streaming generation
- Self-Forcing (2025) — closing the train/inference gap for autoregressive video
- VideoTetris paper
- MovieDreamer paper
Phase 9: World Models and Interactive Video
Where video generation stops being "I make pretty clips" and becomes "I simulate the world."
Concepts to Learn
- What a world model is — a generative model that, given a state and an action, predicts the next state. A video model conditioned on actions is a world model. This phase owns the generative side; using the model as an environment to learn a policy is RL Phase 6 (Model-Based RL)
- The Dreamer line, in one sentence — Hafner et al.'s DreamerV1/V2/V3 learn a latent world model and train a policy by imagining rollouts in it. We borrow the generative idea (predict the next latent given an action); the policy-learning loop and the RL objective are covered in the RL guide
- Genie, Genie 2 (DeepMind) — playable, action-conditioned video models trained on web video
- GameNGen (Google) — a real-time playable Doom simulation, entirely neural
- GAIA-1 / GAIA-2 (Wayve) — driving world models
- NVIDIA Cosmos — a world-foundation-model platform aimed at training and evaluating embodied/robot policies; the bridge to Robotics
- OASIS / Decart — open neural Minecraft
- Latent action models — inferring actions from unlabeled video (so you can train world models without paired actions)
- Real-time constraints — < 50 ms/frame for interactivity. Forces distillation, caching, or smaller models — the same autoregressive-distillation toolkit as Phase 8
- Connection to physical RL and robotics — world models are policy-rollouts-as-video; the same model can serve as a simulator for an RL agent (RL Phase 6) or as a learned simulator for an embodied policy (Robotics Phase 9)
- Connection to multimodal — a fully general world model is multimodal: text in, video out, with audio, actions, and physics. Joint cross-modal understanding is Multimodal Learning's territory
The World Model Loop
┌─────────────────────────┐
│ │
(state s_t) │ World model │
(action a_t)──────────►│ p(s_{t+1} | s_t, a_t)│──────► (frame s_{t+1})
│ │
└─────────────────────────┘
▲
│
(s_{t+1} fed back as next s_t)
Run this in a loop, with actions from a human (interactive game),
an RL policy (sim-for-RL), or a planner (model-based control).
A world model is a video generator that also takes actions —
or equivalently, a video generator IS a world model when "action"
is the empty string.
Projects
| Project | Description | Difficulty |
|---|---|---|
| Action-conditioned video | Take a small video diffusion model; add a discrete-action input (e.g., 4 game actions); train on a simple game's recorded play | ⭐⭐⭐⭐⭐ |
| GameNGen reproduction (mini) | Train an action-conditioned model on a simpler game (Atari, GridWorld) and play it interactively | ⭐⭐⭐⭐⭐ |
| Latent action inference | Train a model to infer the latent action between two adjacent frames in unlabeled video (Genie-style) | ⭐⭐⭐⭐⭐ |
| World model for RL | Use a learned world model to roll out trajectories; train a policy in the dream (DreamerV3-light) | ⭐⭐⭐⭐⭐ |
| Real-time latency hunt | Distill a 30-fps diffusion video model into a 4-step (or 1-step) consistency model; measure ms/frame | ⭐⭐⭐⭐ |
Key Insight
World models are the convergence point of three lines of research that are usually taught separately: video generation, model-based RL, and simulation. Each of those communities approaches the same object from a different angle — generation people care about visual fidelity, RL people care about action conditioning and rollouts, simulation people care about physical realism. The 2025–2026 frontier is increasingly the same model used in all three roles. This guide owns the visual-fidelity, action-conditioning, and rollout-generation side; the control side lives in RL Phase 6 and Robotics Phase 9. If you've completed those guides and this one, you're well-positioned to work at the intersection.
Resources
- Genie 2 (DeepMind) — start here
- GameNGen paper (Google, 2024) — playable Doom
- NVIDIA Cosmos (2025) — world-foundation-model platform for embodied AI
- DreamerV3 paper — the policy-learning side (see also RL Phase 6)
- GAIA-1 paper — driving world model
- OASIS (Decart) blog
- Latent Action Pretraining (Bruce et al.)
Phase 10: Training at Scale, Evaluation, and Frontier Topics
This last phase is the operational reality of video generation: data, compute, eval, and what's still open.
Training at Scale
- Data sources:
- Public: HD-VILA, WebVid (deprecated), Panda-70M, OpenVid-1M, Koala-36M
- Proprietary: most strong models train on private licensed video libraries
- Caption generation — public video has terrible captions; recaption with a strong VLM (Qwen2-VL, LLaVA, GPT-4o) before training. This is the single highest-leverage data trick
- Aspect-ratio bucketing — train on multiple aspect ratios together for variable-resolution inference
- Clip extraction — scene detection + filtering (motion score, aesthetic score, OCR-text score)
- Curriculum — start at low resolution and short duration, scale up gradually
- Compute: a frontier text-to-video model is on the order of 10²⁵–10²⁶ FLOPs of training; an open replication is 10²³–10²⁴
Evaluation
The evaluation problem in video generation is worse than in image generation, which is already bad.
- Automatic metrics:
- FVD (Fréchet Video Distance) — the standard, but criticized for poor correlation with human judgment
- CLIPScore-Video, VideoCLIP — text-video alignment
- VBench — comprehensive benchmark suite; the closest thing to a standard
- EvalCrafter — open evaluation harness
- Human evaluation — still the gold standard; pairwise comparisons, win rates
- Physical correctness — does water behave like water? Do objects persist when occluded? Largely unmeasured
- Sora's own evaluation criteria mention things like "object permanence" and "world consistency" — these still don't have clean benchmarks
Frontier Topics
- Real-time and streaming video generation — distillation to 1–4 steps, consistency models, autoregressive caching (CausVid, Self-Forcing); LTX-Video-style architectures built for latency. Increasingly the difference between a demo and a product
- Native audio-video joint generation — as of 2025 this has gone from research to product: Veo 3 generates synchronized dialogue and SFX, Sora 2 adds audio, Movie Gen describes a joint recipe. Native AV models are replacing post-hoc dubbing
- Multi-character, multi-scene narratives — see Phase 8
- Physical realism — making fluid behave like fluid, deformable objects deform correctly
- 3D-consistent video — output that's consistent under camera change (videos that can be re-rendered from a new viewpoint); bridges to NeRF / 3D Gaussian Splatting
- Editable / re-renderable output — output something more structured than pixels (e.g., a 3D scene + camera path)
- Safety: deepfake detection, watermarking (e.g., SynthID), content moderation, consent
- Interactive / playable — see Phase 9
- Long-context multimodal video — feeding hours of video to a VLM for understanding; the inverse direction, but adjacent
Projects
| Project | Description | Difficulty |
|---|---|---|
| Run VBench end to end | Evaluate an open T2V model on the full VBench suite; reproduce a leaderboard number | ⭐⭐⭐ |
| Recaption a dataset | Take 100k clips with bad captions, recaption with a strong VLM, train a small model on each — compare quality | ⭐⭐⭐⭐ |
| Aspect-ratio bucketing | Implement bucketed batching for variable aspect ratios; observe quality improvement on portrait/wide test sets | ⭐⭐⭐ |
| Consistency-model distillation | Distill a 50-step video diffusion model into a 4-step student; measure speed and quality loss | ⭐⭐⭐⭐⭐ |
| Watermarking | Add invisible watermarking to your model's outputs; verify with a detector | ⭐⭐⭐⭐ |
| Physical-plausibility probe | Build 50 trick prompts (water flowing uphill, dropped objects floating); evaluate open models | ⭐⭐⭐ |
Key Insight
Video generation in 2026 is where text generation was around 2021 — extraordinary demos, frustrating gap to product, two or three competing architectural bets, and absolutely no consensus on evaluation. The compute frontier is rapidly closing in on the data frontier: training a Sora-class model is no longer compute-impossible for many organizations, but obtaining the licensed long-form video to train it on is now the harder problem. If you're entering the field, the highest-leverage skills are not architecture (it's converging on DiT) — they're data engineering, evaluation, and control surfaces.
Resources
- VBench paper and leaderboard
- EvalCrafter paper
- Movie Gen technical report — the most detailed open description of a frontier-scale system
- Panda-70M dataset
- SynthID for video
Suggested Timeline
| Phase | Duration | Outcome |
|---|---|---|
| 0. Prerequisites | 0–2 weeks | Image diffusion + multimodal foundations solid |
| 1. Foundations | 1 week | Comfortable loading and decoding video data |
| 2. Classical | 1 week | Familiar with the pre-diffusion approaches; skim only |
| 3. I2V | 1–2 weeks | Built or fine-tuned an image-to-video model |
| 4. Video diffusion | 3 weeks | Inflated a 2D U-Net to 3D and trained on small video |
| 5. Latent + VAE | 2–3 weeks | Trained a 3D VAE; diffusion runs in its latent space |
| 6. DiT | 3–4 weeks | Implemented or ran a DiT-based video model; understand flow matching |
| 7. Conditioning | 2 weeks | Added at least two control signals (camera, depth, character) |
| 8. Long-form | 2–3 weeks | Sliding window or hierarchical pipeline working end to end |
| 9. World models | 2–3 weeks | Trained an action-conditioned model; can roll out interactively |
| 10. Scale + eval | Ongoing | Real benchmark evaluation; data pipeline understood |
Total to "comfortable practitioner": ~4–5 months of focused study. Frontier-research-comfortable: closer to a year.
Key Advice
- Don't try pixel-space. Past 64×64 it's wasted compute. Latent space is non-negotiable.
- Inflate first, train from scratch later. Your first video model should reuse pretrained image weights. Going scratch is a frontier-lab activity.
- Joint image-video training. Co-training preserves still-image quality and dramatically helps data efficiency.
- Recaption your data. Web alt-text and YouTube descriptions are terrible. A strong VLM recaptioning your training video is the highest-leverage single change you can make.
- The VAE matters as much as the diffusion model. Bad reconstructions cap your output quality. Spend serious effort here.
- Profile decoding. Most video-training pipelines are bottlenecked on video decoding, not on the GPU. Use
decord, prefer keyframe-aligned sampling, cache when possible. bf16everywhere on Ampere+. Same as elsewhere;float16GradScalers are unnecessary friction.- Aspect ratios matter. Train on multiple bucket ratios; resist the urge to center-crop everything to square.
- Evaluate with a suite. Don't trust a single FVD number. Use VBench plus human eval, and report failures honestly.
- Watch the open-source frontier. OpenSora, CogVideoX, HunyuanVideo, Mochi, Wan — these move every few months. The state of "what an individual researcher can run" changes faster here than anywhere else in ML.
Common Pitfalls to Avoid
- ❌ Trying to train pixel-space diffusion at meaningful resolution
- ❌ Using a 2D VAE per frame and being surprised by temporal flicker
- ❌ Ignoring the VAE and treating it as a fixed black box
- ❌ Training only on video and watching still-image quality collapse
- ❌ Loading video with PIL frame by frame instead of
decord - ❌ Storing decoded frames as fp32 on disk
- ❌ Using CLIP-L for text conditioning when prompts are >77 tokens (use T5)
- ❌ Reporting only FVD with no human eval
- ❌ Trying to generate >10 seconds without a longform strategy
- ❌ Forgetting to validate frame-by-frame consistency, not just per-frame quality
Additional Resources
Books and Long-Form Reading
- Lilian Weng — What are Diffusion Models?
- The Annotated Diffusion Model (Hugging Face)
- Sander Dieleman's blog — best long-form thinking on diffusion
Key Papers, Chronologically
| Year | Paper | Contribution |
|---|---|---|
| 2022 | Video Diffusion Models | First principled paper |
| 2022 | Make-A-Video | Text-conditioned video, image-pretrain trick |
| 2022 | Imagen Video | Cascaded high-res video |
| 2023 | Align Your Latents | Latent video diffusion |
| 2023 | Stable Video Diffusion | Open I2V baseline |
| 2023 | AnimateDiff | Motion module, community SD |
| 2023 | MagViT-v2 | Best discrete video tokenizer |
| 2024 | Sora technical report | DiT + variable patches |
| 2024 | GameNGen | Real-time neural Doom |
| 2024 | CogVideoX | Strong open DiT + VAE |
| 2024 | Movie Gen | Frontier-scale recipe, open description, joint audio |
| 2024 | HunyuanVideo | Large open release |
| 2024 | Genie 2 | Foundation world model |
| 2024 | Diffusion Forcing | Per-token noise levels; AR rollout meets diffusion |
| 2024 | CausVid | Causal distillation for streaming generation |
| 2025 | NVIDIA Cosmos | World-foundation-model platform for embodied AI |
| 2025 | Self-Forcing | Closes the AR train/inference gap; real-time long video |
Tools You Should Know
decord— fast video loadingdiffusers(Hugging Face) — for inference and quick prototypingOpenSora/CogVideoX/HunyuanVideo— open training stacksVBench— evaluation harnesscomfyui— for rapid pipeline prototyping with open modelsffmpeg— you will need it
Communities
- Hugging Face forums and Discord
- r/StableDiffusion — practical, open-source-focused
- r/MachineLearning
- Twitter/X — video gen moves on Twitter faster than anywhere
Quick Start Checklist
- Can load a video clip with
decordand explain frame sampling vs uniform-in-time - Can explain why latent space is mandatory for video generation
- Have run inference on Stable Video Diffusion and AnimateDiff
- Have inflated a 2D U-Net to a 3D (or (2+1)D) model and trained it on small video
- Have trained or used a 3D VAE; understand causal video VAEs
- Have read the Sora technical report end to end
- Have implemented or run a DiT-based video model
- Understand flow matching as well as DDPM
- Can add a control signal (camera, depth, pose) to a video model
- Have generated >10 sec of video with a longform strategy
- Have evaluated a model on VBench (or a substantial subset)
- Have at least skimmed an action-conditioned world model paper (Genie 2 or GameNGen)
License
MIT License. See the LICENSE file for details.