Video Generation: From Beginner to Advanced

A comprehensive guide to understanding and building video generation systems — from the fundamentals of treating video as a spatiotemporal signal, through latent video diffusion and Diffusion Transformers (DiT), to long-form generation, world models, and the frontier of real-time interactive video.

Video generation = image generation + time. That one sentence is both true and dangerously misleading. The "+ time" introduces problems that have no image-gen analog: temporal consistency, motion priors, enormous compute (a 5-second 720p clip at 30 fps is 150 images), and the brutal scarcity of high-quality paired video-text data. This guide is about how the field solved (and is still solving) those problems.

Scope and boundaries

This guide owns the generative modeling of video — the moment you add a time axis to image generation and have to model motion, temporal consistency, and the compute explosion that comes with both. To keep the AI Learning Guides mutually exclusive and collectively exhaustive (MECE), it deliberately stops at a few borders and links forward to the guide that owns each one.

In scope — this guide owns these topics:

  • Video as a spatiotemporal signal — shapes, frame rate, codecs, the cost model, and why latent compression is non-negotiable
  • The time axis on top of diffusion — temporal layers, (2+1)D vs full spatiotemporal attention, temporal inflation of pretrained image models
  • 3D / causal video VAEs and discrete video tokenizers — the compressors that make video diffusion tractable
  • Video DiTs and Sora-class models — patchified latent video, 3D RoPE, flow matching applied to video
  • Image-to-video, video-to-video, and video-specific control — first-frame/keyframe conditioning, camera and motion control, talking heads, video editing
  • Long-form and consistent video — sliding-window, hierarchical, autoregressive, and streaming generation
  • Generative world models and interactive/playable video — action-conditioned video as a simulator (Genie, GameNGen, driving/embodied world models), from the generation side
  • Native audio-video joint generation, video-data engineering, and video evaluation — the parts that differ from the image recipe

Out of scope — deferred to the owning guide:

  • Diffusion/VAE/GAN/flow fundamentals, U-Net and DiT mechanics, score/EDM theory, image tokenizers (VQ-VAE, FSQ, LFQ) → Image Generation. This guide assumes all of it and only adds the time axis. If DDPM, latent diffusion, classifier-free guidance, or flow matching feel fuzzy, fix that there first.
  • CLIP/T5 text-encoder training, VLMs, any-to-any models, and video understanding → Multimodal Learning. We use a frozen text encoder and recaption with a VLM, but training those, and encoding video for joint reasoning, lives there. We own video synthesis; they own video understanding.
  • Model-based RL, the Dreamer policy-learning loop, planning/MPC in a learned model, and training a policy "in the dream" → RL Phase 6 and RL Phase 10. We build the action-conditioned video generator; using it as an environment for control is theirs.
  • Vision-Language-Action robot policies, sim-to-real, and embodied control → Robotics Phase 8 and Robotics Phase 9.
  • Serving, batching, and inference-latency engineering for deployed video models → Inference Systems; kernel-level performance and quantization → AI Hardware. We discuss step-distillation and real-time generation as modeling problems and link out for the systems side.
  • Tensor, autograd, mixed-precision, distributed-training, and training-loop fundamentals → PyTorch Deep Dive.

When this guide touches an out-of-scope topic, it does so only to the depth needed to make a video-generation modeling decision, and it links to the owning guide.


Table of Contents

  1. Phase 0: Prerequisites
  2. Phase 1: Foundations — Video as a Tensor
  3. Phase 2: Classical and Early Neural Video Generation
  4. Phase 3: Image-to-Video as a Stepping Stone
  5. Phase 4: Video Diffusion — The Modern Foundation
  6. Phase 5: Latent Video Diffusion and Video Tokenizers
  7. Phase 6: Diffusion Transformers (DiT) and Sora-Class Models
  8. Phase 7: Conditioning, Control, and Editing
  9. Phase 8: Long-Form and Consistent Video
  10. Phase 9: World Models and Interactive Video
  11. Phase 10: Training at Scale, Evaluation, and Frontier Topics
  12. Suggested Timeline
  13. Key Advice
  14. Common Pitfalls to Avoid
  15. Additional Resources
  16. Glossary

Phase 0: Prerequisites

Video generation is one of the most demanding topics in modern ML. The prerequisites are non-negotiable — and unusually, they are almost entirely owned by other guides in this collection. This guide adds the time axis; it assumes the rest.

Concepts to Know

The single most important prerequisite is image diffusion. Work through Image Generation Phases 5–8 before starting here; nearly everything below is "that, with a time axis." Specifically you should be fluent in:

  • Diffusion models (Image Gen Phase 5): forward/reverse process, DDPM, DDIM, classifier-free guidance, noise schedules
  • Score/EDM and flow matching (Image Gen Phase 6 and Phase 8): the σ-parameterization, rectified flow — most 2024+ video models train this way
  • Latent diffusion (Image Gen Phase 7): VAE encoder/decoder, training a diffusion model in latent space (e.g., Stable Diffusion)
  • DiT and image tokenizers (Image Gen Phase 3 and Phase 8): patchification, AdaLN-Zero, VQ-VAE / FSQ / LFQ
  • U-Net architecture: down/up blocks, skip connections, attention blocks
  • Transformers and ViT: self-attention, cross-attention, positional embeddings, patchification, 1D-sequence treatment of images
  • Text conditioning (a frozen CLIP/T5 encoder here; training one is Multimodal Learning's job): cross-attention for text→image
  • PyTorch fluency (PyTorch Deep Dive): mixed precision, distributed training (DDP/FSDP), memory profiling
  • Optical flow (helpful): what it is and why it shows up everywhere in video

The One Equation Everything Comes Back To

A video is a tensor of shape (T, H, W, C) — frames × height × width × channels.

Modern video generation models a distribution over this tensor:
p(x_video | text, image, audio, ...)

The dominant approach today: tokenize the video into a (T', H', W') latent
grid with a 3D VAE, then either
(a) run diffusion in that latent space (Sora, Veo, MovieGen), or
(b) autoregress over tokens in that latent space (CogVideo, Phenaki), or
(c) use a hybrid of the two.

The single hardest problem isn't the model — it's getting (T, H, W) all
big enough to be useful without compute exploding cubically.

Resources


Phase 1: Foundations — Video as a Tensor

Before models, understand the data. Video has properties that images don't, and they shape every architectural decision later.

Concepts to Learn

  • Video shape conventions: (B, T, C, H, W) (frame-major, common in PyTorch data pipelines) vs (B, C, T, H, W) (channels-first, the layout nn.Conv3d expects); both are common and easy to confuse
  • Frame rate (fps) — 24, 25, 30, 60; the same motion at different fps looks very different to a model
  • Video codecs: H.264, H.265/HEVC, AV1, VP9 — most public video is heavily compressed; this matters
  • Color spaces: YUV420 (native to most codecs) vs RGB (what your model wants)
  • Containers vs codecs: .mp4, .mov, .webm are containers; H.264, AV1 are the codecs inside them
  • Temporal redundancy: adjacent frames are nearly identical — both a problem (waste) and an opportunity (compression)
  • Motion as a signal: optical flow, motion vectors (already inside the codec), scene cuts
  • Data loading is brutal: a 1-minute 1080p clip is gigabytes uncompressed; decode-on-the-fly is mandatory
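
The frame-rate point in the list above is easy to see in index arithmetic. The following is a minimal sketch contrasting "N frames spread evenly over the clip" with "N frames at a fixed target fps". The helper names `uniform_indices` and `fixed_fps_indices` are illustrative and not part of any library.

```python
def uniform_indices(total_frames, n):
    """n indices spread evenly over the whole clip (spacing grows with clip length)."""
    if n == 1:
        return [0]
    return [round(i * (total_frames - 1) / (n - 1)) for i in range(n)]


def fixed_fps_indices(total_frames, src_fps, dst_fps, n, start=0):
    """n indices at a constant target frame rate (spacing independent of clip length)."""
    step = src_fps / dst_fps  # source frames per sampled frame
    idx = [start + round(i * step) for i in range(n)]
    if idx[-1] >= total_frames:
        raise ValueError("clip too short for the requested window")
    return idx


# A 10-second clip at 24 fps has 240 frames.
print(uniform_indices(240, 4))           # spans the entire clip
print(fixed_fps_indices(240, 24, 8, 4))  # spans well under one second
```

With uniform sampling the time between sampled frames depends on how long the clip is, so the same motion can look fast or slow to the model. Fixed-rate sampling keeps the time step constant, which is usually what a model trained at a specific fps expects.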

The Cost of a Single Clip

Resolution × fps × duration → raw tensor size

256×256, 8 fps, 2 sec → 16 frames × 256 × 256 × 3 = 3.1 MB (one clip!)
512×512, 24 fps, 5 sec → 120 frames × 512 × 512 × 3 = 94 MB
720p, 24 fps, 5 sec → 120 × 1280 × 720 × 3 = 332 MB
1080p, 24 fps, 10 sec → 240 × 1920 × 1080 × 3 = 1.5 GB

(All at 1 byte per value, i.e. uint8 as decoded; multiply by 2 for fp16 / bf16 and by 4 for fp32.)

→ A batch size of 8 at 1080p × 10s is 12 GB just for inputs as uint8, and 48 GB in fp32.
This is why every video model uses a latent VAE.
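
The arithmetic behind these figures can be reproduced with a few lines. This is a minimal sketch; the function name `raw_clip_bytes` is illustrative, and it assumes 1 byte per value (uint8) unless `bytes_per_value` says otherwise.

```python
def raw_clip_bytes(width, height, fps, seconds, channels=3, bytes_per_value=1):
    """Size in bytes of an uncompressed clip.

    bytes_per_value: 1 for uint8, 2 for fp16/bf16, 4 for fp32.
    """
    frames = int(fps * seconds)
    return frames * height * width * channels * bytes_per_value


clips = {
    "256x256, 8 fps, 2 s": (256, 256, 8, 2),
    "512x512, 24 fps, 5 s": (512, 512, 24, 5),
    "720p, 24 fps, 5 s": (1280, 720, 24, 5),
    "1080p, 24 fps, 10 s": (1920, 1080, 24, 10),
}
for name, (w, h, fps, sec) in clips.items():
    n = raw_clip_bytes(w, h, fps, sec)
    print(f"{name}: {n / 1e6:.1f} MB as uint8, {4 * n / 1e6:.1f} MB as fp32")
```

Keeping the dtype as an explicit parameter makes it harder to mix up decoded uint8 frames with the float tensors a model actually consumes, which differ in size by 2× to 4×.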

Projects

Project | Description | Difficulty
Video loader benchmark | Compare torchvision.io, decord, pyav, and ffmpeg-python on a folder of .mp4s; report decode time per clip | ⭐⭐
Frame extractor | Sample N frames evenly from a clip; sample N frames at uniform fps; observe the difference for fast vs slow scenes | ⭐⭐
Optical flow visualizer | Compute dense optical flow (RAFT, Farnebäck) between adjacent frames; color-visualize | ⭐⭐
Scene-cut detector | Detect scene boundaries via histogram or feature distance; split a movie into clips | ⭐⭐
Storage study | Take 100 clips, store as raw .npy, H.264 .mp4, and AV1 .webm; compare disk and decode speed | ⭐⭐
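
As a starting point for the scene-cut detector project, here is a minimal sketch of the histogram-distance approach. Frames are assumed to be flat lists of 8-bit intensities, and the function names, bin count, and threshold are illustrative choices rather than tuned values.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a frame given as a flat list of ints."""
    counts = [0] * bins
    for v in frame:
        counts[v * bins // max_value] += 1
    total = len(frame)
    return [c / total for c in counts]


def scene_cuts(frames, threshold=0.5):
    """Indices i where frame i starts a new scene.

    A cut is declared when the L1 distance between consecutive
    histograms exceeds `threshold` (the distance lies in [0, 2]).
    """
    cuts = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        dist = sum(abs(a - b) for a, b in zip(prev, cur))
        if dist > threshold:
            cuts.append(i)
        prev = cur
    return cuts


dark = [10] * 64     # toy 8x8 "frame" with low intensity
bright = [240] * 64  # toy 8x8 "frame" with high intensity
print(scene_cuts([dark, dark, bright, bright]))
```

Histogram distance is cheap but blind to layout, so a real pipeline would likely combine it with a feature-based distance, as the project description suggests.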

Sample Code: Loading a Video Clip with decord

import decord
import torch
from decord import VideoReader

decord.bridge.set_bridge("torch") # decode directly to torch tensors

vr = VideoReader("input.mp4", num_threads=2)
fps = vr.get_avg_fps()
total = len(vr)

# Sample 16 frames uniformly across the clip:
indices = torch.linspace(0, total - 1, 16).long().tolist()
frames = vr.get_batch(indices) # (16, H, W, 3), uint8

# Convert to (T, C, H, W) float in [-1, 1] for model input:
frames = frames.permute(0, 3, 1, 2).float() / 127.5 - 1.0

Key Insight

Every operational decision in video generation — frame rate, clip length, resolution, batch size — is a compute-vs-quality trade-off, and they all multiply. Doubling resolution = 4× compute. Doubling frame count = 2× compute. Doubling batch size = 2× compute. Doubling all three = 16×. This is why the field obsesses over latent compression and why nearly every published video model lists its exact (T, H, W) operating point as a design parameter, not an afterthought.

Resources


Phase 2: Classical and Early Neural Video Generation

The history matters — it's where you learn what doesn't work and why. Skim, don't memorize.

Concepts to Learn

  • Frame interpolation — generating intermediate frames between two real ones (FILM, Super SloMo); a "video generation lite"
  • Future frame prediction — given a few frames, predict the next ones (early benchmark: Moving MNIST)
  • Video GANs:
    • VGAN, TGAN — early attempts, low quality
    • MoCoGAN — disentangled motion and content
    • DVD-GAN — first plausible-quality short clips
    • StyleGAN-V — applied StyleGAN's latent space to video
  • Autoregressive pixel models: Video Pixel Networks (VPN), slow but principled
  • Recurrent approaches: ConvLSTM, PredRNN — used widely before transformers won
  • The limits of these approaches: short, low-resolution, no text conditioning, mode collapse for GANs

Why These Mostly Stopped

Around 2022 the field made three near-simultaneous moves that made
older approaches obsolete:

1. Diffusion proved itself on images (DDPM → Imagen, Stable Diffusion).
2. Latent compression made it tractable for high resolution.
3. Text-image pretraining produced strong text conditioning for free.

Video inherited all three. GAN-based and pure-recurrent video generation
have not seriously competed with diffusion since ~2023.

Projects

Project | Description | Difficulty
Moving MNIST predictor | Train a ConvLSTM to predict the next 10 frames given 10; classic baseline | ⭐⭐⭐
FILM frame interpolation | Use a pretrained FILM to interpolate between two real frames; observe motion artifacts | ⭐⭐
Tiny video GAN | Train a small video GAN on UCF-101 face crops; observe mode collapse firsthand | ⭐⭐⭐⭐
Read MoCoGAN | Implement just the latent-disentanglement idea (content + motion latents) in a small VAE | ⭐⭐⭐

Key Insight

The pre-diffusion era of video generation is a graveyard of clever ideas that didn't scale. Most of them — disentangled motion latents, hierarchical generation, two-stream architectures — have since reappeared as components inside diffusion-based systems. The ideas were right; the training framework was wrong.

Resources


Phase 3: Image-to-Video as a Stepping Stone

Before generating video from scratch, generate video from an image. This is the conceptually simplest version of the problem and the most practical to start training on.

Concepts to Learn

  • The image-to-video (I2V) task — given one frame, produce a clip starting from it
  • Conditioning on a still image: concatenate, cross-attend, or AdaLN modulation
  • Motion buckets / motion scores — letting the user control "how much motion"
  • Camera control — explicit camera trajectory as a side input (CameraCtrl, MotionCtrl)
  • The two main outputs of an I2V model: short clips (2–5 sec) and animated stills (subtle motion, longer)
  • Stable Video Diffusion (SVD) — the canonical open-weights I2V model; starts from a pretrained image latent diffusion model, inserts temporal layers, and fine-tunes on video
  • AnimateDiff — adds a "motion module" to any community Stable Diffusion checkpoint without retraining the base

Why I2V Is Easier Than T2V

Text-to-video (T2V): text → video (no anchor; must invent everything)
Image-to-video (I2V): image → video (first frame fixes appearance,
model only models motion)
Video-to-video (V2V): video → video (style transfer / restyling)

I2V's training signal is also cheaper: any video is automatically a
training example — first frame is the condition, the rest is the target.
No paired text needed.
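
The "any video is a training example" point can be made concrete. Below is a minimal sketch that slices a long video into fixed-length clips and splits each into a (condition, target) pair. Frames are stand-in integers here, and `i2v_examples` is a hypothetical helper name.

```python
def i2v_examples(video, clip_len, stride):
    """Yield (first_frame, remaining_frames) pairs from a sequence of frames.

    video: any sequence of frames (here plain ints stand in for tensors).
    clip_len: frames per training clip, including the conditioning frame.
    stride: offset between the starts of consecutive clips.
    """
    if clip_len < 2:
        raise ValueError("clip_len must include a condition and a target frame")
    for start in range(0, len(video) - clip_len + 1, stride):
        clip = video[start:start + clip_len]
        yield clip[0], clip[1:]


video = list(range(10))  # ten "frames"
for condition, target in i2v_examples(video, clip_len=4, stride=3):
    print("condition:", condition, "target:", target)
```

No caption is involved anywhere in this loop, which is the reason I2V data is so much cheaper to assemble than text-video pairs.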

Projects

Project | Description | Difficulty
Run SVD inference | Generate 14-frame and 25-frame clips with Stable Video Diffusion from arbitrary images | ⭐⭐
AnimateDiff tour | Plug AnimateDiff's motion module into a community SD 1.5 checkpoint; generate animated stills | ⭐⭐⭐
Tiny I2V model | Add 3D temporal conv layers to a frozen SD 1.5 U-Net; fine-tune on 100k clips with the first frame as condition | ⭐⭐⭐⭐⭐
Motion control | Train the above with a motion-score input; verify that low scores produce subtle motion | ⭐⭐⭐⭐
Camera trajectory | Add Plücker-coordinate camera embeddings to an I2V model; verify pan/zoom controllability | ⭐⭐⭐⭐⭐

Sample Code: Inflating a 2D Conv to a (2+1)D Conv

import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Common pattern: factorize a 3D conv into spatial + temporal."""
    def __init__(self, in_c, out_c, k_s=3, k_t=3):
        super().__init__()
        # Spatial conv: acts on each frame independently (kernel 1 x k_s x k_s).
        self.spatial = nn.Conv3d(in_c, out_c, kernel_size=(1, k_s, k_s),
                                 padding=(0, k_s // 2, k_s // 2))
        # Temporal conv: mixes frames at each pixel (kernel k_t x 1 x 1).
        self.temporal = nn.Conv3d(out_c, out_c, kernel_size=(k_t, 1, 1),
                                  padding=(k_t // 2, 0, 0))

    def forward(self, x):
        # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

# Init temporal as identity (zeros + identity in middle) so a 2D-pretrained
# model passes video through unchanged at the start of training:
@torch.no_grad()
def init_temporal_as_identity(conv):
    nn.init.zeros_(conv.weight)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)  # a nonzero bias would break the identity
    middle = conv.kernel_size[0] // 2
    for c in range(min(conv.in_channels, conv.out_channels)):
        conv.weight[c, c, middle, 0, 0] = 1.0

Key Insight

The dominant pattern across nearly all 2022–2024 video diffusion models is temporal inflation: take a pretrained image model, insert temporal layers initialized as identity, and fine-tune. This preserves the pretrained spatial knowledge while learning motion on top. AnimateDiff, ModelScope, Stable Video Diffusion, and Make-A-Video all use variants of this trick. The 2024–2026 frontier (Sora, Veo, Movie Gen) abandons it in favor of training spatiotemporal models from scratch — but the inflation pattern is still the right starting point for any custom model.
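
To see why identity initialization leaves the pretrained model's behavior untouched, it helps to strip the idea down to one pixel's values over time. The following is a dependency-free sketch; `temporal_conv1d` and `identity_kernel` are illustrative names, and the zero-padded convolution is written out by hand rather than taken from a library.

```python
def temporal_conv1d(seq, kernel):
    """'Same'-padded 1D convolution along time for a list of floats (zero padding)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(seq))]


def identity_kernel(k):
    """All zeros except a 1 at the center tap."""
    w = [0.0] * k
    w[k // 2] = 1.0
    return w


pixel_over_time = [1.0, 2.0, 3.0, 4.0]

# Identity init: the temporal layer is a no-op, so the image model is preserved.
print(temporal_conv1d(pixel_over_time, identity_kernel(3)))

# After training, off-center taps become nonzero and frames start to mix.
print(temporal_conv1d(pixel_over_time, [0.25, 0.5, 0.25]))
```

At step zero the inflated network therefore produces exactly the per-frame outputs of the image model, and gradient descent can grow the off-center taps only as far as the video data justifies.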

Resources


Phase 4: Video Diffusion — The Modern Foundation

This is where the field is. Master this phase deeply; the next two are refinements.

Concepts to Learn

  • Pixel-space vs latent-space video diffusion — pixel space is impractical at any meaningful resolution; latent space is the default
  • 3D U-Nets — the natural generalization of 2D U-Nets to (T, H, W)
  • (2+1)D factorization — separate spatial and temporal layers; cheaper and easier to initialize from 2D pretrained weights
  • Temporal attention — pure attention along the time axis at each spatial position; the default in (2+1)D models
  • Spatiotemporal attention — joint attention over (T × H × W); quadratic in sequence length and very expensive
  • Video-CFG: classifier-free guidance for video; balancing text alignment against temporal coherence
  • Cascaded diffusion for video: low-res video → super-resolution → frame interpolation (Imagen Video, Make-A-Video used this; modern models do it less)
  • Noise schedules for video — empirically need lower SNR (more noise) than images at the same resolution
  • Joint image-video training — co-train on still images (treated as 1-frame video) to maintain image quality
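
Classifier-free guidance for video uses the same combination rule as for images. The sketch below shows the standard single-condition rule and, as an assumption borrowed from the two-scale formulation popularized by InstructPix2Pix, a variant with separate scales for image and text conditioning. Predictions are plain lists of floats here; in a real model they would be latent tensors, and the function names are illustrative.

```python
def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: push the prediction away from unconditional."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def dual_cfg(eps_none, eps_img, eps_img_text, s_img, s_text):
    """Two-scale guidance for an image+text conditioned model (assumed form).

    eps_none:     prediction with no conditioning
    eps_img:      prediction with the image only
    eps_img_text: prediction with image and text
    """
    return [n + s_img * (i - n) + s_text * (f - i)
            for n, i, f in zip(eps_none, eps_img, eps_img_text)]


eps_none = [0.0, 0.0]
eps_img = [0.5, 1.0]
eps_full = [2.0, 1.0]

print(cfg(eps_none, eps_full, scale=1.0))             # scale 1 reproduces the conditional
print(dual_cfg(eps_none, eps_img, eps_full, 1.0, 2.0))  # extra weight on the text direction
```

Separate scales make the trade-off explicit: a high text scale improves prompt adherence, while the image scale controls how tightly the clip stays anchored to the conditioning frame.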

The 3D U-Net Block

Input: (B, C, T, H, W)

┌─────────────────────────────────────────┐
│ Spatial conv (1×3×3)      ──┐           │  inflated 2D conv
│ Spatial self-attention    ──┤           │  shared with image weights
│ Cross-attention (text)    ──┘           │
│                                         │
│ Temporal conv (3×1×1)     ──┐           │
│ Temporal self-attention   ──┤           │  new, initialized as identity
│                           ──┘           │
└─────────────────────────────────────────┘

Modern variant: replace all "conv" with "transformer block" → DiT (Phase 6).

Projects

Project | Description | Difficulty
Inflate SD to a video model | Take a Stable Diffusion 1.5 U-Net, inflate to 3D (insert temporal conv + temporal attention), train on a small video dataset | ⭐⭐⭐⭐⭐
Joint image-video training | Co-train your inflated model on 90% images, 10% video; compare to video-only training on quality and motion | ⭐⭐⭐⭐
Temporal CFG study | Vary CFG strength independently for text and for image conditioning; observe trade-offs | ⭐⭐⭐
Cascaded super-resolution | Build a small "low-res video → high-res video" diffusion super-resolution model | ⭐⭐⭐⭐
Compare attention patterns | (2+1)D vs full spatiotemporal vs windowed spatiotemporal; measure FLOPs and quality | ⭐⭐⭐⭐

Sample Code: A (2+1)D Transformer Block for Video

import torch
import torch.nn as nn
from einops import rearrange

class Video2Plus1DBlock(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        # Zero-init the temporal-attention output projection so the model
        # behaves as a still-image model at initialization:
        nn.init.zeros_(self.temporal_attn.out_proj.weight)
        nn.init.zeros_(self.temporal_attn.out_proj.bias)

    def forward(self, x):
        # x: (B, T, S, D) where S = H*W spatial positions
        B, T, S, D = x.shape

        # Spatial: each frame independently attends within itself
        h = rearrange(x, "b t s d -> (b t) s d")
        h_norm = self.norm_s(h)
        h = h + self.spatial_attn(h_norm, h_norm, h_norm, need_weights=False)[0]

        # Temporal: each spatial position attends across time
        h = rearrange(h, "(b t) s d -> (b s) t d", b=B, t=T)
        h_norm = self.norm_t(h)
        h = h + self.temporal_attn(h_norm, h_norm, h_norm, need_weights=False)[0]

        return rearrange(h, "(b s) t d -> b t s d", b=B, s=S)

Key Insight

The single biggest design lever in video diffusion is what gets attention along the time axis. Pure (2+1)D — spatial attention then separate temporal attention — is cheap and works surprisingly well, which is why it dominated 2023–2024. Full spatiotemporal attention is much more expressive but quadratic in T×H×W, which gets prohibitive fast. Modern Sora-class models pay this cost using 3D latent compression to shrink T×H×W aggressively before attention runs. The trick is moving the expense from attention into the VAE.
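
The cost gap between the two attention patterns can be estimated by counting query-key pairs. This is a rough sketch that ignores heads, hidden size, and constant factors; the function names are illustrative.

```python
def attention_pairs_full(T, H, W):
    """Joint spatiotemporal attention: every token attends to every token."""
    n = T * H * W
    return n * n


def attention_pairs_factorized(T, H, W):
    """(2+1)D attention: spatial attention per frame, then temporal per position."""
    S = H * W
    spatial = T * S * S   # T frames, each with S x S pairs
    temporal = S * T * T  # S positions, each with T x T pairs
    return spatial + temporal


# Token grid for a short latent clip (example sizes, not from a specific model).
T, H, W = 30, 45, 80
full = attention_pairs_full(T, H, W)
fact = attention_pairs_factorized(T, H, W)
print(f"full: {full:,}  factorized: {fact:,}  ratio: {full / fact:.1f}x")
```

For this grid the full pattern needs roughly 30× more pairs than the factorized one, and the ratio grows with the number of frames. Shrinking T, H, and W in the VAE reduces the full-attention cost quadratically, which is why compression and attention design are decided together.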

Resources


Phase 5: Latent Video Diffusion and Video Tokenizers

The single most important enabler of modern video generation. If you understand the VAE in image-gen, this is the natural extension — but the engineering is much harder.

Concepts to Learn

  • Why latent space is non-negotiable — see Phase 1's storage table
  • 2D VAEs for video — run a 2D image VAE per frame; works, but no temporal compression
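  • 3D VAEs for video — compress in time as well as space; the modern default. Typical compression: 4× in time and 8× along each spatial axis, i.e. 256× fewer grid positions, or roughly 32–128× fewer values once the wider latent channels are counted (48× for 3 → 16 channels)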
  • Causal 3D VAEs — first frame encoded with itself only, later frames encoded with causal context. Lets the same model handle still images and video
  • Reconstruction quality matters more than for images — temporal flicker in the VAE shows up directly as motion artifacts
  • Discrete vs continuous latents:
    • Continuous (VAE) for diffusion models
    • Discrete (VQ-VAE, FSQ, LFQ) for autoregressive/transformer-style models — MagViT-v2 is the strongest open recipe
  • Joint training with images — same caveat as the U-Net case; helps preserve still-image quality
  • Two-stage training: train the VAE first, freeze it, then train the diffusion model in its latent space
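
The "causal" property in the list above comes down to padding only on the past side of the time axis. Here is a minimal one-dimensional sketch, assuming the left pad replicates the first frame (one common choice; zero padding is another). The function name `causal_conv1d` is illustrative.

```python
def causal_conv1d(seq, kernel):
    """Temporal convolution where output[t] depends only on seq[0..t].

    The sequence is padded on the left with copies of the first frame,
    so no output ever looks at a future frame.
    """
    k = len(kernel)
    padded = [seq[0]] * (k - 1) + list(seq)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(seq))]


kernel = [0.25, 0.25, 0.5]  # last tap is the current frame

a = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel)
b = causal_conv1d([1.0, 2.0, 9.0, 9.0], kernel)  # same past, different future
print(a[:2] == b[:2])  # earlier outputs are unaffected by later frames

# A single image is just a one-frame video and goes through the same layer.
print(causal_conv1d([5.0], kernel))
```

Because the first output only ever sees the first frame, the same encoder can process a still image (T = 1) and the opening frame of a video identically, which is what allows joint image-video training with one VAE.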

Latent Compression in Numbers

Raw clip: 120 frames × 1280×720 × 3 channels ≈ 332 M values (332 MB as uint8)

After 3D VAE:
Spatial: 8× per axis → 90×160 per frame
Temporal: 4× compression → 30 frames
Channels: e.g., 16 → (30, 90, 160, 16) ≈ 6.9 M values (≈ 14 MB in bf16)

That's 48× fewer values, and crucially, diffusion now runs over 30 latent
"frames" instead of 120. Memory and compute both drop dramatically.
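
The latent grid for any operating point follows from the compression factors. The sketch below computes it for a 120-frame 1280×720 clip with 4× temporal and 8× spatial compression. The causal branch assumes the "first frame alone, then groups of 4" convention used by several open causal VAEs, which is why those models favor clip lengths of the form 4n + 1. Function names are illustrative.

```python
def latent_shape(T, H, W, ct=4, cs=8, channels=16, causal=False):
    """Latent grid (T', H', W', C) produced by a 3D VAE with the given factors."""
    if H % cs or W % cs:
        raise ValueError("H and W must be divisible by the spatial factor")
    if causal:
        # First frame is encoded on its own; the remaining T-1 are compressed ct x.
        if (T - 1) % ct:
            raise ValueError("causal VAE expects T = ct * n + 1 frames")
        t_lat = 1 + (T - 1) // ct
    else:
        if T % ct:
            raise ValueError("T must be divisible by the temporal factor")
        t_lat = T // ct
    return (t_lat, H // cs, W // cs, channels)


def value_ratio(T, H, W, shape, in_channels=3):
    """How many times fewer values the latent holds than the RGB clip."""
    t_lat, h_lat, w_lat, c = shape
    return (T * H * W * in_channels) / (t_lat * h_lat * w_lat * c)


shape = latent_shape(120, 720, 1280)
print(shape, value_ratio(120, 720, 1280, shape))
print(latent_shape(1, 720, 1280, causal=True))    # a still image maps to T' = 1
print(latent_shape(121, 720, 1280, causal=True))
```

Note that the ratio counts values, not bytes: latents are stored as floats while decoded frames are uint8, so the saving in memory is smaller than the saving in sequence length.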

Projects

Project | Description | Difficulty
Frame-by-frame 2D VAE | Use Stable Diffusion's VAE on video frames independently; observe temporal flicker in reconstructions | ⭐⭐
Train a small 3D VAE | (B, 3, T, H, W) → (B, C, T', H', W'); compress 4× in time, 8× in space; train on UCF-101 | ⭐⭐⭐⭐⭐
Causal 3D VAE | Modify the above to causal in time so it handles single images correctly (T=1 → T'=1) | ⭐⭐⭐⭐
MagViT-v2-style tokenizer | Train a discrete video tokenizer using FSQ or LFQ quantization; measure reconstruction FID | ⭐⭐⭐⭐⭐
Diffusion on latents | Plug the 3D VAE in front of a small diffusion model from Phase 4; compare training speed and quality to pixel-space | ⭐⭐⭐⭐

Key Insight

The 3D VAE is the unsung hero of modern video generation. Sora's "patches" — its much-discussed innovation — are just patches in the latent space of a 3D VAE. The trick is that a powerful enough VAE compresses video by ~100× while preserving the information that matters for generation, so the diffusion model can train on what used to be a 100-GB clip as if it were a 1-GB clip. Every "Sora-class" model has spent serious effort on its VAE; the best ones have spent at least as much effort there as on the diffusion backbone.

Resources


Phase 6: Diffusion Transformers (DiT) and Sora-Class Models

The current frontier. As of 2026, the strongest video models are all DiT-based, trained on latent video tokens, with text conditioning via cross-attention or token concatenation.

Concepts to Learn

  • DiT (Diffusion Transformer) — Peebles & Xie's paper that replaced the U-Net with a pure transformer for image diffusion; the foundation
  • Patchification of latent video — take the 3D-VAE latents (T', H', W', C), patchify into a 1D sequence of spatiotemporal tokens
  • AdaLN-Zero — the modulation scheme that DiT uses for conditioning; surprisingly robust
  • 3D RoPE (Rotary Position Embedding) — extends 2D RoPE to time; the standard now
  • Sora's "patches" design — spacetime patches over the latent, with a variable number of patches per sample rather than a fixed grid, allowing flexible duration, resolution, and aspect ratio at inference
  • Rectified Flow / Flow Matching — modern replacement for DDPM training that's better-behaved at scale (used by SD3, Flux, and most 2024+ video models)
  • MMDiT (Multi-Modal DiT) — the SD3 architecture: text and image tokens share attention layers; extended to video in HunyuanVideo and similar
  • Open-weights frontier (these move every few months; check before assuming a leader):
    • Wan 2.1 / 2.2 (Alibaba) — among the strongest open releases; broad ecosystem of LoRAs and control adapters
    • HunyuanVideo (Tencent) — large-scale open release with a strong VAE
    • CogVideoX (THUDM) — Tsinghua's open DiT + 3D VAE; a clean reference implementation
    • Mochi 1 (Genmo) — high-quality open with an aggressive VAE
    • LTX-Video (Lightricks) — designed for near-real-time generation; good for latency experiments
    • OpenSora (HPC-AI Tech) and Open-Sora-Plan (PKU) — full open Sora-style replicas, well-documented for learning
  • Closed frontier (capabilities and names change fast):
    • Sora / Sora 2 (OpenAI) — Sora 2 adds synchronized audio and stronger physical consistency
    • Veo 2 / Veo 3 (Google DeepMind) — Veo 3 generates native synchronized audio (dialogue, SFX), a notable shift
    • Movie Gen (Meta) — the most detailed open description of a frontier-scale recipe, including joint audio
    • Kling 2.x, Hailuo / MiniMax, Runway Gen-4, Luma Dream Machine, Pika — commercial offerings
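
The rectified-flow objective mentioned in the list above is compact enough to write out. The sketch below assumes the convention that t = 0 is data and t = 1 is noise (papers differ on the direction), uses plain lists of floats instead of latent tensors, and substitutes a hand-written oracle for the trained network. All function names are illustrative.

```python
def interpolate(x_data, noise, t):
    """Point on the straight line between data (t=0) and noise (t=1)."""
    return [(1 - t) * d + t * n for d, n in zip(x_data, noise)]


def velocity_target(x_data, noise):
    """Regression target for the network: the constant slope of that line."""
    return [n - d for d, n in zip(x_data, noise)]


def euler_sample(velocity_fn, noise, steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


data = [1.0, -2.0]
noise = [0.5, 0.25]

# Stand-in for a perfectly trained model: returns the true velocity.
oracle = lambda x, t: velocity_target(data, noise)

print(interpolate(data, noise, 0.5))
print(euler_sample(oracle, noise, steps=4))  # recovers `data` along a straight path
```

With a perfect velocity field the path is a straight line, so even a handful of Euler steps lands on the data. Real models only approximate this, but the straighter paths are a large part of why flow-matching samplers tolerate low step counts.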

Sora-Style Architecture, Sketched

Text prompt ──► T5 / CLIP text encoder ──► text tokens

Video latent: (T'=30, H'=90, W'=160, C=16) from 3D VAE (120 frames at 720p, compressed 4× in time, 8× in space)

▼ patchify (e.g., 2×2×2 patches)
Patches: 15 × 45 × 80 = 54,000 video tokens, each of dim 2×2×2×C = 128 → projected to D

▼ concat or cross-attend with text tokens

▼ MANY transformer blocks (e.g., 40–80 blocks, hidden dim 1500+)
│ each with: 3D-RoPE, self-attention over all video+text tokens,
│ AdaLN modulation from timestep & conditioning, MLP

▼ predict noise (or velocity, for rectified flow)

▼ DDPM/flow-matching loss on the (B, N_tokens, D) prediction

At inference:
start from Gaussian noise in latent space
denoise over ~30–50 steps (flow matching with few steps)
un-patchify, decode through 3D VAE → pixel video
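
The patchify step in the sketch above determines both the sequence length and the per-token input dimension. Here is a minimal calculation, along with the (t, h, w) index of each token that a 3D positional scheme such as 3D RoPE would consume. Function names are illustrative, and the row-major flattening order is an assumption.

```python
def patchify_grid(latent_shape, patch=(2, 2, 2)):
    """Token grid, token count, and raw token dimension for a latent video.

    latent_shape: (T', H', W', C) from the VAE
    patch:        (p_t, p_h, p_w) patch size in latent units
    """
    T, H, W, C = latent_shape
    pt, ph, pw = patch
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dims must be divisible by the patch size")
    grid = (T // pt, H // ph, W // pw)
    n_tokens = grid[0] * grid[1] * grid[2]
    token_dim = pt * ph * pw * C  # values flattened into one token
    return grid, n_tokens, token_dim


def token_positions(grid):
    """(t, h, w) index of every token, in row-major flattening order."""
    gt, gh, gw = grid
    return [(t, h, w) for t in range(gt) for h in range(gh) for w in range(gw)]


grid, n_tokens, token_dim = patchify_grid((30, 90, 160, 16))
print(grid, n_tokens, token_dim)
print(token_positions((2, 2, 2))[:3])
```

Every factor of 2 in the patch size cuts the token count by 2× along that axis, and since attention cost grows with the square of the token count, patch size is one of the cheapest levers for trading detail against compute.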

Projects

Project | Description | Difficulty
Implement DiT for video | Take a published DiT image implementation; extend to (T, H, W) patches and 3D RoPE; train on a small video dataset | ⭐⭐⭐⭐⭐
Flow matching from scratch | Replace DDPM with rectified flow / flow matching in a small video DiT; compare convergence | ⭐⭐⭐⭐
Read and reproduce OpenSora | Run inference on a pretrained OpenSora checkpoint; modify one component (e.g., the VAE), retrain | ⭐⭐⭐⭐⭐
MMDiT for video | Implement the SD3-style joint text-video attention; verify text adherence improves | ⭐⭐⭐⭐⭐
Variable resolution | Modify your DiT to handle arbitrary (T, H, W) at inference (Sora's claim); test on aspect ratios it didn't see at training | ⭐⭐⭐⭐⭐

Key Insight

The shift from U-Net to DiT in image gen took ~18 months to play out fully (~2022→2024). The same shift in video gen is happening now, faster, because the lesson has been learned. Any video model started in 2025 onward almost certainly uses a transformer backbone, latent input, and flow matching. If you're learning the field for the first time, you can largely skip U-Net-based video models — they're being deprecated in real time. Understand them historically; build on DiT.

Resources


Phase 7: Conditioning, Control, and Editing

Generating a video is one thing; generating the video you want is another. This phase is about everything that wraps the core model.

Concepts to Learn

  • Text conditioning — T5 vs CLIP vs both (SD3-style "use several encoders"); long-prompt handling
  • Image conditioning — first-frame conditioning (I2V), last-frame, both, keyframes
  • Video-to-video — restyling, depth-conditioned, pose-conditioned (ControlNet-Video)
  • Camera control — explicit camera pose embeddings (Plücker coordinates) or motion-bucket conditioning
  • Motion control — bounding-box trajectories, sparse motion strokes, dense motion maps
  • Identity preservation — keeping a specific character or object consistent (DreamBooth-Video, ID-Animator)
  • Audio-conditioned video — talking-head models (SadTalker, EMO, V-Express, Hallo) that synchronize lip motion to the audio
  • Video editing:
    • Inversion-based editing — invert the video into latent noise, edit, denoise
    • Token Merging for Video — runtime acceleration
    • Rerender / TokenFlow — style transfer with temporal consistency
  • Negative prompts for video — what unwanted artifacts you can subtract
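
The Plücker-coordinate camera embedding mentioned above encodes each pixel's viewing ray as a 6-vector. Below is a minimal sketch for a single ray, assuming the (moment, direction) ordering, i.e. (o × d, d); some codebases use the reverse order. Function names are illustrative.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_ray(origin, direction):
    """6-vector (o x d, d) for the ray through `origin` along `direction`.

    The direction is normalized first so the embedding does not depend
    on the length of the input vector.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    moment = cross(origin, d)
    return moment + d


# Camera at x = 1 looking along +y.
print(plucker_ray((1, 0, 0), (0, 1, 0)))

# Sliding the origin along the ray gives the same coordinates.
print(plucker_ray((1, 3, 0), (0, 1, 0)))
```

The embedding depends only on the ray itself, not on which point along it is chosen as the origin. Computing one such vector per pixel per frame yields a dense (T, H, W, 6) map that can be fed to the model alongside the latents.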

A Taxonomy of Video Generation Tasks

INPUT                       → TASK                      EXAMPLES
──────────────────────────    ────────────────────────  ────────────────────────
text                        → T2V                       Sora, Veo, Kling
text + image                → T+I2V (frame-locked)      Wan I2V, Kling I2V
text + first+last frame     → keyframe interpolation    Frame Genie, Wan-FLF2V
image                       → I2V (motion only)         SVD, animated stills
video + text                → V2V restyle               Rerender, TokenFlow
video + pose/depth          → controlled V2V            AnimateAnyone, ControlNet-Video
audio + image               → talking head              EMO, Hallo, V-Express
text + camera trajectory    → cinematic T2V             MotionCtrl, CameraCtrl
text + object trajectory    → trajectory-controlled     Boximator, DragAnything

Projects

Project | Description | Difficulty
Long-prompt handling | Train or fine-tune with T5-XXL prompts (up to 256 tokens); compare against CLIP-L conditioning on adherence | ⭐⭐⭐⭐
ControlNet-Video | Adapt ControlNet to a video diffusion model; condition on depth maps across all frames | ⭐⭐⭐⭐
Camera control | Add Plücker-coordinate camera embeddings; verify pan / zoom / orbit work | ⭐⭐⭐⭐
Talking head | Run EMO or Hallo on a portrait + audio clip; fine-tune for a specific speaker | ⭐⭐⭐⭐
Video inversion + edit | Invert a real clip into latent noise; replace an object via prompt edit | ⭐⭐⭐⭐⭐
LoRA for video | Train a video LoRA on ~50 clips of a specific style or character | ⭐⭐⭐⭐

Key Insight

In image generation, ControlNet and its successors made the difference between "generate something cool" and "generate exactly what I want." Video is following the same trajectory but several years behind. The 2026 frontier in video isn't just bigger models — it's better control surfaces: camera trajectories, character consistency across cuts, dialogue lip sync, scene-level keyframe control. Whoever ships the "ControlNet moment" for video at the right level of abstraction defines the next generation of commercial tools.

Resources


Phase 8: Long-Form and Consistent Video

The hardest open problem in video generation. Today's best models produce 5–10 seconds of beautiful video and then fall apart. Closing the gap to minute-long, story-coherent generation is the active frontier.

Concepts to Learn

  • Why long video is hard:
    • Compute scales at least linearly with length, usually worse
    • Drift: small errors compound; characters morph, scenes contradict themselves
    • Memory: ~30s of latent tokens is already in the 100k–1M range — context window pain
    • No long paired text-video data at scale
  • Sliding-window approaches — generate overlapping clips, blend in latent or pixel space (FreeNoise, Gen-L-Video)
  • Hierarchical generation:
    • Generate keyframes first, then fill in between
    • Storyboard / shot decomposition (think a director's storyboard, not raw video)
  • Autoregressive video models — predict the next chunk of frames conditioned on the previous chunk; long but expensive
  • Diffusion Forcing — assign each frame its own noise level so a model can denoise and roll out autoregressively at once; the bridge between full-sequence diffusion and next-frame autoregression
  • Autoregressive distillation for streaming — distill a bidirectional diffusion teacher into a causal, few-step student that emits frames as it goes (CausVid, Self-Forcing); the current recipe for real-time/infinite-length generation
  • Anchor frames / scene tokens — explicit memory of "this character looks like X"
  • Streaming generation — emit frames as you generate them (StreamingT2V, CausVid, Self-Forcing)
  • Multi-shot / multi-scene — VideoTetris, DreamFactory, MovieDreamer; combine LLM-planned shot lists with per-shot generation
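
The chunk-by-chunk autoregressive pattern from the list above reduces to a short loop. In this sketch `generate_chunk` is a hypothetical stand-in for a video model that maps the most recent frames to the next few frames, and frames are plain integers so the control flow is easy to follow.

```python
def rollout(generate_chunk, first_chunk, n_chunks, context_frames):
    """Generate a long sequence one chunk at a time.

    generate_chunk: callable taking the last `context_frames` frames and
                    returning a list of new frames (hypothetical model call).
    first_chunk:    frames that seed the rollout (e.g., from a T2V model).
    """
    video = list(first_chunk)
    for _ in range(n_chunks):
        context = video[-context_frames:]  # the model never sees older frames
        video.extend(generate_chunk(context))
    return video


# Toy "model": continues a counter by two frames per call.
def toy_model(context):
    last = context[-1]
    return [last + 1, last + 2]


print(rollout(toy_model, first_chunk=[0, 1], n_chunks=3, context_frames=2))
```

The slice that builds `context` is where drift comes from: anything that fell out of the window, such as a character's original appearance, can no longer constrain the next chunk. Anchor frames and scene tokens are ways of putting that information back into the context.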

Two Architectural Approaches to Length

A. Sliding window with overlap (post-hoc):
[clip 1: frames 0-15]
[clip 2: frames 8-23] ← 8 frames of overlap, blended
[clip 3: frames 16-31]
...

+ Cheap, works with any existing T2V model
- Long-range coherence is whatever the overlap can carry forward

B. Hierarchical (designed-in):
text → LLM "director" → shot list (S1, S2, S3, ...)
↓ each shot, per-shot:
keyframes → fill-in T2V model → 5-sec clip
↓ stitch shots
+ consistency model to harmonize identity across shots

+ Can in principle produce minutes of coherent story
- Three or four separate models; complex to train and orchestrate
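
Approach A can be sketched in a few lines: compute where each window starts, then cross-fade the overlapping frames. The linear blend is one simple choice among several, frames are single floats for readability, and all function names are illustrative.

```python
def window_starts(total_frames, window, overlap):
    """Start index of each clip so consecutive clips share `overlap` frames."""
    stride = window - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the window")
    starts = list(range(0, max(total_frames - window, 0) + 1, stride))
    if starts[-1] + window < total_frames:
        starts.append(total_frames - window)  # final clip flush with the end
    return starts


def blend_weights(overlap):
    """Weight of the incoming clip at each overlapping frame (linear ramp)."""
    return [(i + 1) / (overlap + 1) for i in range(overlap)]


def blend(prev_tail, next_head):
    """Cross-fade the end of one clip into the start of the next."""
    weights = blend_weights(len(prev_tail))
    return [(1 - w) * p + w * n
            for w, p, n in zip(weights, prev_tail, next_head)]


print(window_starts(32, window=16, overlap=8))  # clips start at frames 0, 8, 16
print(blend([0.0, 0.0, 0.0], [4.0, 4.0, 4.0]))
```

Blending hides the seam but does not make the clips agree with each other, which is why the long-range coherence of this approach is limited to what the overlap can carry forward.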

Projects

Project | Description | Difficulty
Sliding-window T2V | Take an open T2V model; generate 30 seconds by overlapping 5-sec clips; blend in latent space | ⭐⭐⭐⭐
Keyframe interpolation | Generate 4 keyframes 5 sec apart, then use an I2V or interpolation model to fill in | ⭐⭐⭐⭐
Character consistency | Use a reference-image encoder (IP-Adapter / character LoRA) across multiple shots; measure drift | ⭐⭐⭐⭐⭐
LLM shot planner | Use a small LLM to expand "a knight rescues a princess" into a JSON shot list; generate each shot; evaluate coherence | ⭐⭐⭐⭐⭐
Streaming T2V | Implement chunk-by-chunk generation with a cached KV state across chunks; measure latency vs quality | ⭐⭐⭐⭐⭐

Key Insight

Long-form video generation has the same shape as the long-context problem in LLMs three years ago — exciting demos, brittle outputs, no clear winning architecture, and a half-dozen credible bets. Sliding window, hierarchical planning, and autoregressive generation are not converging the way DiT converged for short video. Expect this to be the dominant frontier topic through 2026–2027.

Resources


Phase 9: World Models and Interactive Video

Where video generation stops being "I make pretty clips" and becomes "I simulate the world."

Concepts to Learn

  • What a world model is — a generative model that, given a state and an action, predicts the next state. A video model conditioned on actions is a world model. This phase owns the generative side; using the model as an environment to learn a policy is RL Phase 6 (Model-Based RL)
  • The Dreamer line, in one sentence — Hafner et al.'s DreamerV1/V2/V3 learn a latent world model and train a policy by imagining rollouts in it. We borrow the generative idea (predict the next latent given an action); the policy-learning loop and the RL objective are covered in the RL guide
  • Genie, Genie 2 (DeepMind) — playable, action-conditioned video models trained on web video
  • GameNGen (Google) — a real-time playable Doom simulation, entirely neural
  • GAIA-1 / GAIA-2 (Wayve) — driving world models
  • NVIDIA Cosmos — a world-foundation-model platform aimed at training and evaluating embodied/robot policies; the bridge to Robotics
  • OASIS / Decart — open neural Minecraft
  • Latent action models — inferring actions from unlabeled video (so you can train world models without paired actions)
  • Real-time constraints — < 50 ms/frame for interactivity. Forces distillation, caching, or smaller models — the same autoregressive-distillation toolkit as Phase 8
  • Connection to physical RL and robotics — world models are policy-rollouts-as-video; the same model can serve as a simulator for an RL agent (RL Phase 6) or as a learned simulator for an embodied policy (Robotics Phase 9)
  • Connection to multimodal — a fully general world model is multimodal: text in, video out, with audio, actions, and physics. Joint cross-modal understanding is Multimodal Learning's territory
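
The real-time constraint is easiest to reason about as a per-frame budget. The sketch below is simple arithmetic under assumed numbers: the helper names and the example per-step and overhead timings are hypothetical, not measurements of any model.

```python
def frames_per_second(ms_per_frame):
    """Convert a per-frame latency into a frame rate."""
    return 1000.0 / ms_per_frame


def max_denoising_steps(budget_ms, ms_per_step, overhead_ms=0.0):
    """Largest number of denoising steps per frame that fits the budget.

    `overhead_ms` stands for fixed per-frame costs such as VAE decoding.
    """
    if ms_per_step <= 0:
        raise ValueError("ms_per_step must be positive")
    return max(int((budget_ms - overhead_ms) // ms_per_step), 0)
```

A 50 ms budget corresponds to 20 frames per second. If one denoising step took 10 ms and decoding took 8 ms (both assumed figures), only 4 steps would fit, which is why few-step distillation shows up in every interactive system.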

The World Model Loop

                         ┌──────────────────────────┐
(state s_t)  ───────────►│       World model        │
(action a_t) ───────────►│  p(s_{t+1} | s_t, a_t)   │──────► (frame s_{t+1})
                         └──────────────────────────┘

(s_{t+1} fed back as next s_t)

Run this in a loop, with actions from a human (interactive game),
an RL policy (sim-for-RL), or a planner (model-based control).

A world model is a video generator that also takes actions —
or equivalently, a video generator IS a world model when "action"
is the empty string.
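
The loop above can be sketched as an autoregressive rollout. Everything here is a toy under stated assumptions: `toy_world_model` is a hypothetical deterministic stand-in for a learned p(s_{t+1} | s_t, a_t), the "frame" is a one-dimensional strip with a single lit pixel, and the action set is invented for illustration.

```python
ACTIONS = {"left": -1, "stay": 0, "right": 1}


def toy_world_model(state, action):
    """Stand-in for p(s_{t+1} | s_t, a_t) on a 1-D frame with one lit pixel.

    A real world model would sample the next frame (or latent) from a learned
    network; this deterministic toy only mimics the interface.
    """
    pos = state.index(1)
    new_pos = min(max(pos + ACTIONS[action], 0), len(state) - 1)
    return [1 if i == new_pos else 0 for i in range(len(state))]


def rollout(model, state, policy, steps):
    """Autoregressive rollout: each predicted frame becomes the next state."""
    frames = [state]
    for t in range(steps):
        action = policy(state, t)     # human input, RL policy, or planner
        state = model(state, action)  # predicted s_{t+1} is fed back as s_t
        frames.append(state)
    return frames
```

Swapping the `policy` callable is what switches between the three uses named above: a function reading keyboard input gives an interactive game, a learned policy gives sim-for-RL, and a planner gives model-based control.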

Projects

Project | Description | Difficulty
Action-conditioned video | Take a small video diffusion model; add a discrete-action input (e.g., 4 game actions); train on a simple game's recorded play | ⭐⭐⭐⭐⭐
GameNGen reproduction (mini) | Train an action-conditioned model on a simpler game (Atari, GridWorld) and play it interactively | ⭐⭐⭐⭐⭐
Latent action inference | Train a model to infer the latent action between two adjacent frames in unlabeled video (Genie-style) | ⭐⭐⭐⭐⭐
World model for RL | Use a learned world model to roll out trajectories; train a policy in the dream (DreamerV3-light) | ⭐⭐⭐⭐⭐
Real-time latency hunt | Distill a 30-fps diffusion video model into a 4-step (or 1-step) consistency model; measure ms/frame | ⭐⭐⭐⭐

Key Insight

World models are the convergence point of three lines of research that are usually taught separately: video generation, model-based RL, and simulation. Each of those communities approaches the same object from a different angle — generation people care about visual fidelity, RL people care about action conditioning and rollouts, simulation people care about physical realism. The 2025–2026 frontier is increasingly the same model used in all three roles. This guide owns the visual-fidelity, action-conditioning, and rollout-generation side; the control side lives in RL Phase 6 and Robotics Phase 9. If you've completed those guides and this one, you're well-positioned to work at the intersection.

Resources


Phase 10: Training at Scale, Evaluation, and Frontier Topics

This last phase is the operational reality of video generation: data, compute, eval, and what's still open.

Training at Scale

  • Data sources:
    • Public: HD-VILA, WebVid (deprecated), Panda-70M, OpenVid-1M, Koala-36M
    • Proprietary: most strong models train on private licensed video libraries
  • Caption generation — public video has terrible captions; recaption with a strong VLM (Qwen2-VL, LLaVA, GPT-4o) before training. This is the single highest-leverage data trick
  • Aspect-ratio bucketing — train on multiple aspect ratios together for variable-resolution inference
  • Clip extraction — scene detection + filtering (motion score, aesthetic score, OCR-text score)
  • Curriculum — start at low resolution and short duration, scale up gradually
  • Compute (rough order-of-magnitude estimates, since exact figures are rarely published): a frontier text-to-video model is on the order of 10²⁵–10²⁶ FLOPs of training; an open replication is 10²³–10²⁴
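
Aspect-ratio bucketing from the list above can be sketched as follows. The bucket resolutions and the helper names `nearest_bucket` and `bucketed_batches` are hypothetical choices for illustration; real pipelines pick buckets to match their model's patch size and memory budget.

```python
import math
from collections import defaultdict

# Hypothetical (width, height) buckets with roughly similar pixel counts.
BUCKETS = [(512, 512), (640, 384), (384, 640), (704, 320), (320, 704)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest in log space."""
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


def bucketed_batches(clips, batch_size):
    """Group (clip_id, width, height) tuples so each batch has one resolution."""
    groups = defaultdict(list)
    for clip_id, width, height in clips:
        groups[nearest_bucket(width, height)].append(clip_id)
    for bucket, ids in groups.items():
        for i in range(0, len(ids), batch_size):
            yield bucket, ids[i:i + batch_size]
```

Comparing ratios in log space treats 16:9 and 9:16 symmetrically, so portrait clips are not systematically pushed toward the square bucket. Each yielded batch shares one resolution, so it can be stacked into a single tensor after resizing.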

Evaluation

The evaluation problem in video generation is worse than in image generation, which is already bad.

  • Automatic metrics:
    • FVD (Fréchet Video Distance) — the standard, but criticized for poor correlation with human judgment
    • CLIPScore-Video, VideoCLIP — text-video alignment
    • VBench — comprehensive benchmark suite; the closest thing to a standard
    • EvalCrafter — open evaluation harness
  • Human evaluation — still the gold standard; pairwise comparisons, win rates
  • Physical correctness — does water behave like water? Do objects persist when occluded? Largely unmeasured
  • The Sora technical report highlights properties such as 3D consistency and object permanence; these still don't have clean benchmarks
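
To make the FVD bullet less abstract: it is a Fréchet distance between Gaussians fitted to features of real and generated videos. The sketch below is a simplification and not the standard FVD implementation: it assumes diagonal covariances (real FVD uses full covariance matrices and features from a pretrained video network), and the function name is hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diagonal(real_feats, gen_feats):
    """Fréchet distance between two feature sets, assuming diagonal covariance.

    Under that assumption the distance reduces, per feature dimension, to
    (mean_r - mean_g)^2 + (std_r - std_g)^2, summed over dimensions.
    """
    total = 0.0
    for d in range(len(real_feats[0])):
        r = [f[d] for f in real_feats]
        g = [f[d] for f in gen_feats]
        mean_term = (fmean(r) - fmean(g)) ** 2
        std_term = (math.sqrt(pvariance(r)) - math.sqrt(pvariance(g))) ** 2
        total += mean_term + std_term
    return total
```

The distance only compares feature statistics over a whole set of videos, so it can stay low while individual clips show obvious artifacts. That is one reason the list above pairs it with benchmark suites and human evaluation.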

Frontier Topics

  • Real-time and streaming video generation — distillation to 1–4 steps, consistency models, autoregressive caching (CausVid, Self-Forcing); LTX-Video-style architectures built for latency. Increasingly the difference between a demo and a product
  • Native audio-video joint generation — as of 2025 this has gone from research to product: Veo 3 generates synchronized dialogue and SFX, Sora 2 adds audio, Movie Gen describes a joint recipe. Native AV models are replacing post-hoc dubbing
  • Multi-character, multi-scene narratives — see Phase 8
  • Physical realism — making fluid behave like fluid, deformable objects deform correctly
  • 3D-consistent video — output that's consistent under camera change (videos that can be re-rendered from a new viewpoint); bridges to NeRF / 3D Gaussian Splatting
  • Editable / re-renderable output — output something more structured than pixels (e.g., a 3D scene + camera path)
  • Safety: deepfake detection, watermarking (e.g., SynthID), content moderation, consent
  • Interactive / playable — see Phase 9
  • Long-context multimodal video — feeding hours of video to a VLM for understanding; the inverse direction, but adjacent

Projects

Project | Description | Difficulty
Run VBench end to end | Evaluate an open T2V model on the full VBench suite; reproduce a leaderboard number | ⭐⭐⭐
Recaption a dataset | Take 100k clips with bad captions, recaption with a strong VLM, train a small model on each — compare quality | ⭐⭐⭐⭐
Aspect-ratio bucketing | Implement bucketed batching for variable aspect ratios; observe quality improvement on portrait/wide test sets | ⭐⭐⭐
Consistency-model distillation | Distill a 50-step video diffusion model into a 4-step student; measure speed and quality loss | ⭐⭐⭐⭐⭐
Watermarking | Add invisible watermarking to your model's outputs; verify with a detector | ⭐⭐⭐⭐
Physical-plausibility probe | Build 50 trick prompts (water flowing uphill, dropped objects floating); evaluate open models | ⭐⭐⭐

Key Insight

Video generation in 2026 is where text generation was around 2021 — extraordinary demos, frustrating gap to product, two or three competing architectural bets, and absolutely no consensus on evaluation. The compute frontier is rapidly closing in on the data frontier: training a Sora-class model is no longer compute-impossible for many organizations, but obtaining the licensed long-form video to train it on is now the harder problem. If you're entering the field, the highest-leverage skills are not architecture (it's converging on DiT) — they're data engineering, evaluation, and control surfaces.

Resources


Suggested Timeline

Phase | Duration | Outcome
0. Prerequisites | 0–2 weeks | Image diffusion + multimodal foundations solid
1. Foundations | 1 week | Comfortable loading and decoding video data
2. Classical | 1 week | Familiar with the pre-diffusion approaches; skim only
3. I2V | 1–2 weeks | Built or fine-tuned an image-to-video model
4. Video diffusion | 3 weeks | Inflated a 2D U-Net to 3D and trained on small video
5. Latent + VAE | 2–3 weeks | Trained a 3D VAE; diffusion runs in its latent space
6. DiT | 3–4 weeks | Implemented or ran a DiT-based video model; understand flow matching
7. Conditioning | 2 weeks | Added at least two control signals (camera, depth, character)
8. Long-form | 2–3 weeks | Sliding window or hierarchical pipeline working end to end
9. World models | 2–3 weeks | Trained an action-conditioned model; can roll out interactively
10. Scale + eval | Ongoing | Real benchmark evaluation; data pipeline understood

Total to "comfortable practitioner": ~4–5 months of focused study. Frontier-research-comfortable: closer to a year.


Key Advice

  1. Don't try pixel-space. Past 64×64 it's wasted compute. Latent space is non-negotiable.
  2. Inflate first, train from scratch later. Your first video model should reuse pretrained image weights. Going scratch is a frontier-lab activity.
  3. Joint image-video training. Co-training preserves still-image quality and dramatically helps data efficiency.
  4. Recaption your data. Web alt-text and YouTube descriptions are terrible. A strong VLM recaptioning your training video is the highest-leverage single change you can make.
  5. The VAE matters as much as the diffusion model. Bad reconstructions cap your output quality. Spend serious effort here.
  6. Profile decoding. Most video-training pipelines are bottlenecked on video decoding, not on the GPU. Use decord, prefer keyframe-aligned sampling, cache when possible.
  7. Use bf16 everywhere on Ampere or newer GPUs. bf16 keeps fp32's exponent range, so the loss scaling (GradScaler) that float16 needs is unnecessary friction.
  8. Aspect ratios matter. Train on multiple bucket ratios; resist the urge to center-crop everything to square.
  9. Evaluate with a suite. Don't trust a single FVD number. Use VBench plus human eval, and report failures honestly.
  10. Watch the open-source frontier. OpenSora, CogVideoX, HunyuanVideo, Mochi, Wan — these move every few months. The state of "what an individual researcher can run" changes faster here than anywhere else in ML.
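
Advice 6 mentions keyframe-aligned sampling; the index arithmetic behind it can be sketched without any decoding library. The helper names are hypothetical, and the keyframe list is assumed to come from whatever your decoder reports.

```python
def uniform_in_time_indices(num_frames, src_fps, target_fps, clip_len, start_sec=0.0):
    """Frame indices that sample `clip_len` frames at `target_fps` in wall-clock time."""
    indices = []
    for k in range(clip_len):
        t = start_sec + k / target_fps  # timestamp of the k-th sample in seconds
        idx = round(t * src_fps)        # nearest source frame to that timestamp
        if idx >= num_frames:
            break                       # clip would run past the end of the video
        indices.append(idx)
    return indices


def snap_start_to_keyframe(start_idx, key_indices):
    """Move a clip start back to the closest keyframe at or before it."""
    candidates = [k for k in key_indices if k <= start_idx]
    return max(candidates) if candidates else 0
```

Sampling by timestamp rather than by a fixed frame stride keeps motion speed consistent across sources with different frame rates: a 30 fps and a 60 fps file sampled at 10 fps both cover the same 0.1 s per step. Starting a clip on a keyframe avoids decoding the frames between the previous keyframe and the requested start.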

Common Pitfalls to Avoid

  • ❌ Trying to train pixel-space diffusion at meaningful resolution
  • ❌ Using a 2D VAE per frame and being surprised by temporal flicker
  • ❌ Ignoring the VAE and treating it as a fixed black box
  • ❌ Training only on video and watching still-image quality collapse
  • ❌ Loading video with PIL frame by frame instead of decord
  • ❌ Storing decoded frames as fp32 on disk
  • ❌ Using CLIP-L for text conditioning when prompts are >77 tokens (use T5)
  • ❌ Reporting only FVD with no human eval
  • ❌ Trying to generate >10 seconds without a long-form strategy
  • ❌ Forgetting to validate frame-by-frame consistency, not just per-frame quality

Additional Resources

Books and Long-Form Reading

Key Papers, Chronologically

Year | Paper | Contribution
2022 | Video Diffusion Models | First principled paper
2022 | Make-A-Video | Text-conditioned video, image-pretrain trick
2022 | Imagen Video | Cascaded high-res video
2023 | Align Your Latents | Latent video diffusion
2023 | Stable Video Diffusion | Open I2V baseline
2023 | AnimateDiff | Motion module, community SD
2023 | MagViT-v2 | Best discrete video tokenizer
2024 | Sora technical report | DiT + variable patches
2024 | GameNGen | Real-time neural Doom
2024 | CogVideoX | Strong open DiT + VAE
2024 | Movie Gen | Frontier-scale recipe, open description, joint audio
2024 | HunyuanVideo | Large open release
2024 | Genie 2 | Foundation world model
2024 | Diffusion Forcing | Per-token noise levels; AR rollout meets diffusion
2024 | CausVid | Causal distillation for streaming generation
2025 | NVIDIA Cosmos | World-foundation-model platform for embodied AI
2025 | Self-Forcing | Closes the AR train/inference gap; real-time long video

Tools You Should Know

  • decord — fast video loading
  • diffusers (Hugging Face) — for inference and quick prototyping
  • OpenSora / CogVideoX / HunyuanVideo — open training stacks
  • VBench — evaluation harness
  • ComfyUI — for rapid pipeline prototyping with open models
  • ffmpeg — you will need it

Communities


Quick Start Checklist

  • Can load a video clip with decord and explain the difference between sampling by frame index and sampling uniformly in time
  • Can explain why latent space is mandatory for video generation
  • Have run inference on Stable Video Diffusion and AnimateDiff
  • Have inflated a 2D U-Net to a 3D (or (2+1)D) model and trained it on small video
  • Have trained or used a 3D VAE; understand causal video VAEs
  • Have read the Sora technical report end to end
  • Have implemented or run a DiT-based video model
  • Understand flow matching as well as DDPM
  • Can add a control signal (camera, depth, pose) to a video model
  • Have generated >10 sec of video with a long-form strategy
  • Have evaluated a model on VBench (or a substantial subset)
  • Have at least skimmed an action-conditioned world model paper (Genie 2 or GameNGen)

License

MIT License. See the LICENSE file for details.