MagViT-v2-Style Tokenizer
Key Insight
Diffusion models want continuous latents, but autoregressive transformer models want discrete tokens, and MagViT-v2 is one of the strongest published recipes for turning video into a grid of discrete codes. This project rebuilds its core idea: instead of a learned codebook (which can suffer codebook collapse, where most entries go unused), it discretizes each latent with LFQ (lookup-free quantization, the scheme MagViT-v2 introduced) or FSQ (finite scalar quantization), two codebook-free schemes that simply snap each coordinate onto a fixed grid, so there is no learned codebook to collapse. You measure quality with reconstruction FID: encode real clips to tokens, decode them back, and compare the rebuilt frames against the originals, where a lower score means a closer match. The payoff is that good discrete video tokens let you generate video with the very same next-token machinery used for language.
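To make "snap each coordinate onto a fixed grid" concrete, here is a minimal sketch of both quantizers on a single latent vector, in plain Python. The function names, the example vector, and the level choices are illustrative assumptions, not taken from any MagViT-v2 codebase. The FSQ part assumes every per-dimension level count is odd, so the grid is symmetric around zero, and the sketch leaves out training details such as the straight-through gradient estimator and any auxiliary losses.

```python
import math


def fsq_quantize(z, levels):
    """FSQ-style rounding: squash each coordinate, then snap to an integer grid.

    Assumption for this sketch: every entry of `levels` is odd, so each
    dimension takes integer values in [-(L - 1) / 2, (L - 1) / 2].
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) // 2
        bounded = half * math.tanh(value)  # squash into [-half, half]
        codes.append(round(bounded))       # nearest grid point
    return codes


def fsq_index(codes, levels):
    """Pack the per-dimension grid coordinates into one token id (mixed radix)."""
    index, base = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index += (code + half) * base  # shift to 0..L-1 before packing
        base *= num_levels
    return index


def lfq_quantize(z):
    """LFQ-style binarization: keep only the sign of each coordinate."""
    return [1 if value > 0 else -1 for value in z]


def lfq_index(codes):
    """Read the sign pattern as a binary number; dimension i contributes bit i."""
    return sum(1 << i for i, code in enumerate(codes) if code > 0)


if __name__ == "__main__":
    z = [0.3, -2.0, 1.2]   # hypothetical encoder output for one position
    levels = [5, 5, 3]     # implicit FSQ vocabulary of 5 * 5 * 3 = 75 tokens

    fsq_codes = fsq_quantize(z, levels)
    print(fsq_codes, fsq_index(fsq_codes, levels))  # [1, -2, 1] 53

    lfq_codes = lfq_quantize(z)                     # implicit vocabulary of 2**3 = 8
    print(lfq_codes, lfq_index(lfq_codes))          # [1, -1, 1] 5
```

In both cases the vocabulary is implied by the grid rather than stored as a table of learned vectors: the product of the level counts for FSQ, and 2 to the number of latent dimensions for LFQ.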