Discrete Image Tokens

Key Insight

A VQ-VAE turns a picture into a small grid of whole-number codes (discrete image tokens) by forcing every patch to pick the nearest entry from a fixed codebook, the way a paint-by-numbers kit makes you choose from a numbered palette instead of mixing any color you like. That single move is what lets a transformer model images with the same next-token-prediction machinery it uses for text, which is the foundation of the any-to-any models in this phase. The "1024 tokens/image" target comes from laying the codes out as a 32×32 grid (32 × 32 = 1024). More tokens buy a sharper reconstruction but a longer sequence for any downstream model to read. Along the way you must watch for codebook collapse, where the model leans on only a handful of codes and leaves most of the palette unused.
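
To make the lookup concrete, here is a minimal sketch of the quantization step alone, in plain Python. It is not a VQ-VAE: there is no encoder, decoder, or training, and the latent vectors are random stand-ins for what an encoder would produce. The names quantize and codebook_usage, the latent dimension of 4, and the codebook size of 16 are all assumptions chosen for illustration; only the 32×32 grid comes from the text above.

```python
import random


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    "Nearest" here means smallest squared Euclidean distance.
    """
    tokens = []
    for vec in latents:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
        )
        tokens.append(nearest)
    return tokens


def codebook_usage(tokens, codebook_size):
    """Fraction of codebook entries used at least once.

    A value far below 1.0 is one rough warning sign of codebook collapse.
    """
    return len(set(tokens)) / codebook_size


# Hypothetical sizes, except GRID, which matches the 32x32 layout in the text.
GRID = 32            # 32 x 32 = 1024 tokens per image
DIM = 4              # assumed latent dimension (illustrative only)
CODEBOOK_SIZE = 16   # assumed number of codebook entries (illustrative only)

rng = random.Random(0)

# Random stand-ins: a real model learns the codebook and encodes the latents.
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(GRID * GRID)]

tokens = quantize(latents, codebook)  # flat sequence of 1024 integer codes
token_grid = [tokens[r * GRID:(r + 1) * GRID] for r in range(GRID)]  # 32 rows of 32
usage = codebook_usage(tokens, CODEBOOK_SIZE)

print(len(tokens), "tokens; codebook usage:", usage)
```

The flat tokens list is the form a transformer would read as a sequence, while token_grid keeps the spatial layout. Counting distinct codes is only a coarse check on collapse; it says which entries were touched, not how evenly they were used.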
