VQ-VAE on CIFAR-10

Key Insight

A VQ-VAE is an autoencoder whose hidden code is forced to be discrete: instead of emitting arbitrary continuous numbers, the encoder must describe each patch of the image using entries chosen from a small fixed list called a codebook, like painting only with the colors in a numbered paint set. This project builds one on CIFAR-10 with a 256-entry codebook that turns each 32×32 image into an 8×8 grid of code indices, then decodes that grid back into pixels.

The 8×8 grid is a compression choice: the encoder shrinks the image 4× on each side (32 ÷ 4 = 8), so each code summarizes a 4×4 block of pixels and the whole picture becomes just 64 codes. A 16×16 grid would shrink only 2× per side, keeping 256 codes that preserve more detail but cost four times as many positions to store and later generate. (The grid size, meaning the number of code positions, is a separate knob from the 256-entry codebook, which sets how many distinct values each position may take.)

By plotting how often each codebook entry is used and comparing the rebuilt images to the originals, you can see how a tiny vocabulary of learned patterns is enough to reconstruct whole pictures. The trick that makes training possible is the straight-through estimator, which passes gradients past the non-differentiable lookup as if the lookup were the identity.
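
The grid arithmetic and the codebook lookup described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the project's actual implementation: the embedding width CODE_DIM, the random codebook, and the random stand-in "encoder outputs" are all assumptions made for the example, and the helper names are hypothetical.

```python
import random
from collections import Counter

IMAGE_SIDE = 32      # CIFAR-10 images are 32x32 pixels
DOWNSAMPLE = 4       # the encoder shrinks each side 4x
CODEBOOK_SIZE = 256  # number of entries in the codebook
CODE_DIM = 8         # assumed embedding width (not stated in the text)

GRID_SIDE = IMAGE_SIDE // DOWNSAMPLE   # 32 / 4 = 8
NUM_POSITIONS = GRID_SIDE * GRID_SIDE  # 8 * 8 = 64 code positions


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((v - e) ** 2 for v, e in zip(vector, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(encoder_outputs, codebook):
    """Map each encoder output vector to a code index, in row-major order."""
    return [nearest_code(vec, codebook) for vec in encoder_outputs]


rng = random.Random(0)

# Assumed stand-ins: a random codebook and random "encoder outputs",
# one vector per grid position. A trained model would learn both.
codebook = [[rng.gauss(0, 1) for _ in range(CODE_DIM)]
            for _ in range(CODEBOOK_SIZE)]
encoder_outputs = [[rng.gauss(0, 1) for _ in range(CODE_DIM)]
                   for _ in range(NUM_POSITIONS)]

indices = quantize(encoder_outputs, codebook)

# Reshape the flat list of 64 indices into the 8x8 grid of codes.
index_grid = [indices[r * GRID_SIDE:(r + 1) * GRID_SIDE]
              for r in range(GRID_SIDE)]

# Count how often each codebook entry is used (the data behind a usage plot).
usage = Counter(indices)

# The decoder would receive the looked-up vectors, not the raw encoder outputs.
quantized = [codebook[i] for i in indices]
```

The lookup in nearest_code is an argmin, so it has no useful gradient. In an autograd framework the straight-through estimator is commonly written as z_e + stop_gradient(z_q - z_e): the forward pass sees the quantized vector z_q, while the backward pass treats the lookup as the identity on the encoder output z_e. That step is left out here because it needs a tensor library.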
