Implement ViT From Scratch

Key Insight

A ViT treats an image as a short sentence whose "words" are square patches. You cut the picture into a grid, project each patch into a token (patchification), prepend a learnable CLS token, run the sequence through ordinary transformer blocks, and read the CLS token's final vector as the whole-image summary. Coding every piece yourself on CIFAR-10, which is small enough to train on a laptop, makes the key abstraction concrete: the transformer never "sees" a 2D picture, only a list of vectors, so the same block that reads words reads pixels once you hand it patches.
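
To make the "image becomes a list of vectors" step concrete, here is a minimal sketch in plain Python, not a definitive implementation. It assumes a CIFAR-10-sized input (32 x 32 pixels, 3 channels) stored as nested lists; the patch size of 4, the token width of 8, the random stand-in weights, and the helper names patchify, linear, and to_tokens are all illustrative choices rather than anything fixed by the text. Position embeddings and the transformer blocks themselves are left out.

```python
import random

def patchify(image, patch_size):
    """Cut an H x W x C image (nested lists) into flattened patch vectors,
    scanning the patch grid row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vec = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    vec.extend(image[y][x])  # all channels of this pixel
            patches.append(vec)
    return patches

def linear(vec, weight):
    """Project one vector with a (d_in x d_out) weight matrix, no bias."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]

def to_tokens(image, patch_size, weight, cls_token):
    """Patchify, project each patch to a token, and prepend the CLS token."""
    patch_tokens = [linear(p, weight) for p in patchify(image, patch_size)]
    return [list(cls_token)] + patch_tokens

rng = random.Random(0)
H, W, C = 32, 32, 3   # CIFAR-10 image shape
P, D = 4, 8           # assumed patch size and token width

# Random stand-ins for a real image and for learned parameters.
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in range(P * P * C)]
cls_token = [0.0] * D

tokens = to_tokens(image, P, weight, cls_token)
print(len(tokens), len(tokens[0]))  # 65 8
```

With these sizes the 32 x 32 image yields an 8 x 8 grid of 64 patches, each flattened to 4 * 4 * 3 = 48 numbers before projection, so the transformer would receive 65 tokens (64 patches plus CLS) of width 8. Nothing downstream of to_tokens needs to know the input was ever two-dimensional.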
