Implement ViT From Scratch
Key Insight
A ViT treats an image as a short sentence whose "words" are square patches: you cut the picture into a grid, project each block into a token (patchification), prepend a learnable CLS token, and run the sequence through ordinary transformer blocks, then read the CLS token's final vector as the whole-image summary. Coding every piece yourself on CIFAR-10 — small enough to train on a laptop — makes the key abstraction concrete: the transformer never "sees" a 2D picture, only a list of vectors, so the same block that reads words reads pixels once you hand it patches.