
Train a 10M-Parameter LM


The whole training loop fits on one screen — train it until none of it feels like magic.


Key Insight

A 10-million-parameter language model trained on a tiny Shakespeare file is the smallest honest pretraining run. The model is too small to be useful, which is the point: it is small enough to read every line of the loop and watch the next-token-prediction objective at work.
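To make the next-token-prediction objective concrete, here is a minimal sketch of how raw text becomes training examples. The short `text` string, the character-level vocabulary, and the helper name `next_token_pairs` are assumptions made for illustration; they stand in for the Shakespeare file and whatever tokenizer the actual run uses.

```python
# Hypothetical stand-in for the tiny Shakespeare file.
text = "to be or not to be"

# Character-level vocabulary (an assumption): each distinct character gets an id.
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = [stoi[ch] for ch in text]


def next_token_pairs(ids, block_size):
    """Return (context, target) windows; target is context shifted left by one."""
    pairs = []
    for start in range(len(ids) - block_size):
        context = ids[start:start + block_size]
        target = ids[start + 1:start + block_size + 1]
        pairs.append((context, target))
    return pairs


pairs = next_token_pairs(ids, block_size=4)
context, target = pairs[0]
print("context:", [vocab[i] for i in context])
print("target: ", [vocab[i] for i in target])
```

At every position the model is asked to predict the character that comes next, so a single window of length 4 yields four prediction problems at once.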

Why This Matters

Watching your own loss curve fall, on your own machine, removes much of the mystery from the field. The largest training runs are built on this same loop (forward pass, loss, backward pass, optimizer step), scaled up in data, parameters, and hardware.
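The four steps of the loop can be sketched in pure Python. This is not the 10M-parameter model: it is a character-level bigram table with hand-written gradients, chosen so that every step is visible without a framework. The corpus string, the learning rate, the step count, and the use of plain gradient descent are all assumptions for illustration; a real run would normally rely on an autodiff library and its optimizer.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in corpus; the real run reads the Shakespeare file.
text = "to be or not to be, that is the question"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = [stoi[ch] for ch in text]
V = len(vocab)

# The only parameters: a V x V table of logits, W[previous_token][next_token].
W = [[random.gauss(0.0, 0.1) for _ in range(V)] for _ in range(V)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(W, ids, lr):
    """Run one full-batch step and return the loss measured before the update."""
    n = len(ids) - 1
    grad = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for prev, nxt in zip(ids[:-1], ids[1:]):
        # Forward pass: logits for the previous token become probabilities.
        probs = softmax(W[prev])
        # Loss: average cross-entropy of the true next token.
        loss -= math.log(probs[nxt]) / n
        # Backward pass: d(loss)/d(logits) = probs - one_hot(target).
        for j in range(V):
            grad[prev][j] += (probs[j] - (1.0 if j == nxt else 0.0)) / n
    # Optimizer step: plain gradient descent on every parameter.
    for i in range(V):
        for j in range(V):
            W[i][j] -= lr * grad[i][j]
    return loss


losses = [train_step(W, ids, lr=1.0) for _ in range(200)]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss should sit near `log(V)`, the loss of a uniform guess over the vocabulary, and fall from there. Swapping the table for a transformer and the hand-written gradient for automatic differentiation changes the contents of each step, not the shape of the loop.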