Train a 100M-Parameter LM


Ten times bigger, real web text, and a number to beat — the first run that feels like the real thing.


Key Insight

Scaling to 100 million parameters and training on a slice of real web text (OpenWebText) makes this the first pretraining run that behaves like a production one. The concrete goal, pushing validation loss below 3.5, turns "is it working?" into a measurable target.

Why This Matters

Real data, a real GPU, and a fixed time budget force the skill every practitioner needs: reading a loss curve to judge whether a run is healthy, stalled, or diverging — long before it finishes.
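
One way to make "healthy, stalled, or diverging" concrete is to compare the average loss over the two most recent windows of logged steps. The sketch below is only an illustration under assumed thresholds: the function name `classify_run` and the parameters `window`, `min_improvement`, and `blowup_ratio` are hypothetical and not part of this lesson, and sensible values depend on how noisy the run's loss is.

```python
import math


def classify_run(losses, window=50, min_improvement=0.01, blowup_ratio=1.2):
    """Label a run from its logged losses (hypothetical helper, assumed thresholds).

    Compares the mean loss of the most recent `window` entries with the mean
    of the `window` entries before them.
    """
    if len(losses) < 2 * window:
        return "too early"  # not enough history to compare two windows

    recent_losses = losses[-window:]
    # A NaN or infinite loss in the recent window is treated as divergence.
    if any(math.isnan(x) or math.isinf(x) for x in recent_losses):
        return "diverging"

    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(recent_losses) / window

    # Loss has risen well above its earlier level.
    if recent > previous * blowup_ratio:
        return "diverging"
    # Loss is flat, or slightly worse, compared with the earlier window.
    if previous - recent < min_improvement:
        return "stalled"
    return "healthy"


if __name__ == "__main__":
    steadily_falling = [5.0 - 0.01 * i for i in range(100)]
    flat = [3.6] * 100
    blown_up = [3.6] * 50 + [6.0] * 50
    for name, curve in [("falling", steadily_falling), ("flat", flat), ("blown up", blown_up)]:
        print(name, "->", classify_run(curve))
```

A windowed average is used here because per-step training loss is noisy; comparing single steps would flag ordinary jitter as a stall or a blow-up. The same check can be run on validation loss, which is logged less often, by choosing a smaller `window`.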