Pre-Norm vs Post-Norm

Where you place the normalization decides whether training is smooth or blows up.

Key Insight

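Pre-norm puts the normalization step inside each residual branch, x = x + Attn(Norm(x)), while post-norm normalizes after the residual is added, x = Norm(x + Attn(x)). In pre-norm the skip path carries x through unchanged, so gradients reach the early layers directly. Pre-norm usually trains stably even without learning-rate warmup; post-norm often needs warmup and can diverge without it.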
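To make the two placements concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a real transformer layer: toy_sublayer is a hypothetical stand-in for attention, layer_norm has no learned scale or shift, and all function names are made up for this example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP: any vector-to-vector map."""
    return [0.5 * v + 1.0 for v in x]

def pre_norm_block(x, sublayer=toy_sublayer):
    # x + Sublayer(Norm(x)): only the branch input is normalized,
    # so x itself travels along the skip path untouched.
    branch = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, branch)]

def post_norm_block(x, sublayer=toy_sublayer):
    # Norm(x + Sublayer(x)): the whole sum, skip path included,
    # is renormalized on every block.
    branch = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, branch)])

x = [10.0, -4.0, 2.5, 7.5]
pre_out = pre_norm_block(x)
post_out = post_norm_block(x)
```

In this sketch pre_out still contains the original x plus a bounded correction, while post_out has been rescaled to zero mean and unit variance, so the raw input no longer passes through unchanged. Stacking many blocks compounds that difference, which is one common explanation for why post-norm is more sensitive to the learning rate early in training.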

Why This Matters

Most modern transformers use pre-norm for this reason. Training a pre-norm and a post-norm model that are otherwise identical, each with and without warmup, turns an abstract design rule into something you have watched succeed and fail with your own eyes.
