Skip to main content

Implement AdamW from Scratch


Adam remembers the past. AdamW forgets your weight.


Key Insight

AdamW is Adam with weight decay applied directly to the parameters rather than to the gradients. It also uses bias correction to counteract the zero-initialization of the first and second moment estimates, which would otherwise cause very small steps at the start of training.

Why This Matters

AdamW is the default optimizer for most modern language and vision models. Implementing it by hand makes the update rule concrete and clarifies why decoupled weight decay outperforms L2 regularization — a distinction that matters especially when training transformers.