Gradient Accumulation
Add up the gradients of several small batches, then step as if the batch were huge.
Key Insight
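Gradient accumulation runs several small batches (micro-batches), sums their gradients, and calls the optimizer's step only after a set number of them. Because the gradient of a sum of losses is the sum of the gradients, the accumulated result matches the gradient of one large batch, provided the losses are scaled consistently: with a mean-reduced loss, each micro-batch loss is divided by the number of accumulation steps. Only one micro-batch's activations sit in memory at a time. Layers that depend on batch statistics, such as batch normalization, still see only the small batch, so the match is not exact for them.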
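The equivalence can be checked numerically. The following is a minimal sketch, not a training recipe: it assumes a one-parameter linear model with a mean squared error loss, equal-sized micro-batches, and hand-written gradients in plain Python. The names grad_mse and accumulated_grad are hypothetical helpers introduced only for this illustration.

```python
import random


def grad_mse(w, b, xs, ys):
    """Gradient of the mean squared error of (w*x + b) vs. y over one batch."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def accumulated_grad(w, b, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is already a mean over its own examples,
    so it is divided by the number of accumulation steps to make the
    total a mean over the full batch.
    """
    if len(xs) % micro_batch_size != 0:
        raise ValueError("this sketch assumes equal-sized micro-batches")
    accum_steps = len(xs) // micro_batch_size
    gw_total, gb_total = 0.0, 0.0
    for start in range(0, len(xs), micro_batch_size):
        end = start + micro_batch_size
        gw, gb = grad_mse(w, b, xs[start:end], ys[start:end])
        gw_total += gw / accum_steps
        gb_total += gb / accum_steps
    return gw_total, gb_total


# Synthetic data (an assumption for the demo): y = 3x + 0.5 plus noise.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(32)]
ys = [3 * x + 0.5 + random.gauss(0, 0.1) for x in xs]
w, b, lr = 0.0, 0.0, 0.1

full = grad_mse(w, b, xs, ys)                               # one batch of 32
accum = accumulated_grad(w, b, xs, ys, micro_batch_size=8)  # four batches of 8

# The two gradients agree up to floating-point rounding.
assert all(abs(f - a) < 1e-9 for f, a in zip(full, accum))

# A single optimizer step (plain SGD here) after all micro-batches.
w -= lr * accum[0]
b -= lr * accum[1]
```

In this sketch the effective batch size is 32 (four accumulation steps of 8 examples each), yet each call to grad_mse only ever touches 8 examples. In a deep learning framework the same pattern usually appears as scaling each micro-batch loss by the number of accumulation steps before the backward pass, and clearing the gradients only after the optimizer step.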
Why This Matters
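It lets a small GPU train at a large effective batch size (micro-batch size times the number of accumulation steps), reproducing results that would otherwise need larger or additional GPUs. The trade-off is time: the micro-batches run one after another, so each optimizer step takes proportionally longer.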