Activation Checkpointing Study

Throw activations away on the way forward, recompute them on the way back — trade compute for memory.

Key Insight

Activation checkpointing discards the intermediate activations saved during the forward pass and recomputes them during the backward pass. This study runs the same model with and without it, measuring the memory saved against the extra step time it costs.

Why This Matters

For long sequences, activations — not weights — often dominate training memory. Spending a little recompute to reclaim a lot of memory is what lets you fit bigger batches or longer contexts onto the same GPU.

Key Insight​

Why This Matters​

Key Insight

Why This Matters