Gradient Checkpointing

Trade compute for memory to train bigger models.

Key Insight

Normally, PyTorch saves all intermediate activations during the forward pass to use them in the backward pass. Gradient checkpointing discards most of these activations to save memory, and simply recomputes them on-the-fly during the backward pass when they are needed.

Why This Matters

Memory is often the biggest bottleneck in deep learning. Checkpointing allows you to train significantly larger models or use larger batch sizes on a single GPU, making it a critical technique for scaling up.

Key Insight​

Why This Matters​

Key Insight

Why This Matters