Memory Breakdown

Training memory is four buckets — know which one overflows.

Key Insight

Training memory has four parts: the model parameters, their gradients, the optimizer state (Adam keeps two extra values per parameter), and the activations saved for the backward pass. Summing these predicts usage, which torch.cuda.memory_summary() then confirms.

Why This Matters

When you hit out-of-memory, knowing which bucket is largest tells you which lever to pull — a smaller batch, gradient checkpointing, or a lighter optimizer — instead of guessing.

Key Insight​

Why This Matters​

Key Insight

Why This Matters