AMP Speedup Study
Half the bits, twice the speed — and usually the same answer.
Key Insight
Automatic mixed precision runs most operations in 16-bit floats — float16 or bfloat16 — instead of float32, which halves memory traffic and unlocks fast Tensor Cores. float16 needs a GradScaler to avoid underflow; bfloat16 shares float32's range and does not.
Why This Matters
Mixed precision often gives a 2–3× speedup for one or two lines of code, with little or no loss in final accuracy — one of the highest-return changes you can make to a training script.