FP8 Serving
Half the bits of bfloat16, much more throughput on modern GPUs.
Key Insight
This project converts a model's weights, activations, and KV cache to FP8 using NVIDIA's TransformerEngine, verifies that quality on a small benchmark suite holds up, and measures the latency improvement on a Hopper-class GPU.
Why This Matters
FP8 halves the memory and the bandwidth read for every parameter compared with bfloat16, and Hopper- and Blackwell-class GPUs have dedicated FP8 Tensor Cores, so the format is rapidly becoming the production default for new serving stacks — a near-free speedup when the hardware supports it.