FP4 (Blackwell) Deployment
Four bits per weight, accelerated in hardware — if you can keep the quality.
Key Insight
This project benchmarks FP4 weights against FP8 on Blackwell GPUs, which accelerate FP4 in hardware, and reports three things: the throughput gain, any quality loss, and the operational gotchas of running such a new format.
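Before running full evaluations, one cheap way to get a first read on "quality loss" is to simulate the rounding each format applies and compare reconstruction error on synthetic weights. The sketch below is a simplification and not the project's method. It assumes FP4 is E2M1 with one scale per 16-weight block, and it keeps that scale in full precision, whereas hardware block formats store it in 8 bits. It also assumes FP8 is E4M3 with a single per-tensor scale. All function names are hypothetical. Weight-level error is only a proxy, and task accuracy still has to be measured on the real model.

```python
import math
import random

# Non-negative magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def round_to_fp4(x: float) -> float:
    """Round a value (already scaled toward [-6, 6]) to the nearest E2M1 value.

    Values beyond +/-6 saturate. Ties go to the smaller magnitude, a simplification
    (hardware typically rounds to nearest even).
    """
    mag = min(abs(x), FP4_MAX)
    nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def round_to_fp8_e4m3(x: float) -> float:
    """Round to the nearest E4M3 value: 3 mantissa bits, min normal exponent -6, saturating at 448."""
    mag = min(abs(x), FP8_E4M3_MAX)
    if mag == 0.0:
        return 0.0
    _, e = math.frexp(mag)           # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)             # subnormals share the 2**-6 binade spacing
    step = 2.0 ** (exp - 3)          # gap between adjacent representable values
    return math.copysign(round(mag / step) * step, x)  # round() is half-to-even


def quantize_fp4_blocks(weights, block_size=16):
    """Per-block absmax scaling to FP4, then dequantize.

    Assumption: the block scale stays in full precision here.
    """
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / FP4_MAX if absmax > 0 else 1.0
        out.extend(round_to_fp4(w / scale) * scale for w in block)
    return out


def quantize_fp8_tensor(weights):
    """One absmax scale for the whole tensor, round to E4M3, then dequantize."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0
    return [round_to_fp8_e4m3(w / scale) * scale for w in weights]


def relative_rms_error(ref, approx):
    """sqrt(sum of squared errors / sum of squared reference values)."""
    num = sum((r - a) ** 2 for r, a in zip(ref, approx))
    den = sum(r * r for r in ref)
    return math.sqrt(num / den)


if __name__ == "__main__":
    random.seed(0)
    w = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in for a weight matrix
    print("FP8 E4M3, per-tensor scale:", relative_rms_error(w, quantize_fp8_tensor(w)))
    print("FP4 E2M1, 16-wide blocks:  ", relative_rms_error(w, quantize_fp4_blocks(w)))
```

Even with fine-grained block scales, FP4 reconstruction error comes out well above FP8's on the same weights. That gap is why the quality side of the benchmark cannot be skipped.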
Why This Matters
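FP4 halves weight storage again relative to FP8. In memory-bound decoding, smaller weights mean fewer bytes read per generated token, and the freed memory can hold more KV cache, which is where the promised speed and concurrency gains come from. But 4-bit floating point sits close to the edge of usable precision: the E2M1 encoding can represent only eight non-negative magnitudes (0, 0.5, 1, 1.5, 2, 3, 4 and 6), so you must measure quality carefully before trusting it in production.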
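The "halves weight size" claim is easy to check with back-of-the-envelope arithmetic, but block-scaled FP4 also stores scale metadata. The sketch below assumes one 8-bit scale per 16 weights, the layout commonly described for NVIDIA's NVFP4; treat that as an assumption and check the format your runtime actually uses. Under that assumption FP4 costs 4.5 bits per weight, so the saving over FP8 is about 44% rather than a full 50%. The 70B model size and both helper functions are hypothetical, and per-tensor scales are ignored as negligible.

```python
GIB = 2 ** 30


def effective_bits(weight_bits: float, block_size: int = 0, scale_bits: int = 8) -> float:
    """Bits per weight including one scale_bits-wide scale per block (block_size 0 = no block scales)."""
    if block_size <= 0:
        return float(weight_bits)
    return weight_bits + scale_bits / block_size


def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, ignoring KV cache, activations and runtime overhead."""
    return n_params * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    n = 70e9  # hypothetical 70B-parameter model
    formats = {
        "FP16": effective_bits(16),
        "FP8 (per-tensor scale)": effective_bits(8),
        # Assumption: 8-bit scale per 16-weight block, as commonly described for NVFP4.
        "FP4 (8-bit scale per 16 weights)": effective_bits(4, block_size=16, scale_bits=8),
    }
    for name, bits in formats.items():
        print(f"{name:34s} {bits:5.2f} bits/weight  {weight_footprint_gib(n, bits):6.1f} GiB")
```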