W4A8 Ablation
4-bit weights save memory; adding 8-bit activations saves compute — measure what each one really buys.
Key Insight
This project runs an ablation comparing weight-only 4-bit quantization against W4A8 — 4-bit weights plus 8-bit activations — measuring both answer quality and throughput (tokens per second).
Why This Matters
Quantizing weights mainly saves memory and bandwidth, while quantizing the activations too can speed up the actual math — but activations are harder to compress safely. Measuring the real quality-versus-speed trade-off tells you whether the extra step is worth it for your workload.