GQA Ablation

Fewer key/value heads means a smaller cache — the cheapest way to make a model serve faster.

Key Insight

Multi-head attention gives every query head its own keys and values; Multi-Query Attention (MQA) shares a single key/value head across all of them; Grouped-Query Attention (GQA) sits in between. Fewer key/value heads means a smaller KV cache, at some cost to quality.

Why This Matters

The KV cache dominates memory at serving time, so this trade-off decides how many users a model can handle at once. Training matched 100M models with MHA, GQA, and MQA lets you measure the cache savings against the validation-loss cost directly.

Key Insight​

Why This Matters​

Key Insight

Why This Matters