Multi-Head Attention
Run attention many times in parallel, each head free to focus on a different kind of relationship.
Key Insight
Multi-head attention splits the model dimension into several heads, runs attention in each one independently, then concatenates the results and projects them back. Grouped-Query Attention (GQA) is a small twist: several query heads share one set of keys and values to shrink the KV cache.
Why This Matters
Every production transformer uses multi-head attention, and nearly every model since 2024 uses GQA to serve faster. Verifying your version against nn.MultiheadAttention confirms you have the tensor reshapes exactly right — the part that is easy to get subtly wrong.