KV Size Calculator

On a busy server it is the KV cache, not the model weights, that usually runs out of memory first.

Key Insight

This project turns the KV-cache size formula into code and sweeps its inputs — number of key/value heads, sequence length, and dtype — to plot how much GPU memory the KV cache eats as you serve more users at once. It makes clear why tricks like GQA (sharing key/value heads) and lower-precision storage matter so much.

Why This Matters

How many users a single GPU can serve at once is set mostly by KV-cache memory, not by the size of the weights. Being able to estimate that number on a napkin tells you whether a deployment will fit in HBM — before you commit money to hardware.

Key Insight​

Why This Matters​

Key Insight

Why This Matters