KV Cache From Scratch

The fastest way to understand the KV cache is to delete it, watch decode crawl, then add it back.

Key Insight

This project bolts a simple, contiguous KV cache onto a small transformer and checks that generating with the cache produces exactly the same tokens as generating without it — bit-for-bit. The cache stores the attention keys and values from earlier tokens, so each decode step only has to compute them for the single new token.

Why This Matters

Without a cache, every new token re-reads and re-computes the entire prompt, so generation gets slower the longer it runs. Building the cache yourself — and proving it changes speed but not output — is the cleanest way to trust that this optimization is safe before you rely on it in a real serving engine.

Key Insight​

Why This Matters​

Key Insight

Why This Matters