Skip to main content

KV Cache From Scratch


The fastest way to understand the KV cache is to delete it, watch decode crawl, then add it back.


Key Insight

This project bolts a simple, contiguous KV cache onto a small transformer and checks that generating with the cache produces exactly the same tokens as generating without it — bit-for-bit. The cache stores the attention keys and values from earlier tokens, so each decode step only has to compute them for the single new token.

Why This Matters

Without a cache, every new token re-reads and re-computes the entire prompt, so generation gets slower the longer it runs. Building the cache yourself — and proving it changes speed but not output — is the cleanest way to trust that this optimization is safe before you rely on it in a real serving engine.