Prefix-Cache Study


Stop redoing the same long system prompt for every request.


Key Insight

This project runs the same workload twice against a vLLM server, once with prefix caching enabled and once with it disabled, then compares tail latency and time to first token (TTFT) for requests that share a long system message.
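The comparison step can be sketched as follows. This is a minimal illustration, not the project's actual harness: it assumes the per-request TTFT values (in seconds) from each run have already been collected into two lists, and the names `percentile` and `compare_runs` are hypothetical. It uses a nearest-rank percentile, which is one of several common definitions.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of `samples`; q is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so integer q and len() stay exact.
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compare_runs(ttft_cache_on, ttft_cache_off):
    """Summarize median and tail TTFT for the two runs (hypothetical helper)."""
    report = {}
    for label, q in (("p50", 50), ("p99", 99)):
        on = percentile(ttft_cache_on, q)
        off = percentile(ttft_cache_off, q)
        # speedup > 1 means the cached run reached the first token sooner.
        report[label] = {"on": on, "off": off, "speedup": off / on}
    return report
```

Reporting p99 alongside the median matters here because the tail is where requests that miss the cache, or queue behind long prefills, tend to show up.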

Why This Matters

Much real traffic consists of a long shared system prompt followed by a short user turn. Caching the keys and values computed for that prefix means the boilerplate is processed only once and then reused across many users, which is a quiet but substantial throughput win whenever your users' prompts begin with the same fixed text.
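A back-of-the-envelope model makes the saving concrete. The sketch below is an idealized count of prefill tokens, not a measurement. It assumes every request hits a warm cache after the first one, that nothing is evicted, and that reuse happens in whole blocks of `block_size` tokens (the default of 16 is an assumption for illustration). The function name `prefill_tokens` is hypothetical.

```python
def prefill_tokens(num_requests, prefix_len, user_len, cache=False, block_size=16):
    """Idealized count of prompt tokens the server must prefill."""
    if num_requests <= 0:
        return 0
    per_request = prefix_len + user_len
    if not cache:
        # Every request reprocesses the full prompt.
        return num_requests * per_request
    # Only whole blocks of the shared prefix are assumed reusable.
    reusable = (prefix_len // block_size) * block_size
    # The first request pays in full; later ones skip the reusable part.
    return per_request + (num_requests - 1) * (per_request - reusable)


# 100 requests, 2000-token system prompt, 50-token user turn
print(prefill_tokens(100, 2000, 50))              # 205000
print(prefill_tokens(100, 2000, 50, cache=True))  # 7000
```

Under these assumptions the cached run prefills roughly 3% of the tokens of the uncached run. Real gains depend on cache hit rate, eviction, and how much of the total latency is prefill rather than decoding.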