Prefix-Cache Study

Stop redoing the same long system prompt for every request.

Key Insight

This project runs the same workload twice on a vLLM server — once with the prefix cache turned on and once with it turned off — then compares the tail latency and the TTFT for requests that share a long system message.

Why This Matters

Real traffic is mostly a long shared system prompt followed by a short user turn, so caching the keys and values for that prefix means the boilerplate is processed only once across many users — a quiet but huge throughput win whenever the prompts your users send have a fixed beginning.

Key Insight​

Why This Matters​

Key Insight

Why This Matters