CPU/NVMe Offload
When the GPU runs out of room, move rarely touched cache blocks down to cheaper memory instead of dropping the request.
Key Insight
This project adds a second tier to the KV cache: when GPU HBM fills up, cold (rarely-used) cache blocks are moved out to CPU RAM and loaded back when a request needs them again. It measures the reload cost against the throughput gained on long-running sessions.
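A minimal sketch of the two-tier idea, not the project's actual implementation. It assumes that "cold" means least recently used, and it models both tiers as plain Python containers rather than real device and host memory. The class name `TieredKVCache`, its methods, and the `offloads`/`reloads` counters are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier block cache: a small "GPU" tier backed by a "CPU" tier.

    Assumption: the coldest block is the least recently used one.
    """

    def __init__(self, gpu_capacity_blocks: int):
        if gpu_capacity_blocks < 1:
            raise ValueError("gpu_capacity_blocks must be at least 1")
        self.gpu_capacity_blocks = gpu_capacity_blocks
        self.gpu = OrderedDict()  # block_id -> data, coldest block first
        self.cpu = {}             # block_id -> data offloaded from the GPU tier
        self.offloads = 0         # blocks moved GPU -> CPU
        self.reloads = 0          # blocks moved CPU -> GPU

    def _make_room(self):
        # Offload the coldest blocks until one GPU slot is free.
        while len(self.gpu) >= self.gpu_capacity_blocks:
            cold_id, cold_data = self.gpu.popitem(last=False)
            self.cpu[cold_id] = cold_data
            self.offloads += 1

    def put(self, block_id, data):
        # Write a block into the GPU tier, offloading cold blocks if needed.
        if block_id in self.gpu:
            self.gpu[block_id] = data
            self.gpu.move_to_end(block_id)  # mark as most recently used
            return
        self.cpu.pop(block_id, None)  # discard any stale offloaded copy
        self._make_room()
        self.gpu[block_id] = data

    def get(self, block_id):
        # Return a block, reloading it from the CPU tier if it was offloaded.
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:
            data = self.cpu.pop(block_id)
            self._make_room()
            self.gpu[block_id] = data
            self.reloads += 1  # each reload is where the latency cost lands
            return data
        raise KeyError(block_id)


if __name__ == "__main__":
    cache = TieredKVCache(gpu_capacity_blocks=2)
    for block_id in ("a", "b", "c"):
        cache.put(block_id, f"kv-{block_id}")
    cache.get("a")  # "a" was offloaded, so this triggers a reload
    print(cache.offloads, cache.reloads, list(cache.gpu), list(cache.cpu))
```

In a real system the two moves would be device-to-host and host-to-device copies, and the counters are the natural place to attach timing so that reload cost can be compared with the throughput gained.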
Why This Matters
Very long chats and agent sessions can accumulate more KV cache than fits in GPU memory. Tiering to cheaper, slower memory lets those sessions survive instead of being dropped. Every reload adds latency, though, so measuring the trade-off tells you when offload helps and when it hurts.