Skip to main content

Stateful Session API


Think of a multi-turn conversation like keeping a notebook open instead of erasing the whiteboard after every question. In AI terms, it's just a KV cache you refuse to throw away.


Key Insight

This project builds a session API that keeps a conversation's KV cache alive across separate calls, so each new turn reuses the work of every previous turn instead of re-reading the whole history. Because GPU memory is finite, it must also evict idle sessions under pressure — and the project verifies the cache-hit rate to confirm warm sessions are actually being reused.

Why This Matters

In long agent and chat sessions the history grows every turn, and re-running prefill over the full transcript each time is both slow and expensive. Treating the cache as a session-lifetime resource — created, reused, and evicted on demand — is what makes multi-turn, multi-tenant systems both fast and stable under load.