Cache-Aware Admission
Don't let a request in the door if there is no room left for its cache.
Key Insight
This project implements admission control that estimates how much KV cache a new request will need and refuses it when the GPU cannot fit it — then verifies the server never runs out of memory.
Why This Matters
If a scheduler admits more requests than the cache can hold, the whole server can crash with an out-of-memory error and drop everyone's work at once. Checking the projected cache size before admitting keeps the system stable even when traffic spikes past what it can handle.