Disaggregated PoC

Let one machine read the prompt and another write the answer, and ship the cache between them.

Key Insight

This project builds a small proof-of-concept of disaggregated serving: one process runs prefill, a second runs decode, and the prefill process hands its KV cache to the decode process. It then measures the transfer overhead against a baseline that runs both phases in one process.
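The handoff and its timing can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `fake_prefill`, `send_cache`, `recv_cache`, and `measure_handoff` are hypothetical names, the cache is a zero-filled byte buffer rather than real tensors, and a thread on the other end of a local socket pair stands in for the separate decode process.

```python
import socket
import struct
import threading
import time


def fake_prefill(num_tokens: int, bytes_per_token: int) -> bytes:
    """Stand-in for prefill: returns a dummy KV cache of the requested size."""
    return bytes(num_tokens * bytes_per_token)


def send_cache(sock: socket.socket, cache: bytes) -> None:
    # Length-prefixed framing: 8-byte big-endian size, then the payload.
    sock.sendall(struct.pack("!Q", len(cache)))
    sock.sendall(cache)


def recv_exact(sock: socket.socket, size: int) -> bytes:
    # recv may return fewer bytes than requested, so loop until done.
    buf = bytearray(size)
    view = memoryview(buf)
    got = 0
    while got < size:
        count = sock.recv_into(view[got:])
        if count == 0:
            raise ConnectionError("peer closed before the full cache arrived")
        got += count
    return bytes(buf)


def recv_cache(sock: socket.socket) -> bytes:
    (size,) = struct.unpack("!Q", recv_exact(sock, 8))
    return recv_exact(sock, size)


def measure_handoff(num_tokens: int, bytes_per_token: int):
    """Time one prefill-to-decode cache transfer over a local socket pair."""
    cache = fake_prefill(num_tokens, bytes_per_token)
    prefill_side, decode_side = socket.socketpair()
    received = {}

    def decode_worker() -> None:
        # Assumption: a thread plays the role of the decode process.
        received["cache"] = recv_cache(decode_side)

    worker = threading.Thread(target=decode_worker)
    start = time.perf_counter()
    worker.start()
    send_cache(prefill_side, cache)
    worker.join()
    elapsed = time.perf_counter() - start
    prefill_side.close()
    decode_side.close()
    return elapsed, len(received["cache"])


if __name__ == "__main__":
    seconds, size = measure_handoff(num_tokens=1024, bytes_per_token=4096)
    print(f"moved {size} bytes in {seconds * 1e3:.2f} ms")
```

In the single-process baseline the decode step reads the cache in place, so the elapsed time reported here approximates the extra cost the split introduces. A real measurement would use two processes or machines and the actual tensor layout.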

Why This Matters

Prefill is compute-heavy while decode is memory-bandwidth-heavy, so giving each phase its own pool of GPUs lets you size hardware for each job independently. The proof-of-concept shows whether the cost of moving the cache between them is small enough to make that split worthwhile.
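A rough estimate of the cache size and transfer time helps frame that question before measuring anything. The sketch below uses the standard size formula for a transformer KV cache (keys and values, per layer, per KV head, per token). The model shape and the 25 GB/s link speed are illustrative assumptions, not figures from this project, and the function names are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_element: int = 2) -> int:
    """Size of the KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_element defaults to 2, which assumes fp16 or bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * num_tokens * bytes_per_element)


def transfer_seconds(num_bytes: int, link_bytes_per_second: float) -> float:
    """Ideal transfer time; ignores latency, framing, and serialization."""
    return num_bytes / link_bytes_per_second


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128,
    # with a 4096-token prompt.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          num_tokens=4096)
    print(size // 2**20, "MiB")  # 512 MiB
    # Hypothetical link speed of 25 GB/s.
    print(f"{transfer_seconds(size, 25e9) * 1e3:.1f} ms")  # 21.5 ms
```

Under these assumptions, a 4096-token prompt produces a 512 MiB cache that takes about 21.5 ms to move in the ideal case. Whether that is small depends on how it compares with the prefill time for the same prompt, which is what the proof-of-concept measures.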
