Workload Sensitivity

Speculation flies on predictable text and stalls on surprising text.

Key Insight

This project measures speculative-decoding speedup across four very different workloads (chat, code completion, summarization, and constrained JSON output) and explains why the acceptance rate, and therefore the speedup, varies so much between them.

Why This Matters

The predictability of the next token determines how often the draft model guesses right, so the same system can see a 3× speedup on copy-heavy code yet barely 1.3× on open-ended chat. Measuring your workload's acceptance rate before you promise a latency number keeps you from over-claiming.
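
The link between acceptance rate and speedup can be made concrete with a rough model. The sketch below is not this project's measurement code; it assumes the common simplification that each drafted token is accepted independently with probability alpha, that the draft proposes gamma tokens per step, and that one draft-model pass costs draft_cost times one target-model pass. The names expected_tokens_per_step, expected_speedup, alpha, gamma, and draft_cost are chosen here for illustration, and the alpha values in the loop are placeholders rather than measured results.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass.

    Assumes each of the gamma drafted tokens is accepted independently
    with probability alpha (a simplification, not a measured property).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target pass.
        return float(gamma + 1)
    # Geometric series 1 + alpha + ... + alpha**gamma in closed form.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding under the same assumptions.

    draft_cost is the assumed cost of one draft pass relative to one
    target pass, so a step costs gamma * draft_cost + 1 target passes.
    """
    if draft_cost < 0.0:
        raise ValueError("draft_cost must be non-negative")
    step_cost = gamma * draft_cost + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Placeholder acceptance rates, not results from this project.
    for alpha in (0.5, 0.7, 0.9):
        estimate = expected_speedup(alpha, gamma=4, draft_cost=0.05)
        print(f"alpha={alpha:.1f}  estimated speedup={estimate:.2f}x")
```

Under this model the estimate rises steeply as alpha approaches 1, which is consistent with predictable workloads gaining far more than open-ended ones. Real acceptance is not independent from token to token, so treat the output as a sanity check against a measured number, not a replacement for one.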
