Speculative Decoding

Let a small model guess and the big one only check.

Key Insight

This project pairs a 1B draft model with a 7B target model for speculative decoding: the draft cheaply proposes a few tokens ahead at each step, the target verifies them all in a single parallel pass, and the longest matching prefix is kept. The project then measures the acceptance rate and the wall-clock speedup.

Why This Matters

Single-token decoding is the slowest part of serving an LLM because it cannot be batched along the time axis — each new token has to wait for the one before it to be sampled before it can even start, like cars in a single-lane tunnel where the next car cannot enter until the one ahead clears the gate. (Batching across requests adds more cars in parallel lanes; batching along the time axis would mean letting cars in the same lane move at once, which the chain of dependencies forbids.) Letting a cheap helper propose several tokens that the big model verifies in parallel sidesteps this bottleneck and routinely gives a 2–4× speedup with no quality loss, which is why speculative decoding is now standard in production stacks.

Key Insight​

Why This Matters​

Key Insight

Why This Matters