Sampling-Mode Rejection


Going faster must not make the model's answers any more or less random.


Key Insight

This project extends speculative decoding from greedy decoding to random sampling. It adds a rejection sampling step that accepts each draft token with a probability set by how likely the target model is to produce that token relative to the draft model, and draws a corrected replacement token whenever the draft is rejected. The final output is therefore statistically identical to sampling from the target model alone.
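The accept/reject step can be sketched as follows. This is a minimal illustration of the standard speculative sampling rule, not necessarily this project's code: the function name `accept_or_resample` and the plain-list representation of the distributions are assumptions made for the example. Here `p` is the target model's next-token distribution, `q` is the draft model's, and `token` was sampled from `q`.

```python
import random


def accept_or_resample(token, p, q, rng=random):
    """Accept a draft token or replace it (hypothetical helper for illustration).

    token: index sampled from the draft distribution q (so q[token] > 0).
    p, q:  target and draft probabilities over the same vocabulary.
    Returns (final_token, accepted).
    """
    # Accept with probability min(1, p[token] / q[token]).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True

    # Rejected: resample from the residual max(0, p - q).
    # A rejection implies p[token] < q[token], so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(p)), weights=residual)[0], False
```

Drawing `token` from `q` and passing it through this function should yield tokens distributed according to `p`, whichever draft distribution is used. A poorer draft only lowers the acceptance rate, and with it the speedup.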

Why This Matters

Without this careful accept/reject rule, speeding up generation would subtly distort the model's output distribution, making it more or less random than intended. The rejection step is what lets speculative decoding stay provably lossless even with temperature and top-p sampling turned on.
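The losslessness claim can be checked with a short calculation. The sketch below assumes the standard rule: a token drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution, and a rejected draft is replaced by a sample from max(0, p − q), renormalized. The symbols p, q, and β are notation introduced here for illustration.

```latex
\begin{align*}
\beta &= \sum_y \min\bigl(p(y),\, q(y)\bigr)
  \quad\text{(overall acceptance probability)}, \\
\sum_y \max\bigl(0,\, p(y) - q(y)\bigr)
  &= \sum_y \Bigl(p(y) - \min\bigl(p(y),\, q(y)\bigr)\Bigr) = 1 - \beta, \\
\Pr[\text{output} = x]
  &= \underbrace{q(x)\,\min\!\Bigl(1,\, \tfrac{p(x)}{q(x)}\Bigr)}_{\text{draft } x \text{ accepted}}
   + \underbrace{(1 - \beta)\,\frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{1 - \beta}}_{\text{draft rejected, } x \text{ resampled}} \\
  &= \min\bigl(p(x),\, q(x)\bigr) + \max\bigl(0,\, p(x) - q(x)\bigr) = p(x).
\end{align*}
```

When p and q coincide, β = 1, no rejection ever occurs, and the second term vanishes. For temperature and top-p, the same argument goes through as long as p and q denote the distributions after those transformations have been applied.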