Best-of-N with a Reward Model

Generate many answers, then keep the one a judge likes best.

Key Insight

Best-of-N sampling draws N candidate answers and uses a reward model to score them, keeping the highest-scoring one. This project compares that approach against self-consistency majority voting on a math benchmark.
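The two selection rules being compared can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `best_of_n` and `majority_vote`, and the `toy_reward` scorer, are hypothetical, and the candidates are assumed to be already-extracted final answers. A real reward model would score the full response text.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)


def majority_vote(candidates: Sequence[str]) -> str:
    """Return the most frequent answer (earliest-seen answer wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return Counter(candidates).most_common(1)[0][0]


def toy_reward(answer: str) -> float:
    """Hypothetical stand-in for a reward model: it simply prefers "42"."""
    return 1.0 if answer == "42" else 0.0


# Four sampled final answers (N = 4); only one sample reaches "42".
samples = ["41", "41", "42", "40"]

picked_by_reward = best_of_n(samples, toy_reward)
picked_by_vote = majority_vote(samples)
```

In this toy case the two rules disagree: the vote follows the most common answer, while the scorer can select an answer that only one sample produced. Whether that helps depends entirely on how reliable the scorer is.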

Why This Matters

When a learned scorer recognizes a good answer more reliably than a plain vote does, Best-of-N picks winners that majority voting would miss, such as a correct answer that only a minority of samples reach. It is a simple way to spend extra inference compute on answer quality.
