Process Reward Model

Grade every step of the reasoning, not just the final answer.

Key Insight

A process reward model (PRM) scores each individual step in a model's reasoning, unlike an outcome reward model that judges only the final answer. This project trains a small PRM on the PRM800K dataset and uses it to re-rank generated solutions.

Why This Matters

Pinpointing the exact step where reasoning goes wrong gives a far sharper signal than a single right-or-wrong label on the final answer — which is why PRMs are strong scorers for Best-of-N and tree search.

Key Insight​

Why This Matters​

Key Insight

Why This Matters