Train a Reward Model

Turn "humans liked this answer better" into a number a machine can chase.

Key Insight

This project trains a reward model on human preference pairs — for a given prompt, which of two answers a person preferred — and reports how often it agrees with people. The model learns to give a higher score to the response a human would have picked.

Why This Matters

A reward model converts messy, one-off human judgment into a reusable score that RLHF can optimize against millions of times. Its accuracy sets a ceiling on how good the aligned model can ever become.

Key Insight​

Why This Matters​

Key Insight

Why This Matters