Reward-Hacking Forensics
When the score goes up but the answers get worse, find out which part broke.
Key Insight
This project deliberately trains a reward-hacked model, then traces the failure back to its real source: the reward model, the KL penalty (β), or the rollout distribution. Forensics here means working backward from the broken behavior to the cause instead of guessing.
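To make the three suspects concrete, here is a minimal sketch of the KL-penalized reward commonly used in RLHF. It assumes the usual formulation (reward-model score minus β times a per-sample estimate of the KL to a reference policy). The project's actual objective may differ, and the function names and inputs are hypothetical. Each input maps to one suspect: `rm_score` comes from the reward model, `beta` is the KL penalty, and the log-probabilities are evaluated on whatever responses the rollouts happened to sample.

```python
from typing import Sequence


def sequence_kl_estimate(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
) -> float:
    """Single-sample estimate of KL(policy || reference) for one response.

    Sums the per-token log-ratio log pi(token) - log pi_ref(token) over the
    tokens the policy actually sampled. Hypothetical helper for illustration.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float,
) -> float:
    """Reward the optimizer sees: reward-model score minus beta * KL estimate."""
    kl = sequence_kl_estimate(policy_logprobs, ref_logprobs)
    return rm_score - beta * kl


if __name__ == "__main__":
    # Toy numbers, not measurements: same response, two values of beta.
    policy_lp = [-1.0, -0.5]
    ref_lp = [-1.5, -1.0]
    for beta in (0.0, 0.1):
        print(beta, shaped_reward(2.0, policy_lp, ref_lp, beta))
```

Under this assumed formulation, a β that is too small leaves the reward-model score nearly unconstrained, so flaws in the reward model can be exploited freely. A well-chosen β cannot help if the reward model itself scores bad answers highly on the responses the rollouts reach.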
Why This Matters
Reward hacking is one of the most common ways RLHF goes wrong, and the symptom rarely points straight at the cause. Diagnosing it systematically is what separates a frustrating week from a quick fix.