DPO from Scratch
Get most of the benefit of RLHF without running any reinforcement learning.
Key Insight
This project implements DPO by hand and checks it against the reference loss in a library like TRL. DPO skips the reward model and the PPO loop entirely, collapsing preference learning into a single supervised loss on (chosen, rejected) answer pairs.
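A minimal sketch of that single loss, assuming the summed log-probability of each response has already been computed under both the policy and a frozen reference model. The function names and the default `beta=0.1` are illustrative choices, not TRL's API, and plain Python floats are used so the arithmetic stays visible; a real implementation would operate on batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss falls as the policy favors the chosen response more
    # strongly than the reference does, relative to the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


def dpo_batch_loss(pairs, beta: float = 0.1) -> float:
    """Mean DPO loss over an iterable of 4-tuples of log-probs."""
    losses = [dpo_loss(*pair, beta=beta) for pair in pairs]
    return sum(losses) / len(losses)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2 (about 0.693). That makes a convenient sanity check before comparing a hand-written loss against a library implementation.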
Why This Matters
DPO made alignment dramatically simpler: no reward model, no rollouts, just two models (the policy being trained and a frozen reference copy) and a loss. It became a default open-source recipe because it captures much of RLHF's benefit with a fraction of the moving parts.