
DPO from Scratch


Get the result of RLHF without running any of the reinforcement learning.


Key Insight

This project implements DPO by hand and checks it against the reference loss in a library like TRL. DPO skips the reward model and the PPO loop entirely, collapsing preference learning into a single supervised loss on (chosen, rejected) answer pairs.
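As a rough illustration of what that single supervised loss looks like, here is a minimal pure-Python sketch for one (chosen, rejected) pair. It assumes you have already computed the summed token log-probabilities of each answer under both the policy being trained and a frozen reference model. The function names, argument names, and the `beta=0.1` default are illustrative assumptions, not the API of TRL or any other library.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|)).
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    # How much more (or less) likely each answer is under the policy than under the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The loss falls as the policy favours the chosen answer over the rejected one,
    # measured relative to the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# A policy identical to the reference gives a margin of 0, so the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

In a real training loop the same arithmetic would run on batched tensors and be averaged over pairs; checking that a hand-written version matches a library's reference loss on identical inputs is the kind of comparison the project describes.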

Why This Matters

DPO made preference alignment much simpler: no reward model, no rollouts, just two models (the policy being trained and a frozen reference copy) and a loss. It became a default open-source recipe because it captures much of RLHF's benefit with a fraction of the moving parts.