DPO from Scratch

Get most of RLHF's benefit without running any reinforcement learning.

Key Insight

This project implements DPO by hand and checks it against the reference loss in a library like TRL. DPO skips the reward model and the PPO loop entirely, collapsing preference learning into a single supervised loss on (chosen, rejected) answer pairs.
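As a rough illustration of what that single supervised loss looks like, here is a minimal sketch of the standard DPO objective for one (chosen, rejected) pair, written in plain Python on scalar log-probabilities. The function and argument names are hypothetical, the inputs are assumed to be summed per-token log-probabilities of each answer under the trained policy and a frozen reference model, and beta=0.1 is only an illustrative value. A real implementation would operate on batched tensors and would be the thing to compare against a library's loss.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full answer
    under either the policy or the frozen reference model.
    """
    # How much more (or less) likely each answer is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between chosen and rejected, scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)


# Policy identical to the reference: the margin is zero, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen answer: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss starts at log(2) when the policy equals the reference and falls as the policy raises the chosen answer's likelihood relative to the rejected one, which is the behavior a hand-written version should reproduce before it is compared with a library.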

Why This Matters

DPO made alignment dramatically simpler: no reward model and no rollouts, just two models (the policy being trained and a frozen reference copy) and a loss. It became a default open-source recipe because it captures much of RLHF's benefit with a fraction of the moving parts.
