PPO RLHF Loop
Chase the reward, but stay tied to the model you started from.
Key Insight
This project wires SFT, a reward model, and PPO into a full RLHF loop and tracks two quantities during training: the reward as it climbs, and the KL divergence between the policy and the reference model. Deliberately lowering the KL penalty coefficient (β) lets the policy drift away from the reference and start reward hacking.
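To make the role of β concrete, here is a minimal sketch of KL-shaped rewards. It assumes the common formulation in which each token is penalized by β times the log-probability ratio between the policy and the reference model, and the reward-model score is added on the final token. The function name, arguments, and numbers are illustrative assumptions, not taken from the project's code.

```python
from typing import List


def kl_shaped_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    rm_score: float,
    beta: float,
) -> List[float]:
    """Return per-token rewards for one sampled response.

    Each token gets -beta * (log pi(token) - log pi_ref(token)), a
    per-token estimate of the KL penalty. The scalar reward-model score
    is added to the last token only. (Hypothetical helper for illustration.)
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not policy_logprobs:
        raise ValueError("response must contain at least one token")

    rewards = [
        -beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score
    return rewards


# Made-up log-probs for a 3-token response that the policy likes
# much more than the reference does.
policy_lp = [-1.0, -0.5, -0.2]
ref_lp = [-1.2, -1.5, -2.0]

with_penalty = sum(kl_shaped_rewards(policy_lp, ref_lp, rm_score=2.0, beta=0.1))
no_penalty = sum(kl_shaped_rewards(policy_lp, ref_lp, rm_score=2.0, beta=0.0))
print(f"total reward, beta=0.1: {with_penalty:.2f}")
print(f"total reward, beta=0.0: {no_penalty:.2f}")
```

With β = 0.1 the total drops below the raw reward-model score because the policy has moved away from the reference. With β = 0 that drift costs nothing, which is the regime in which reward hacking tends to appear.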
Why This Matters
PPO-based RLHF is the classic recipe used to train early chatbots to be both helpful and safe. Seeing the KL term hold the policy in check teaches one of the most important knobs in the alignment stack.