PPO RLHF Loop

Chase the reward, but stay tied to the model you started from.

Key Insight

This project wires SFT, a reward model, and PPO into a full RLHF loop and tracks two quantities during training: the reward, which should climb, and the KL divergence between the policy and the reference model. Deliberately lowering the KL penalty coefficient (β) lets the policy drift far enough from the reference that it starts reward hacking.
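
To make the role of β concrete, here is a minimal sketch of one common way the KL penalty is folded into the reward that PPO optimizes. It assumes a per-token penalty of β times the log-probability gap between the policy and the reference model, with the reward-model score added on the final token. The function name `kl_shaped_rewards` and all numbers are hypothetical and chosen for illustration; the project's actual implementation may differ.

```python
from typing import List


def kl_shaped_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    rm_score: float,
    beta: float,
) -> List[float]:
    """Return per-token rewards for one sampled response.

    Each token is penalized by beta * (log pi(token) - log pi_ref(token)),
    a sample-based estimate of the KL divergence from the reference model.
    The scalar reward-model score is added to the last token only.
    (Hypothetical helper for illustration, not the project's code.)
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and the same length")

    rewards = [
        -beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score
    return rewards


# Made-up log-probs: the policy is far more confident than the reference,
# i.e. it has drifted away from the model it started from.
policy_lp = [-0.2, -0.1, -0.3]
ref_lp = [-1.0, -0.9, -1.5]
rm_score = 2.0

for beta in (0.0, 0.1, 1.0):
    total = sum(kl_shaped_rewards(policy_lp, ref_lp, rm_score, beta))
    print(f"beta={beta:.1f}  total shaped reward={total:.2f}")
```

With β = 0 the drift costs nothing, so only the reward-model score matters. As β grows, the same response earns less total reward, which is the pressure that keeps the policy tied to the reference model.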

Why This Matters

PPO-based RLHF is the classic recipe behind the first widely used chatbots tuned to be both helpful and safe. Watching the KL term hold the policy in check builds intuition for one of the most important knobs in the alignment stack.
