Mini R1 Recipe
Reward only correct answers, and watch the model teach itself to reason.
Key Insight
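This project reproduces a small version of the DeepSeek-R1 recipe. First, lightly fine-tune a base model on a few reasoning traces (supervised fine-tuning, SFT). Then run Group Relative Policy Optimization (GRPO) on math problems, using a simple "is the answer correct?" verifier as the only reward. This setup is a form of reinforcement learning with verifiable rewards (RLVR).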
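The two pieces that make this recipe distinctive, a binary correctness verifier and GRPO's group-relative advantage, can be sketched in a few lines. This is a minimal illustration, not the project's actual code. The function names, the `\boxed{...}` answer format, the exact-string comparison, and the use of population standard deviation with a small `eps` are all assumptions made for the example.

```python
import re
from statistics import mean, pstdev
from typing import List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the text, or None.

    Assumption: the model is prompted to put its final answer in \\boxed{}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def correctness_reward(completion: str, gold: str) -> float:
    """Binary verifier: 1.0 if the extracted answer equals the gold answer."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each completion is scored relative to its siblings, so no learned
    value function is needed. eps avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled completions for one problem.
completions = [
    "6 * 7 = 42, so the answer is \\boxed{42}",
    "6 * 7 = 41, so the answer is \\boxed{41}",
    "I am not sure.",
    "Wait, let me recheck: 6 * 7 = 42. \\boxed{ 42 }",
]
rewards = [correctness_reward(c, "42") for c in completions]
advantages = group_advantages(rewards)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, so the policy update raises the probability of whatever reasoning led to a right answer. When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.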
Why This Matters
Given nothing but a correctness signal, the model develops long chain-of-thought habits such as backtracking and self-checking, even though none of these behaviors is rewarded directly. This emergence is the core discovery behind modern reasoning models.