Mini R1 Recipe

Reward only correct answers, and watch the model teach itself to reason.

Key Insight

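This project reproduces a small-scale version of the DeepSeek-R1 recipe: lightly fine-tune a base model on a few reasoning traces with supervised fine-tuning (SFT), then run GRPO (Group Relative Policy Optimization) on math problems, using a simple "is the answer correct?" verifier as the only reward. This setup is a form of reinforcement learning with verifiable rewards (RLVR).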
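To make the reward side of the recipe concrete, here is a minimal sketch of a correctness verifier and the group-relative advantage that GRPO derives from it. The function names (`extract_answer`, `correctness_reward`, `group_relative_advantages`) are hypothetical and not taken from this project, and the "last number in the completion" parsing rule is an assumption; a real setup would likely use a stricter answer format.

```python
import re
from statistics import mean, pstdev
from typing import List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number that appears in the completion, or None.

    Assumption: the final number in the text is the model's answer.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def correctness_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answer equals gold, else 0.0."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(gold) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each completion is scored relative to its siblings, so no separate
    value network is needed. eps avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for the prompt "What is 3 + 4?"
completions = [
    "3 + 4 = 7.",
    "Let me check: 3 + 4 is 8",
    "I am not sure.",
    "Wait, 3 + 4... the answer is 7",
]
rewards = [correctness_reward(c, "7") for c in completions]
advantages = group_relative_advantages(rewards)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, which is the signal the policy update then reinforces. When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.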

Why This Matters

With nothing but a correctness signal, the model develops long chain-of-thought habits on its own, such as backtracking and self-checking. This emergent behavior is the core discovery behind modern reasoning models.
