MMLU Re-run

Before you trust a leaderboard number, reproduce it yourself.

Key Insight

This project runs an open model through MMLU — a 57-subject multiple-choice benchmark — scores it per category, and checks the total against the number the model's authors published.

Why This Matters

Reproducing a known score teaches you that small choices — the prompt format, the sampling settings, how you parse the model's letter answer — can move a benchmark result by several points, so a single number means little without the setup that produced it.

Key Insight​

Why This Matters​

Key Insight

Why This Matters