Arena Reproduction

Rank models the way chess ranks players: by who beats whom.

Key Insight

This project runs a small tournament in which five open models answer the same prompts, an LLM-as-judge picks the winner of each pairing, and the wins and losses are converted into Elo ratings. This is the arena style of evaluation.
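To make the win/loss-to-rating step concrete, here is a minimal sketch of a sequential Elo update. The starting rating of 1000, the K-factor of 32, and the function names are illustrative assumptions, not values taken from this project; the judge's verdicts are assumed to arrive as (winner, loser) pairs.

```python
# Minimal Elo sketch. BASE_RATING and K_FACTOR are assumed values,
# not settings from this project.
BASE_RATING = 1000.0
K_FACTOR = 32.0


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(matches, base=BASE_RATING, k=K_FACTOR):
    """Apply one Elo update per (winner, loser) pair, in the order given."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, base)
        r_l = ratings.setdefault(loser, base)
        # The winner scored 1; it gains more when the win was unexpected.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings
```

With two unrated models and a single verdict, `run_tournament([("model_a", "model_b")])` moves the winner to 1016 and the loser to 984, since the expected score between equal ratings is 0.5. The model names here are placeholders.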

Why This Matters

Pairwise "which is better?" comparisons are often more reliable than fixed-answer benchmarks for judging open-ended quality. The resulting ranking, however, can shift with the random seed and the choice of judge, which is something you only appreciate by reproducing it.
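One way a seed can move the ranking is through match order: sequential Elo updates are order-dependent, so the same set of verdicts can yield different ratings when shuffled. The sketch below illustrates only that one effect, under assumed values (base rating 1000, K-factor 32) and with hypothetical function names; it does not model judge choice or sampling differences in the models' answers.

```python
import random


def elo_ratings(matches, k=32.0, base=1000.0):
    """Sequential Elo over (winner, loser) pairs; k and base are assumed values."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, base)
        r_l = ratings.setdefault(loser, base)
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


def rating_spread(matches, seeds):
    """Replay the same verdicts in a seed-shuffled order and
    return each model's (lowest, highest) rating across seeds."""
    per_model = {}
    for seed in seeds:
        shuffled = list(matches)
        random.Random(seed).shuffle(shuffled)  # seed controls match order only
        for model, rating in elo_ratings(shuffled).items():
            per_model.setdefault(model, []).append(rating)
    return {model: (min(rs), max(rs)) for model, rs in per_model.items()}
```

Calling `rating_spread(matches, range(20))` on a list of verdicts gives a rough sense of how far each rating moves from ordering alone; a wide gap between the low and high values suggests the ranking should be reported with that uncertainty.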
