Contamination Probe

A perfect score means nothing if the model already saw the answers.

Key Insight

This project searches a model's pretraining data for exact and near-duplicate (via MinHash) copies of an evaluation set's questions, measuring how much contamination has leaked in.

Why This Matters

A model that memorized the test answers looks brilliant on the benchmark and fails in the real world, so checking for contamination is the only way to tell whether a high score reflects skill or leakage.

Key Insight​

Why This Matters​

Key Insight

Why This Matters