Skip to main content

Prompt Sensitivity Sweep


Same model, same questions, five wordings — five different scores.


Key Insight

This project holds the model and the benchmark fixed and changes only how the question is phrased, running a sweep over five prompt formats and plotting how far the accuracy moves.

Why This Matters

If rewording a prompt swings the score by ten points, then comparing two models is meaningless unless both are tested with the same template — a trap that has produced many bogus "model A beats model B" claims.