Run a VLM Evaluation Harness

Key Insight

Scoring a VLM by hand is hopeless: there are dozens of benchmarks, each with its own answer format, prompt wording, and scoring script. An evaluation harness such as lmms-eval or VLMEvalKit packages all of that, so a single command runs your model across many benchmarks (MMBench, MMMU, DocVQA, and more) under identical, version-pinned conditions. The real lesson of this project is that comparisons are only fair when every model sees the same prompt and is graded by the same parser. A small change in how a multiple-choice letter is extracted can swing a score by several points, so a shared harness is what makes one paper's numbers comparable to another's. Running it on an open VLM across six or more benchmarks turns "is this model good?" into a concrete, reproducible score table you can defend.
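To make the parser point concrete, here is a minimal sketch of how two answer-extraction rules can grade the same model outputs differently. The functions `parse_strict` and `parse_lenient`, and the five responses, are hypothetical illustrations; they are not taken from lmms-eval or VLMEvalKit, whose actual parsers differ per benchmark.

```python
import re

def parse_strict(response):
    """Hypothetical parser: trust only the first character of the response."""
    text = response.strip()
    return text[0] if text and text[0] in "ABCD" else None

def parse_lenient(response):
    """Hypothetical parser: take the first standalone A-D letter anywhere."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

def accuracy(responses, gold, parser):
    """Fraction of responses whose parsed letter equals the gold letter."""
    correct = sum(parser(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

# Made-up model outputs for five multiple-choice questions (illustrative only).
responses = ["B", "The answer is C.", "(A)", "D. Paris", "Answer: B"]
gold = ["B", "C", "A", "D", "B"]

strict_score = accuracy(responses, gold, parse_strict)
lenient_score = accuracy(responses, gold, parse_lenient)
print(f"strict:  {strict_score:.2f}")   # strict:  0.40
print(f"lenient: {lenient_score:.2f}")  # lenient: 1.00
```

The model's answers are identical in both runs; only the grading rule changed. Note that the strict rule also misreads "Answer: B" as "A", which is the kind of silent parsing error a shared, version-pinned harness keeps consistent across every model being compared.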