GenEval Run

Key Insight

Pretty pictures aren't the same as correct pictures. GenEval measures the difference by using an object detector to check whether a generated image actually contains the objects the prompt asked for, in the right number, color, and arrangement. Running an open text-to-image model through this benchmark surfaces concrete weaknesses (miscounting, swapped attributes, ignored spatial relations) that image-quality metrics like FID do not reveal, because FID compares feature statistics of generated and real images and never looks at the prompt. The takeaway is that compositional adherence is a separate axis from raw image quality, and you must measure it on purpose.
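
To make the idea concrete, here is a minimal sketch of the kind of check described above, written in Python. It is not GenEval's actual code: the `Detection` record, the helper names, and the hand-written detections are all hypothetical, and the sketch assumes some upstream detector has already produced a label, a dominant color, and a bounding box for each object. It only illustrates how count, color, and position can each be verified separately against what the prompt requested.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical record, not GenEval's own type)."""
    label: str                      # object class from the detector
    color: str                      # dominant color, assumed to be assigned upstream
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def count_ok(dets: List[Detection], label: str, expected: int) -> bool:
    """True if exactly `expected` objects of class `label` were detected."""
    return sum(d.label == label for d in dets) == expected


def color_ok(dets: List[Detection], label: str, color: str) -> bool:
    """True if at least one `label` object exists and all of them have `color`."""
    matches = [d for d in dets if d.label == label]
    return bool(matches) and all(d.color == color for d in matches)


def left_of(dets: List[Detection], a: str, b: str) -> bool:
    """True if every `a` object's box center lies left of every `b` object's."""
    def center_x(d: Detection) -> float:
        return (d.box[0] + d.box[2]) / 2

    a_objs = [d for d in dets if d.label == a]
    b_objs = [d for d in dets if d.label == b]
    if not a_objs or not b_objs:
        return False
    return max(center_x(d) for d in a_objs) < min(center_x(d) for d in b_objs)


# Prompt (illustrative): "two red apples to the left of a blue cup".
# Hand-written detections standing in for real detector output.
detections = [
    Detection("apple", "red", (10, 40, 60, 90)),
    Detection("apple", "green", (70, 40, 120, 90)),  # wrong color
    Detection("cup", "blue", (200, 30, 260, 100)),
]

results = {
    "count": count_ok(detections, "apple", 2),
    "color": color_ok(detections, "apple", "red"),
    "position": left_of(detections, "apple", "cup"),
}
image_correct = all(results.values())
print(results, image_correct)
```

In this toy case the count and position checks pass but the color check fails, so the image is scored as incorrect even though it could look perfectly good. Keeping the per-check results, rather than only the final verdict, is what lets a run report which kind of error a model makes most often.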
