Caption Ablation

Key Insight

This is a controlled ablation: train two otherwise-identical small text-to-image models that differ in exactly one thing. One sees the original web alt-text; the other sees synthetic captions rewritten by a vision-language model (VLM). Because everything else is held fixed, any quality gap can be attributed to the captions alone. The recaptioned model is expected to follow prompts noticeably better, which would be an open-source confirmation of the recaptioning trick behind DALL·E 3's compositional skill. The lesson is counter-intuitive: improving the captions often beats improving the model.
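The "differ in exactly one thing" constraint can be enforced in code rather than by convention. The following is a minimal sketch, not the project's actual setup: `RunConfig`, its fields and default values, and the column names `alt_text` and `vlm_caption` are all hypothetical placeholders. It derives the second run from the first and checks that only the caption source changed.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical config; every field name and value is an assumption.
    caption_field: str          # dataset column the dataloader reads captions from
    seed: int = 0
    train_steps: int = 100_000
    batch_size: int = 256
    learning_rate: float = 1e-4
    image_size: int = 256


def differing_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Return the names of all fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


def pick_caption(record: dict, cfg: RunConfig) -> str:
    """Select the caption this run trains on from a dataset record."""
    return record[cfg.caption_field]


# Baseline run: original web alt-text.
baseline = RunConfig(caption_field="alt_text")

# Ablation run: copy the baseline and change only the caption source.
recaptioned = replace(baseline, caption_field="vlm_caption")

# Guard: the two runs must differ in the caption source and nothing else.
assert differing_fields(baseline, recaptioned) == ["caption_field"]

# Illustrative record (made up): same image, two caption columns.
record = {
    "image_path": "img_000001.jpg",
    "alt_text": "IMG_2041 summer sale",
    "vlm_caption": "A red bicycle leaning against a white brick wall.",
}
```

Deriving the second config with `replace` instead of writing it out by hand means a hyperparameter changed later in the baseline carries over to the ablation run, and the assertion fails loudly if the two ever drift apart.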
