Multimodal DPO

Key Insight

DPO (Direct Preference Optimization) teaches a model to prefer better answers by training directly on pairs of (chosen, rejected) responses. It needs no separate reward model and no reinforcement-learning loop to babysit, which is exactly what makes it cheap enough to run on a small project. The multimodal twist is where the preference pairs come from: each pair consists of two VLM (vision-language model) answers to the same image and question, and a common reason one answer is "rejected" is hallucination, that is, confidently describing an object that isn't actually in the picture. Collecting even a few hundred such image-grounded preference pairs and fine-tuning with DPO measurably reduces that hallucination. The takeaway is that alignment for multimodal models is less about new algorithms than about preference data anchored to what the image really contains.
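To make the idea concrete, here is a minimal sketch of one image-grounded preference pair and the standard DPO loss for that pair. The `PreferencePair` field names, the example values, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library. The log-probabilities are assumed to be the summed token log-probabilities of each answer under the model being trained (the policy) and under a frozen reference copy.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One image-grounded preference example (field names are illustrative)."""
    image_path: str  # hypothetical path to the image both answers describe
    question: str
    chosen: str      # answer that matches what is in the image
    rejected: str    # e.g. an answer that hallucinates an object


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single pair, given summed answer log-probabilities."""
    # Implicit reward: how much more the policy favors an answer
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed as softplus(-margin) for stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical example pair: the rejected answer invents a dog.
pair = PreferencePair(
    image_path="images/kitchen.jpg",
    question="What is on the table?",
    chosen="A bowl of fruit and a mug.",
    rejected="A bowl of fruit, a mug, and a small dog.",
)

# When the policy equals the reference, the margin is zero and the
# loss is log(2); it falls as the policy shifts toward the chosen answer.
baseline = dpo_loss(-7.0, -7.0, -7.0, -7.0)
improved = dpo_loss(-5.0, -10.0, -7.0, -7.0)
```

Because the loss depends only on log-probability differences against the reference model, no reward model appears anywhere in the computation. In a real training run these log-probabilities would come from forward passes of the VLM on the image, question, and each answer.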