Multimodal DPO

Key Insight

DPO (Direct Preference Optimization) teaches a model to prefer better answers by training directly on pairs of (chosen, rejected) responses. There is no separate reward model and no reinforcement-learning loop to babysit, which is exactly what makes it cheap enough to run on a small project. The multimodal twist is where the preference pairs come from: each pair is two answers from a vision-language model (VLM) to the same image and question, and a common reason one answer is "rejected" is hallucination, that is, confidently describing an object that isn't actually in the picture. Collecting even a few hundred such image-grounded preference pairs and fine-tuning with DPO measurably cuts that hallucination, which suggests that alignment for multimodal models is less about new algorithms than about preference data anchored to what the image really contains.
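To make the "no reward model" point concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed log-probabilities of each answer have already been computed under the model being trained and under a frozen reference copy. The `PreferencePair` record, the `dpo_loss` function name, the example file name, and the default `beta=0.1` are illustrative assumptions, not something this note specifies.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical record for one image-grounded preference pair."""
    image_path: str   # the image both answers describe
    question: str     # the shared prompt
    chosen: str       # answer grounded in the image
    rejected: str     # e.g. an answer that hallucinates an object


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full answer given
    the image and question. beta (an assumed default here) scales how
    strongly the policy is pushed away from the reference.
    """
    # How much more the policy favours each answer than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = chosen_gain - rejected_gain

    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


pair = PreferencePair(
    image_path="kitchen.jpg",  # hypothetical file name
    question="What is on the table?",
    chosen="A bowl of fruit.",
    rejected="A bowl of fruit and a cat.",  # the cat is not in the picture
)

# Untrained policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-6.0, -8.0, -6.0, -8.0)

# Policy has shifted probability toward the grounded answer: loss drops.
improved = dpo_loss(-5.0, -9.0, -6.0, -8.0)
```

The only learned signal is the difference between how the policy and the reference score the two answers, so the preference pairs themselves carry all of the supervision. In an actual fine-tuning run the same expression would be computed on batched tensors in an autodiff framework so that gradients flow into the policy.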
