Modality Survey
Key Insight
Almost every multimodal paper can be pinned down by two coordinates: its fusion point (where it joins the modalities, whether early, middle, or late) and its training objective, whether contrastive (pull matched pairs together and push mismatched pairs apart), masked (hide part of the input and predict the missing piece), or generative (predict the next token). Read through that lens, the field stops being a flood of unrelated names and becomes a small grid of recombinations: a dual encoder like CLIP sits at "late + contrastive," a vision-language model like LLaVA at "middle + generative," and an early-fusion model like Chameleon at "early + generative." Forcing yourself to read five real papers and write down those two coordinates for each builds the single most useful habit for staying oriented as new models arrive almost every week.
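One way to make the two-coordinate habit concrete is to keep the grid as a small data structure. The sketch below is only an illustration: the names `PaperCoordinates`, `build_grid`, `FUSION_POINTS`, and `OBJECTIVES` are hypothetical, and the three entries are just the placements given in the paragraph above, not an authoritative classification.

```python
from collections import defaultdict
from dataclasses import dataclass

# The two axes of the grid, as named in the text (hypothetical constants).
FUSION_POINTS = ("early", "middle", "late")
OBJECTIVES = ("contrastive", "masked", "generative")


@dataclass(frozen=True)
class PaperCoordinates:
    """One paper, pinned down by its fusion point and training objective."""
    name: str
    fusion: str
    objective: str

    def __post_init__(self):
        # Reject coordinates that are not on the grid.
        if self.fusion not in FUSION_POINTS:
            raise ValueError(f"unknown fusion point: {self.fusion!r}")
        if self.objective not in OBJECTIVES:
            raise ValueError(f"unknown objective: {self.objective!r}")


def build_grid(papers):
    """Group paper names by their (fusion, objective) cell."""
    grid = defaultdict(list)
    for paper in papers:
        grid[(paper.fusion, paper.objective)].append(paper.name)
    return dict(grid)


# Placements taken from the paragraph above.
PAPERS = [
    PaperCoordinates("CLIP", "late", "contrastive"),
    PaperCoordinates("LLaVA", "middle", "generative"),
    PaperCoordinates("Chameleon", "early", "generative"),
]

if __name__ == "__main__":
    grid = build_grid(PAPERS)
    for fusion in FUSION_POINTS:
        for objective in OBJECTIVES:
            names = grid.get((fusion, objective), [])
            if names:
                print(f"{fusion} + {objective}: {', '.join(names)}")
```

Adding each new paper you read as one more `PaperCoordinates` entry turns the exercise into a running table, and empty cells show which recombinations you have not yet encountered.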