Tiny Chameleon
Key Insight
Once an image is just a row of discrete tokens, you can splice it into a sentence and train a single transformer on the mixed stream with one ordinary next-token-prediction loss — the early-fusion recipe behind Chameleon and other native multimodal models. Interleaving image tokens with COCO caption text in one shared vocabulary means the model never sees a seam between "looking" and "reading"; it just predicts the next code, whether that code is a word-piece or a patch of pixels. The lesson is liberating: you do not need a separate vision tower or a special fusion module at all — tokenize everything and let one plain language-model objective do the work.