LLaVA from Scratch

Key Insight

This project rebuilds the LLaVA recipe by hand: bolt a frozen CLIP-ViT image encoder onto a frozen 1–3B-parameter LLM using nothing but a small projector (a single linear layer or a two-layer MLP) that maps each image patch's feature vector into the LLM's word-embedding space. A small 1–3B LLM is the deliberate pick over a larger one: it is fluent enough to caption yet light enough to train on a single GPU, and because both big networks stay frozen, the only weights that learn are the ones in that thin bridge. That is why stage-1 alignment on COCO captions is cheap and stable: you are just teaching the projector to aim image features at the right words, not retraining a VLM end to end.
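To make the "thin bridge" concrete, here is a minimal PyTorch sketch of the projector and the freezing step. It is an illustration under stated assumptions, not the project's actual code: the dimensions (1024 for vision features, 2048 for LLM embeddings, 256 patches), the `Projector` class, the `freeze` helper, and the tiny stand-in modules used in place of CLIP-ViT and the LLM are all hypothetical, chosen only so the snippet runs without downloading weights.

```python
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, use_mlp: bool = True):
        super().__init__()
        if use_mlp:
            # Two-layer MLP variant.
            self.net = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
        else:
            # Single linear layer variant.
            self.net = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(patch_features)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients for every parameter and switch to eval mode."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


# Illustrative sizes only (assumptions, not taken from the project).
VISION_DIM, LLM_DIM, NUM_PATCHES, VOCAB = 1024, 2048, 256, 1000

# Hypothetical stand-ins for the frozen CLIP-ViT encoder and the frozen
# LLM's token-embedding table; a real run would load pretrained models here.
vision_encoder = freeze(nn.Linear(VISION_DIM, VISION_DIM))
token_embedding = freeze(nn.Embedding(VOCAB, LLM_DIM))

projector = Projector(VISION_DIM, LLM_DIM)  # the only trainable module

patches = torch.randn(2, NUM_PATCHES, VISION_DIM)  # fake patch inputs
with torch.no_grad():
    feats = vision_encoder(patches)                 # frozen, no gradients
image_tokens = projector(feats)                     # (2, 256, 2048)
text_tokens = token_embedding(torch.randint(0, VOCAB, (2, 12)))

# Projected image patches are prepended to the caption's word embeddings,
# and the combined sequence is what the frozen LLM would consume.
llm_inputs = torch.cat([image_tokens, text_tokens], dim=1)

# Only the projector's parameters are handed to the optimizer.
trainable = sum(p.numel() for p in projector.parameters())
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
print(llm_inputs.shape, trainable)
```

With these assumed sizes the MLP projector has roughly 6.3M parameters, a small fraction of a percent of a 1–3B-parameter LLM, which is the sense in which stage-1 alignment trains only the bridge. The learning rate is a placeholder, not a tuned value.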