Mini LAION Pipeline

Key Insight

For a multimodal model, the data matters more than the architecture, and raw web data like LAION is mostly unusable straight out of the crawl: it is full of duplicates, alt-text that has nothing to do with the picture, and tiny or broken images. A practical pipeline chains cheap filters in a fixed order so that each one shrinks the work for the next. First, drop repeated images (deduplication). Second, keep only image–caption pairs that a CLIP model scores as a good match (the CLIP-score filter, which is the Phase 3 data-filtering trick reused at web scale). Third, rewrite the weak captions into rich descriptions with a VLM (synthetic captions). Finally, pack the survivors into streaming WebDataset shards. The lesson that building it drives home: you throw away 80–90% of a web crawl, and the clean 10–20% that remains trains a far better model than the whole noisy pile ever could.
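The four stages can be sketched as a chain of small generators. This is a minimal illustration under stated assumptions, not a reference implementation: the `Pair` record, the function names, the `0.3` threshold, and the `min_words` rule for spotting a "weak" caption are all hypothetical choices made for the example. Embeddings are assumed to be precomputed by a CLIP-style model, the `captioner` callable stands in for a VLM, deduplication is exact-byte only, and images are assumed to be JPEG. Shards are written as plain tar archives in which files sharing a basename form one sample, which is the layout WebDataset reads.

```python
import hashlib
import io
import math
import tarfile
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Sequence


@dataclass
class Pair:
    image: bytes               # raw image bytes (assumed JPEG)
    caption: str               # alt-text from the crawl
    img_emb: Sequence[float]   # precomputed image embedding (assumption)
    txt_emb: Sequence[float]   # precomputed caption embedding (assumption)


def dedup(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Stage 1: drop exact byte-level duplicates (no near-duplicate detection)."""
    seen = set()
    for p in pairs:
        digest = hashlib.sha256(p.image).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield p


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def clip_filter(pairs: Iterable[Pair], threshold: float = 0.3) -> Iterator[Pair]:
    """Stage 2: keep pairs whose image/text similarity clears the threshold."""
    for p in pairs:
        if cosine(p.img_emb, p.txt_emb) >= threshold:
            yield p


def recaption(pairs: Iterable[Pair], captioner: Callable[[bytes], str],
              min_words: int = 5) -> Iterator[Pair]:
    """Stage 3: replace short captions with the captioner's output."""
    for p in pairs:
        if len(p.caption.split()) < min_words:
            p = Pair(p.image, captioner(p.image), p.img_emb, p.txt_emb)
        yield p


def _add(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def write_shards(pairs: Iterable[Pair], shard_size: int = 1000) -> List[bytes]:
    """Stage 4: pack samples into in-memory tar shards, one key per sample."""
    items = list(pairs)  # materialized for simplicity; a real run would stream
    shards = []
    for start in range(0, len(items), shard_size):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for i, p in enumerate(items[start:start + shard_size], start):
                key = f"{i:08d}"
                _add(tar, key + ".jpg", p.image)
                _add(tar, key + ".txt", p.caption.encode("utf-8"))
        shards.append(buf.getvalue())
    return shards


# Toy crawl: one duplicate, one mismatched caption, one weak caption.
raw = [
    Pair(b"img-a", "a dog running on a sandy beach", [1.0, 0.0], [0.9, 0.1]),
    Pair(b"img-a", "a dog running on a sandy beach", [1.0, 0.0], [0.9, 0.1]),
    Pair(b"img-b", "click here to buy now", [1.0, 0.0], [0.0, 1.0]),
    Pair(b"img-c", "cat", [0.0, 1.0], [0.1, 0.9]),
]
kept = list(recaption(clip_filter(dedup(raw)),
                      captioner=lambda image: "a synthetic caption"))
shards = write_shards(kept, shard_size=2)
```

The order follows the paragraph above: hashing is the cheapest check, so it runs first, and the captioner, which would be the most expensive step with a real VLM, only ever sees pairs that survived both filters.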