Toy Retrieval

Key Insight

Cross-modal retrieval sounds elaborate but reduces to nearest-neighbor search in a shared embedding space. You encode a few hundred images and their captions with CLIP once; then, to serve a query, you compare its embedding against every stored embedding and return the closest five (a top-k lookup with k=5). "Closest" here means highest cosine similarity. If every embedding is first scaled to unit length, cosine similarity is just a dot product, and because one matrix multiplication is a big batch of dot products, a single matmul scores every caption against every image at once. The whole lesson is that retrieval is "encode each item once, then compare with cheap dot products": the same primitive that carries over from this toy to billion-vector search engines, which typically add approximate indexes so they do not have to scan every stored vector.
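The comparison step can be sketched in a few lines of NumPy. This is a minimal illustration and not the lesson's actual code: random vectors stand in for CLIP embeddings, the dimension of 512 and the count of 300 are arbitrary choices, and the helper names `l2_normalize` and `top_k_cosine` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_cosine(queries, items, k=5):
    """Return (indices, scores) of the k most similar items for each query."""
    q = l2_normalize(queries)
    d = l2_normalize(items)
    sims = q @ d.T                      # one matmul: (n_queries, n_items) cosine scores
    k = min(k, sims.shape[1])
    idx = np.argsort(-sims, axis=1)[:, :k]          # best-first item indices per query
    scores = np.take_along_axis(sims, idx, axis=1)  # matching similarity values
    return idx, scores


# Stand-in data (assumption): 300 "image" vectors, and "caption" vectors that are
# noisy copies of them, mimicking paired items that land close together.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(300, 512))
caption_emb = image_emb + 0.1 * rng.normal(size=(300, 512))

# Text-to-image retrieval: each caption is a query against all stored images.
idx, scores = top_k_cosine(caption_emb, image_emb, k=5)
print(idx.shape, scores.shape)
```

With real data, `image_emb` and `caption_emb` would be the arrays produced by the encoder, computed once and stored. Swapping the two arguments gives image-to-text retrieval with the same function.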