Implement InfoNCE

Key Insight

Writing InfoNCE (the contrastive loss at the heart of CLIP and almost every dual encoder) by hand demystifies it: once you have L2-normalized the image and text embeddings and built the N×N cosine-similarity grid with a single matmul, the loss is just cross-entropy over that grid, with the diagonal entry (the true image-caption pair) as the correct class. The subtlety the formula hides is that it is symmetric: you run it once over the rows (each image picks its caption) and once over the columns (each caption picks its image), then average the two; skipping one half quietly biases the model toward one modality. Verifying the gradients against a finite-difference estimate, that is, nudging one input by a tiny ε and checking that the loss moves by the amount the gradient predicts, is the cheapest way to catch a sign flip or a wrong axis before you waste a full training run.
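The steps described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names (`info_nce`, `grad_wrt_images`, `finite_diff_grad`) are hypothetical, and the `temperature` divisor and its default of 0.07 are assumptions that the text above does not specify. The hand-derived gradient covers only the image embeddings and exists solely so that there is something to check against the finite-difference estimate.

```python
import numpy as np

def l2_normalize(x):
    # Scale every row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    # Subtract the max first for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(img, txt, temperature=0.07):
    # temperature is an assumed hyperparameter, not taken from the text above.
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature  # (N, N) grid
    diag = np.arange(len(logits))
    # Softmax over each row: image i must pick caption i.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Softmax over each column: caption j must pick image j.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def grad_wrt_images(img, txt, temperature=0.07):
    # Hand-derived gradient of info_nce with respect to the raw image embeddings.
    n = len(img)
    norms = np.linalg.norm(img, axis=1, keepdims=True)
    img_n, txt_n = img / norms, l2_normalize(txt)
    logits = img_n @ txt_n.T / temperature
    p_rows = np.exp(log_softmax(logits, axis=1))
    p_cols = np.exp(log_softmax(logits, axis=0))
    # Cross-entropy gradient (softmax minus one-hot) for both directions, averaged.
    d_logits = (p_rows + p_cols - 2.0 * np.eye(n)) / (2.0 * n)
    d_img_n = d_logits @ txt_n / temperature
    # Backprop through x / ||x||: drop the component along the unit vector.
    radial = (d_img_n * img_n).sum(axis=1, keepdims=True)
    return (d_img_n - img_n * radial) / norms

def finite_diff_grad(f, x, eps=1e-5):
    # Central differences: nudge one entry at a time by +eps and -eps.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        bumped = x.copy()
        bumped[idx] += eps
        up = f(bumped)
        bumped[idx] -= 2 * eps
        down = f(bumped)
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))  # 4 toy image embeddings of dimension 8
txt = rng.standard_normal((4, 8))  # 4 toy caption embeddings

analytic = grad_wrt_images(img, txt)
numeric = finite_diff_grad(lambda x: info_nce(x, txt), img)
max_err = np.abs(analytic - numeric).max()
print(f"loss = {info_nce(img, txt):.4f}, max gradient error = {max_err:.2e}")
```

If the two gradients disagree by much more than the finite-difference noise, the usual suspects are the ones named above: a flipped sign in the softmax-minus-one-hot term, or a softmax taken over the wrong axis. In an autodiff framework the same check applies, with the framework's gradient taking the place of `grad_wrt_images`.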