Compare Encoders
Key Insight
The fairest way to ask "which encoder sees the world best" is to freeze each one, extract its features for the same set of ImageNet images, and fit a linear probe on top. If a single linear layer can separate the classes, the encoder has already done the hard work. Running the same probe on a convolutional ResNet-50, a ViT, a contrastively trained SigLIP, and a label-free, self-supervised DINOv2 shows that how a model was trained often matters more than its architecture. DINOv2, which never saw a label, frequently beats supervised encoders, which is why it has become a default off-the-shelf vision backbone in 2026.
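The linear-probe protocol described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a benchmark implementation: the function names (`fit_linear_probe`, `compare_encoders`, `make_synthetic_features`), the hyperparameters, and the encoder labels `encoder_a` and `encoder_b` are all hypothetical, and the features are synthetic stand-ins rather than real ResNet-50, ViT, SigLIP, or DINOv2 outputs. The point it shows is the shape of the comparison: the same labels, the same probe, and the same settings for every encoder, with only the frozen features changing.

```python
import numpy as np


def standardize(train_x, test_x):
    """Scale features using statistics from the training split only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0) + 1e-8
    return (train_x - mean) / std, (test_x - mean) / std


def fit_linear_probe(train_x, train_y, num_classes, lr=0.1, epochs=300):
    """Fit one softmax linear layer on frozen features (full-batch gradient descent)."""
    n, dim = train_x.shape
    weights = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    onehot = np.eye(num_classes)[train_y]
    for _ in range(epochs):
        logits = train_x @ weights + bias
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (train_x.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def probe_accuracy(weights, bias, x, y):
    """Top-1 accuracy of the linear probe."""
    return float(((x @ weights + bias).argmax(axis=1) == y).mean())


def compare_encoders(features_by_encoder, train_y, test_y, num_classes):
    """Run the identical probe on each encoder's frozen features."""
    results = {}
    for name, (train_x, test_x) in features_by_encoder.items():
        train_x, test_x = standardize(train_x, test_x)
        weights, bias = fit_linear_probe(train_x, train_y, num_classes)
        results[name] = probe_accuracy(weights, bias, test_x, test_y)
    return results


def make_synthetic_features(rng, labels, num_classes, dim, separation):
    """Hypothetical stand-in for encoder outputs: class centers plus unit noise."""
    centers = separation * rng.normal(size=(num_classes, dim))
    return centers[labels] + rng.normal(size=(labels.size, dim))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_classes = 4
    labels = rng.integers(0, num_classes, size=600)
    train_y, test_y = labels[:400], labels[400:]
    features = {}
    # "encoder_a" separates the classes; "encoder_b" carries no class signal.
    for name, separation in [("encoder_a", 3.0), ("encoder_b", 0.0)]:
        x = make_synthetic_features(rng, labels, num_classes, dim=16, separation=separation)
        features[name] = (x[:400], x[400:])
    print(compare_encoders(features, train_y, test_y, num_classes))
```

To compare real encoders, the synthetic arrays would be replaced by features extracted once from each frozen model on the same images. Keeping the probe, the data split, and the hyperparameters identical across encoders is what makes the resulting accuracies comparable.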