The Contrastive Objective

CLIP (Radford et al., 2021) trains an image encoder and a text encoder jointly on 400M image-text pairs collected from the web. For a batch of $N$ pairs $(I_i, T_i)$, the objective maximises the cosine similarity between the embeddings of the $N$ matched pairs while minimising it for the $N^2 - N$ mismatched pairings, implemented as a symmetric cross-entropy loss over the temperature-scaled similarity matrix.
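A minimal NumPy sketch of this symmetric contrastive loss. The fixed temperature of 0.07 is illustrative — CLIP learns the temperature as a trainable parameter — and the embeddings here stand in for encoder outputs:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of N (image, text) pairs.

    image_emb, text_emb: (N, d) arrays of encoder outputs.
    """
    # L2-normalise so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(logits))                # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Each row of `logits` is treated as an $N$-way classification problem whose correct answer is the diagonal entry, which is exactly how the matched pairs are pulled together and the $N^2 - N$ mismatched pairings pushed apart.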

The resulting representations are remarkably general: by embedding a natural-language prompt for each class and picking the class whose text embedding best matches the image embedding, CLIP achieves competitive zero-shot classification on ImageNet without seeing a single labelled ImageNet example during training.
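The zero-shot procedure can be sketched as follows; `encode_text` is a stand-in for the trained text encoder, and the prompt template `"a photo of a {class}"` is the common choice from the CLIP paper:

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """Pick the class whose prompt embedding is most similar to the image.

    image_emb: (d,) embedding of one image from the image encoder.
    encode_text: callable mapping a prompt string to a (d,) embedding.
    """
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = np.stack([encode_text(p) for p in prompts])

    # Cosine similarity between the image and each class prompt
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    scores = text_emb @ image_emb

    return class_names[int(np.argmax(scores))]
```

Because the classifier is built entirely from text embeddings, swapping in a new label set requires no retraining, only re-encoding the prompts.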