The Dual-Encoder Architecture
CLIP (Radford et al., 2021) uses two separate encoders that never directly see each other's inputs. Each encoder processes its own modality in isolation, and the only point of contact between them is a shared embedding space where their outputs are compared.
- Image encoder (either a ResNet or a Vision Transformer): maps an image to a fixed-size vector $\mathbf{v}_I \in \mathbb{R}^d$. The ViT variant splits the image into patches, projects each patch into an embedding, and processes the sequence through transformer layers, using the [CLS] token output as the image representation.
- Text encoder (a Transformer): maps a caption or description to a vector $\mathbf{v}_T \in \mathbb{R}^d$ of the same dimensionality. It tokenises the input text, processes it through transformer layers, and takes the embedding at the [EOS] token position as the text representation.
Both vectors are L2-normalised so they live on the unit hypersphere ($\|\mathbf{v}_I\| = \|\mathbf{v}_T\| = 1$), and similarity is measured with the cosine, which for unit vectors reduces to the dot product:

$$\text{sim}(I, T) = \frac{\mathbf{v}_I \cdot \mathbf{v}_T}{\|\mathbf{v}_I\| \, \|\mathbf{v}_T\|} = \mathbf{v}_I \cdot \mathbf{v}_T$$
Cosine similarity ranges from $-1$ (vectors pointing in opposite directions) through $0$ (orthogonal vectors, completely unrelated) to $+1$ (vectors pointing in the same direction, a perfect match). By normalising both vectors to unit length, we remove any magnitude effects — a brighter image or a longer caption does not get a higher similarity score just because its raw vector happened to be larger. The only thing that matters is the angle between the two vectors, which captures how semantically aligned the image and text are.
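The scale-invariance claim is easy to verify directly. A minimal sketch with hand-picked vectors (not real CLIP embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

v_img = [0.6, 0.8, 0.0]
v_txt = [0.6, 0.8, 0.0]

print(cosine(v_img, v_txt))                   # same direction: close to 1.0
print(cosine([3 * x for x in v_img], v_txt))  # scaled 3x, still close to 1.0
print(cosine(v_img, [-x for x in v_txt]))     # opposite direction: close to -1.0
```

Tripling an input vector leaves the cosine untouched, which is exactly why a "larger" raw embedding earns no similarity advantage.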
The Contrastive Loss: InfoNCE
The heart of CLIP's training is a contrastive loss called InfoNCE (Information Noise-Contrastive Estimation, introduced by van den Oord et al., 2018). The setup is straightforward: a training batch contains $N$ image-text pairs $(I_1, T_1), \ldots, (I_N, T_N)$. Pair $(I_i, T_i)$ is a genuine match — the text describes the image. All other combinations $(I_i, T_j)$ where $i \neq j$ are treated as non-matches. The model must learn to push matching pairs close together in the shared embedding space while pushing non-matching pairs apart.
The loss for a single image $I_i$ finding its matching text among all $N$ texts in the batch is:

$$\mathcal{L}_{I \to T}^{(i)} = -\log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j) / \tau)}$$
Every piece of this formula has a specific job. Let's walk through them one by one.
The numerator $\exp(\text{sim}(I_i, T_i) / \tau)$: this is the similarity of the correct pair, exponentiated and scaled by a temperature parameter $\tau$. The exponential ensures the value is always positive (needed for a valid probability distribution), and temperature-scaling controls how much a small difference in similarity scores gets amplified. We want this numerator to be as large as possible — the model should assign the highest similarity to the correct image-text pair.
The denominator $\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j) / \tau)$: the sum runs over all $N$ texts in the batch — the correct match $T_i$ plus all $N-1$ distractors. Together, the numerator divided by the denominator forms a softmax over similarities, turning them into a probability distribution that sums to 1. The loss asks: what fraction of the total exponentiated similarity mass lands on the correct text? If the model assigns all the mass to the correct text, this fraction is 1. If the model cannot distinguish the correct text from distractors, the mass spreads evenly and the fraction drops to $1/N$.
The $-\log$: converts the probability into a loss value. If the model assigns probability 1.0 to the correct text, $-\log(1) = 0$ — perfect, no loss. If it assigns probability $1/N$ (uniform guessing), $-\log(1/N) = \log N$. So the loss is $0$ for perfect discrimination and $\log N$ for random guessing (it can climb higher still if the model actively prefers a wrong text), giving us a natural scale where the uniform-guessing baseline grows logarithmically with batch size. For CLIP's batch size of $N = 32{,}768$, the random-chance loss would be $\log(32768) \approx 10.4$.
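Putting the numerator, denominator, and $-\log$ together, the per-image loss can be sketched in a few lines. The similarity scores below are made up for illustration, not outputs of a real model:

```python
import math

def info_nce_row(sims, correct_idx, tau=0.07):
    """InfoNCE loss for one image: -log of the softmax probability
    assigned to the correct text."""
    exps = [math.exp(s / tau) for s in sims]       # numerator terms, all positive
    prob_correct = exps[correct_idx] / sum(exps)   # softmax over the batch of texts
    return -math.log(prob_correct)

# Similarities of one image to 4 texts; index 0 is the true match
print(info_nce_row([0.9, 0.2, 0.1, 0.3], 0))  # close to 0: confident and correct
print(info_nce_row([0.5, 0.5, 0.5, 0.5], 0))  # log(4) ≈ 1.386: texts indistinguishable
print(math.log(32768))                         # ≈ 10.4: uniform guessing at CLIP's batch size
```

With $\tau = 0.07$, even a modest similarity gap (0.9 vs. 0.3) becomes a huge gap after exponentiation, so the correct-pair probability is nearly 1 and the loss nearly 0.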
The temperature $\tau$: a learned scalar (initialised around 0.07 in CLIP). It controls the sharpness of the softmax distribution. To understand this, consider two extremes:
- When $\tau \to 0^+$, the softmax approaches a hard argmax — the model becomes extremely confident, assigning nearly all probability to the most similar text. Dividing similarities by a tiny $\tau$ amplifies small differences into huge ones, so even a slight edge in similarity makes one pair dominate completely. Gradients become very peaky (large for the winner, near-zero for everything else), which can make training unstable because the model gets almost no learning signal from the non-winning negatives.
- When $\tau \to \infty$, the softmax flattens toward a uniform distribution. Dividing similarities by an enormous $\tau$ shrinks all differences to near zero, so $\exp(s_i / \tau) \approx 1$ for every pair. All texts look equally similar regardless of actual scores, and the model receives no useful gradient — it cannot tell matches from non-matches.
- The sweet spot is in between: sharp enough to discriminate well between correct and incorrect matches, but soft enough to provide gradient signal to all negatives so the model can learn to push them away.
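The two extremes are easy to see numerically. A quick sketch of how $\tau$ reshapes the same three scores (values chosen for illustration):

```python
import math

def softmax(scores, tau):
    """Softmax over similarity scores at temperature tau."""
    exps = [math.exp(s / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.50, 0.45, 0.10]  # a strong match, a close distractor, a weak one

for tau in (0.01, 0.07, 5.0):
    probs = softmax(scores, tau)
    print(f"tau={tau}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At $\tau = 0.01$ nearly all the mass lands on the top score despite the 0.05 gap to the runner-up; at $\tau = 5$ the three probabilities are almost identical; CLIP's $\tau \approx 0.07$ sits between the two regimes.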
Symmetry. The loss above asks: "given image $I_i$, can the model find the correct text $T_i$?" But we also need the reverse direction: "given text $T_i$, can the model find the correct image $I_i$?" CLIP computes the loss in both directions. The text-to-image loss is:

$$\mathcal{L}_{T \to I}^{(i)} = -\log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_j, T_i) / \tau)}$$
The total loss averages both directions across all $N$ pairs in the batch:

$$\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left( \mathcal{L}_{I \to T}^{(i)} + \mathcal{L}_{T \to I}^{(i)} \right)$$
Why does symmetry matter? Each direction normalises over only one modality: the image-to-text loss makes each image compete to pick out its caption from the batch of texts, while the text-to-image loss makes each text compete to pick out its image from the batch of images. Using only one direction gives an asymmetric training signal — the softmax competition directly forces one modality's embeddings apart while constraining the other's only indirectly, making it easier for the weakly-constrained side to bunch together. Computing the loss in both directions ensures that image embeddings and text embeddings must each be spread out and mutually discriminative, forcing the model to learn a well-structured embedding space.
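A minimal sketch of the symmetric loss over a toy $3 \times 3$ similarity matrix (scores invented for illustration; matches sit on the diagonal):

```python
import math

def info_nce(sim, tau=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix sim[i][j] = sim(I_i, T_j)."""
    n = len(sim)

    def direction(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            exps = [math.exp(s / tau) for s in row]
            loss += -math.log(exps[i] / sum(exps))  # correct pair is at index i
        return loss / n

    img_to_txt = direction(sim)                               # rows: one image vs all texts
    txt_to_img = direction([list(c) for c in zip(*sim)])      # columns: one text vs all images
    return 0.5 * (img_to_txt + txt_to_img)

# Diagonal dominates in both directions, so the loss is small
sim = [[0.9, 0.1, 0.2],
       [0.0, 0.8, 0.1],
       [0.2, 0.1, 0.7]]
print(info_nce(sim))
```

Feeding in a constant matrix (every pair equally similar, the collapsed case) drives the loss to $\log N$ in both directions, which is what penalises degenerate embeddings.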
Why Batch Size Matters
Each batch of $N$ pairs creates a rich training signal. The $N \times N$ similarity matrix contains $N$ positive pairs along the diagonal and $N^2 - N$ negative comparisons off the diagonal. Larger batches mean more negatives per positive, which means a harder discrimination task that forces the model to learn finer-grained representations.
To put this in numbers: with a batch of $N = 256$, each image is compared against 255 wrong texts. The model might get away with learning coarse distinctions ("this is an animal" vs. "this is a building"). With $N = 32{,}768$ (CLIP's actual batch size), each image is compared against 32,767 wrong texts. Now the batch almost certainly contains many semantically similar distractors — multiple dog images with different descriptions, multiple outdoor scenes — so the model must learn to distinguish "a golden retriever playing fetch on a beach" from "a labrador swimming in a lake". The sheer number of negatives forces fine-grained alignment between images and their specific descriptions.
This comes at a cost. All $N$ image embeddings and all $N$ text embeddings must fit in GPU memory simultaneously to compute the full $N \times N$ similarity matrix. At $N = 32{,}768$ with 512-dimensional embeddings in float32, that is roughly $32{,}768 \times 512 \times 4 \text{ bytes} \times 2 \text{ (image + text)} \approx 128$ MB just for the embeddings, plus the $32{,}768 \times 32{,}768$ similarity matrix itself (~4 GB in float32). CLIP distributes this across many GPUs, with each GPU computing a local batch and then gathering embeddings from all other GPUs to form the full similarity matrix.
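The memory arithmetic can be checked in a couple of lines:

```python
BYTES_F32 = 4       # float32
N, d = 32_768, 512  # CLIP's batch size and a 512-dim embedding space

embeddings = N * d * BYTES_F32 * 2  # image + text embedding matrices
sim_matrix = N * N * BYTES_F32      # full N x N similarity matrix

print(f"embeddings: {embeddings / 2**20:.0f} MiB")        # 128 MiB
print(f"similarity matrix: {sim_matrix / 2**30:.0f} GiB") # 4 GiB
```

The similarity matrix, not the embeddings, dominates: it grows quadratically with batch size, which is why the computation has to be sharded across GPUs.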
Zero-Shot Classification
Once trained, CLIP can classify images into categories it has never seen labelled examples of — a capability called zero-shot classification. The trick is to treat classification as a retrieval problem in the shared embedding space:
- Step 1: For each class $c$, create a text prompt: "a photo of a {class name}" (e.g., "a photo of a golden retriever", "a photo of a sports car", "a photo of a pizza").
- Step 2: Encode all prompts through the text encoder to get text embeddings $\{\mathbf{v}_{T_c}\}$. These can be precomputed once.
- Step 3: Encode the test image through the image encoder to get $\mathbf{v}_I$.
- Step 4: Pick the class whose text embedding has the highest cosine similarity to the image:

$$\hat{c} = \arg\max_{c} \; \mathbf{v}_I \cdot \mathbf{v}_{T_c}$$
This is remarkably effective. On ImageNet (1,000 classes), CLIP's best model (ViT-L/14) achieves around 76% top-1 accuracy without seeing a single ImageNet training image (Radford et al., 2021). For context, a ResNet-50 trained on the full ImageNet training set (1.2 million labelled images, explicitly supervised for these 1,000 classes) also achieves about 76%. CLIP matches that performance using only the natural language understanding it acquired from 400 million image-text pairs scraped from the web — none of which were ImageNet labels.
The prompt template matters. "a photo of a {class}" tends to work better than just "{class}" because it adds context that anchors the text embedding closer to what actual image captions look like in CLIP's training data. OpenAI found that ensembling multiple prompt templates ("a painting of a {class}", "a drawing of a {class}", etc.) and averaging the resulting text embeddings can boost accuracy by a few percentage points.
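The ensembling step itself is just average-then-renormalise. In the sketch below the three vectors are invented stand-ins for CLIP text-encoder outputs of different templates for the same class; only the averaging logic is the point:

```python
import math

def norm(v):
    mag = math.sqrt(sum(x * x for x in v))
    return [x / mag for x in v]

# Toy stand-ins for text-encoder outputs (in practice: CLIP's text encoder)
template_embeddings = [
    norm([0.80, 0.15, 0.05]),  # "a photo of a dog"
    norm([0.75, 0.20, 0.10]),  # "a painting of a dog"
    norm([0.85, 0.10, 0.08]),  # "a drawing of a dog"
]

# Average the normalised embeddings, then renormalise:
# a single ensembled class vector for zero-shot classification
dim = len(template_embeddings[0])
avg = [sum(v[i] for v in template_embeddings) / len(template_embeddings)
       for i in range(dim)]
class_embedding = norm(avg)
print(class_embedding)
```

Renormalising after averaging matters: the mean of unit vectors is shorter than a unit vector, and the classifier compares cosine similarities, which assume unit length.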
The code below demonstrates this idea with a simplified example. We simulate 4 image embeddings and 4 text prompt embeddings in 3 dimensions, compute the full cosine similarity matrix, and see which text each image matches best:
```python
import math

# Simulated embeddings (in practice these come from CLIP's encoders)
# 4 images, 4 text descriptions, embedding dim = 3
images = [
    [0.9, 0.1, 0.0],   # "dog" image
    [0.0, 0.8, 0.3],   # "cat" image
    [0.1, 0.0, 0.95],  # "car" image
    [0.7, 0.3, 0.1],   # "puppy" image
]
texts = [
    [0.85, 0.15, 0.05],  # "a photo of a dog"
    [0.05, 0.75, 0.35],  # "a photo of a cat"
    [0.15, 0.05, 0.9],   # "a photo of a car"
    [0.6, 0.4, 0.2],     # "a photo of a puppy"
]

def norm(v):
    mag = math.sqrt(sum(x ** 2 for x in v))
    return [x / mag for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Normalise so the dot product equals the cosine similarity
images_n = [norm(v) for v in images]
texts_n = [norm(v) for v in texts]

# Compute and print the similarity matrix, marking each image's best match
labels_i = ["dog img", "cat img", "car img", "puppy img"]
labels_t = ["dog txt", "cat txt", "car txt", "puppy txt"]

print(f"{'Image':<10}" + "".join(f"{t:>10}" for t in labels_t) + f"{'Best match':>12}")
for i, iv in enumerate(images_n):
    row = [dot(iv, tv) for tv in texts_n]
    best = row.index(max(row))
    print(f"{labels_i[i]:<10}" + "".join(f"{s:>10.3f}" for s in row) + f"{labels_t[best]:>12}")

print("The diagonal should be highest if the model has learned good alignment.")
print("Notice 'puppy img' matches 'dog txt' well too — CLIP captures semantic similarity.")
```
Notice how the similarity matrix reveals structure beyond the diagonal. The "puppy img" has high similarity to both "dog txt" and "puppy txt" because puppies and dogs are semantically close. In a real CLIP model with 512 or 768 dimensions, these semantic relationships are captured with much finer granularity, but the principle is the same: the shared embedding space organises concepts by meaning, not by surface-level pixels or characters.
Quiz
Test your understanding of CLIP's architecture, training, and zero-shot capabilities.
In CLIP's contrastive loss, what role does the denominator (sum over all texts in the batch) play?
What happens to the softmax distribution as the temperature $\tau \to 0^+$?
Why does CLIP use both image-to-text and text-to-image loss directions?
How does CLIP perform zero-shot classification on a dataset it was never trained on?