CLIP's Scaling Bottleneck

CLIP's InfoNCE loss computes a softmax over the entire batch. For a batch of $N$ image-text pairs, the loss for image $I_i$, which must pick out its matching text $T_i$ from all $N$ candidates, is:

$$\mathcal{L}_i = -\log \frac{\exp(z_{ii} / \tau)}{\sum_{j=1}^{N} \exp(z_{ij} / \tau)}$$

where $z_{ij} = \text{sim}(I_i, T_j)$ is the cosine similarity between the embedding of image $i$ and the embedding of text $j$, and $\tau$ is the learned temperature parameter. The critical piece to notice here is the denominator: $\sum_{j=1}^{N} \exp(z_{ij} / \tau)$. This sum runs over all $N$ texts in the batch. To compute this sum, every single similarity score $z_{i1}, z_{i2}, \ldots, z_{iN}$ must be available — which means every GPU must see every pair.
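To make the batch-wide dependence concrete, here is a minimal NumPy sketch of the per-image InfoNCE term (illustrative only — `info_nce_row` and the toy similarity matrix are not from CLIP's codebase):

```python
import numpy as np

def info_nce_row(sim_row: np.ndarray, i: int, tau: float = 0.07) -> float:
    """InfoNCE loss for image i: the denominator needs the similarity of
    image i against EVERY text in the batch, which is what forces the
    all-gather in distributed training."""
    logits = sim_row / tau
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over all N texts
    return float(-np.log(probs[i]))

# Toy batch of N = 4: diagonal entries (matched pairs) are most similar.
rng = np.random.default_rng(0)
sim = rng.uniform(-0.2, 0.2, size=(4, 4))
np.fill_diagonal(sim, 0.9)
loss = np.mean([info_nce_row(sim[k], k) for k in range(4)])
```

With matched pairs far more similar than the rest, the average loss is close to zero; shrinking the diagonal similarities raises it.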

In distributed training across many GPUs, this creates a hard constraint: the full $N \times N$ similarity matrix must be synchronised before the softmax normalisation can happen. Each GPU computes embeddings for its local mini-batch, but then all embeddings must be gathered across all GPUs so that every device can compute the complete denominator. For CLIP's training batch size of $N = 32{,}768$, that similarity matrix has $32{,}768 \times 32{,}768 \approx 1.07$ billion entries. Gathering all embeddings and computing this matrix becomes a communication bottleneck — the GPUs spend significant time waiting for each other rather than doing useful computation.

CLIP handled this by distributing the batch across GPUs and performing an all-gather operation to collect all embeddings before computing the softmax. It works, but it is expensive, adds engineering complexity, and the communication cost grows linearly with the number of GPUs. SigLIP offers a cleaner solution.

SigLIP: Sigmoid Loss for Language-Image Pre-training

SigLIP (Zhai et al., 2023) replaces the softmax-based contrastive loss (which requires global normalisation across the entire batch) with a sigmoid applied independently to each image-text pair. The loss becomes:

$$\mathcal{L} = -\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ y_{ij} \log \sigma(z_{ij}) + (1 - y_{ij}) \log(1 - \sigma(z_{ij})) \right]$$

This formula looks dense, but every piece has a clear purpose. Let's walk through each component.

$z_{ij} = \text{sim}(I_i, T_j) \cdot e^\tau + b$ : the scaled similarity between image $i$ and text $j$. The raw cosine similarity $\text{sim}(I_i, T_j)$ lies in $[-1, 1]$; it is scaled by $e^\tau$, where $\tau$ is a learned log-temperature (the exponential guarantees the scale is always positive), and shifted by a learned bias $b$. The bias $b$ lets the model shift the decision boundary — without it, a cosine similarity of 0 would always map to $\sigma(0) = 0.5$, which might not be the right threshold for distinguishing matches from non-matches. By learning both $\tau$ and $b$, the model controls how aggressively it separates positives from negatives.

$y_{ij} = \mathbb{1}[i = j]$ : an indicator variable that equals 1 when image $i$ and text $j$ are a genuine matched pair (i.e., they came from the same training example), and 0 otherwise. In a batch of $N$ pairs, there are exactly $N$ positives (the diagonal of the $N \times N$ matrix, where $i = j$) and $N^2 - N$ negatives (all off-diagonal entries, where $i \neq j$). For a batch of $N = 1{,}024$, that is 1,024 positives and $1{,}048{,}576 - 1{,}024 = 1{,}047{,}552$ negatives — the vast majority of pairs are negatives, which is typical for contrastive learning.

$\sigma(z_{ij})$ : the sigmoid function, $\sigma(x) = 1 / (1 + e^{-x})$. It maps the scaled similarity score to a probability in the range $(0, 1)$ — this is the model's predicted probability that $(I_i, T_j)$ is a genuine matched pair. When $z_{ij}$ is large and positive, $\sigma(z_{ij}) \to 1$ (confident match). When $z_{ij}$ is large and negative, $\sigma(z_{ij}) \to 0$ (confident non-match). When $z_{ij} = 0$, $\sigma(0) = 0.5$ (maximally uncertain).

The two terms inside the brackets implement binary cross-entropy for each pair:

  • When $y_{ij} = 1$ (genuine pair): only the first term survives, giving $\log \sigma(z_{ij})$. This is maximised when $\sigma(z_{ij}) \to 1$, i.e., when the model confidently predicts that this image-text pair is a match. If the model assigns $\sigma(z_{ij}) = 0.99$, the contribution is $\log(0.99) \approx -0.01$ — a tiny penalty. If the model is uncertain ($\sigma(z_{ij}) = 0.5$), the contribution is $\log(0.5) \approx -0.69$ — a larger penalty. If the model is wrong ($\sigma(z_{ij}) = 0.01$), it is $\log(0.01) \approx -4.6$ — a very large penalty.
  • When $y_{ij} = 0$ (non-match): only the second term survives, giving $\log(1 - \sigma(z_{ij}))$. This is maximised when $\sigma(z_{ij}) \to 0$, i.e., when the model confidently predicts that this pair is NOT a match. The penalty structure is symmetric: a confident correct prediction ($\sigma = 0.01$) gives $\log(0.99) \approx -0.01$, while a confident wrong prediction ($\sigma = 0.99$) gives $\log(0.01) \approx -4.6$.

This is exactly standard binary cross-entropy (the same loss used in logistic regression), applied independently to each of the $N^2$ image-text pairs in the batch.
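The penalty values quoted above can be checked directly. A quick sketch — `bce_term` (a hypothetical helper name) is just the bracketed log-likelihood term for one pair:

```python
import math

def bce_term(y: int, p: float) -> float:
    """The bracketed term y*log(p) + (1-y)*log(1-p) for one pair,
    where p = sigma(z_ij) is the predicted match probability."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)

print(round(bce_term(1, 0.99), 2))   # confident, correct:  -0.01
print(round(bce_term(1, 0.50), 2))   # uncertain:           -0.69
print(round(bce_term(1, 0.01), 2))   # confidently wrong:   -4.61
```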

$\frac{1}{N^2}$ : normalises by the total number of pairs (both positive and negative). Without this normalisation, the loss would scale with batch size, making hyperparameter tuning difficult across different batch sizes. With it, the loss is an average over all pairs, giving a consistent scale regardless of $N$.
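Putting the pieces together, the whole loss fits in a few lines of NumPy. This is a sketch, not the reference implementation: the initial values for the log-temperature and bias ($\tau' = \log 10$, $b = -10$) follow the initialisation reported in the SigLIP paper, but everything else here (function name, shapes) is illustrative:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, log_tau=np.log(10.0), bias=-10.0):
    """Sigmoid loss over all N^2 image-text pairs of a batch."""
    # L2-normalise so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    z = img @ txt.T * np.exp(log_tau) + bias    # z_ij = sim * e^tau + b
    n = z.shape[0]
    y = np.eye(n)                               # y_ij = 1[i == j]
    # Numerically stable log sigma(z) = -log(1 + exp(-z)).
    log_sig = -np.logaddexp(0.0, -z)
    log_one_minus_sig = -np.logaddexp(0.0, z)
    return -(y * log_sig + (1 - y) * log_one_minus_sig).sum() / n**2

rng = np.random.default_rng(0)
loss = siglip_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

Because of the $1/N^2$ average, doubling the batch size does not change the scale of the loss.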

Why this removes the bottleneck: each $(i, j)$ term inside the double sum depends only on the embeddings of image $i$ and text $j$ — there is no denominator that sums over all other pairs. Compare this to CLIP's softmax, where computing the loss for image $i$ requires the similarity scores $z_{i1}, z_{i2}, \ldots, z_{iN}$ between image $i$ and every text in the batch. In SigLIP, the term for pair $(i, j)$ needs only $z_{ij}$ — one similarity score, involving two embeddings. This means different GPUs can compute their local subset of $(i, j)$ terms independently, using only the embeddings stored on that device, and then average the results. No global synchronisation of the similarity matrix is required. Each GPU computes the loss for its local pairs, and only the scalar loss values (not the full embedding matrix) need to be aggregated.
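This decomposition can be verified numerically. In the toy below, two hypothetical "devices" each evaluate only their local image rows, and the averaged partial sums reproduce the global loss exactly — it illustrates the decomposition of the double sum, not SigLIP's actual communication scheme:

```python
import numpy as np

def pair_terms(z, y):
    """Per-pair sigmoid losses; each entry needs only its own z_ij."""
    return -(y * -np.logaddexp(0.0, -z) + (1 - y) * -np.logaddexp(0.0, z))

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 8))            # pretend scaled similarities
y = np.eye(8)

full = pair_terms(z, y).sum() / 8**2   # loss computed in one place
gpu0 = pair_terms(z[:4], y[:4]).sum()  # "device 0" holds image rows 0-3
gpu1 = pair_terms(z[4:], y[4:]).sum()  # "device 1" holds image rows 4-7
assert np.isclose(full, (gpu0 + gpu1) / 8**2)
```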

💡 In practice, SigLIP achieves comparable or better performance than CLIP while being simpler to distribute (Zhai et al., 2023). SigLIP-B/16 trained on the same data as CLIP-B/16 tends to match or exceed it on zero-shot ImageNet, and the gap widens with more GPUs because SigLIP scales more gracefully — adding devices gives near-linear speedup without the communication overhead of all-gathering embeddings for a global softmax.

DINOv2: Self-Supervised Vision Without Text

Both CLIP and SigLIP learn visual representations by aligning images with text — they need image-text pairs for training. DINOv2 (Oquab et al., 2024) takes a fundamentally different approach: it learns visual representations using only images, with no text at all. The key idea is self-distillation — the model learns by trying to match its own outputs across different views of the same image.

DINOv2 uses a student-teacher framework with three main components:

  • Two networks: a student $f_s$ and a teacher $f_t$, both with the same architecture (a Vision Transformer). Crucially, the teacher is not trained with gradient descent. Instead, its weights are an exponential moving average (EMA) of the student's weights, updated after each training step as: $\theta_t \leftarrow m \cdot \theta_t + (1 - m) \cdot \theta_s$, where $m \approx 0.996$. This means the teacher's weights are a smoothed, slowly-evolving version of the student's — after each update, the teacher moves just 0.4% toward the student's current weights. This slow evolution provides a stable training target that avoids the collapse problems that plague naive self-supervised methods.
  • Two views of the same image: from a single training image, create a "global" crop (covering a large portion of the image, typically 50-100%) and a "local" crop (covering a smaller portion, typically 5-50%). Both crops represent the same underlying scene from different perspectives and scales. The model must recognise that both views depict the same content.
  • Asymmetric input: the student sees the local crop, and the teacher sees the global crop. The student must predict the teacher's output from its limited, zoomed-in view. This forces the student to learn representations that capture the broader context of the scene — it must infer what the full image looks like from just a small piece of it.
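The EMA update from the first bullet is one line of code. A toy sketch (the `ema_update` helper name is hypothetical) shows how slowly the teacher drifts: one step moves it only 0.4%, yet after 1,000 steps with a frozen student it has covered roughly 98% of the distance ($1 - 0.996^{1000} \approx 0.982$):

```python
import numpy as np

def ema_update(theta_t, theta_s, m=0.996):
    """theta_t <- m * theta_t + (1 - m) * theta_s, applied per tensor."""
    return [m * t + (1 - m) * s for t, s in zip(theta_t, theta_s)]

teacher = [np.zeros(3)]          # toy "weights": a single tensor
student = [np.ones(3)]           # held fixed to make the drift visible
for _ in range(1000):
    teacher = ema_update(teacher, student)
# teacher is now ~0.982 of the way from its start to the student
```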

The loss is a cross-entropy between the teacher's and student's output distributions:

$$\mathcal{L}_{\text{DINO}} = -\sum_{k} p_t^{(k)} \log p_s^{(k)}$$

Let's unpack each part of this formula.

$p_t^{(k)}$ and $p_s^{(k)}$ are probability distributions over $K$ dimensions, produced by applying a softmax with temperature to the CLS token outputs of the teacher and student respectively. Specifically, $p_t^{(k)} = \exp(g_t^{(k)} / \tau_t) / \sum_{k'} \exp(g_t^{(k')} / \tau_t)$ where $g_t$ is the teacher's output logits and $\tau_t$ is the teacher's temperature, and similarly for the student with temperature $\tau_s$. The teacher uses a lower temperature ($\tau_t \approx 0.04$) which produces a sharper, more peaked distribution — it concentrates probability mass on fewer dimensions, effectively saying "I am confident this image strongly activates these particular features." The student uses a higher temperature ($\tau_s \approx 0.1$), producing a softer, more spread-out distribution. The cross-entropy loss then pushes the student to sharpen its own distribution to match the teacher's confident predictions.

The sum $\sum_k$ runs over all $K$ dimensions of the output distribution. Each dimension can be thought of as a learned visual "concept" — not predefined classes like in supervised learning, but abstract features that the model discovers through training. The cross-entropy is minimised when $p_s = p_t$, i.e., when the student's distribution exactly matches the teacher's. If the teacher assigns high probability to dimension $k$ (meaning this feature is strongly present in the global view), the student must also assign high probability to $k$ from its local view alone.
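As a sketch, the loss and both temperature-scaled softmaxes look like this (illustrative function names and toy logits; `center` stands in for the running mean used to centre the teacher's outputs, discussed next):

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # numerical stability
    e = np.exp(x)
    return e / e.sum()

def dino_loss(g_t, g_s, tau_t=0.04, tau_s=0.1, center=0.0):
    """Cross-entropy between the teacher's sharp distribution and the
    student's softer one. `center` is subtracted from the teacher's
    logits before its softmax."""
    p_t = softmax((g_t - center) / tau_t)   # sharp target (low temperature)
    p_s = softmax(g_s / tau_s)              # softer student prediction
    return float(-(p_t * np.log(p_s)).sum())

g_t = np.array([2.0, 0.5, -1.0, 0.0])       # teacher CLS logits (toy)
g_s = np.array([1.5, 0.4, -0.5, 0.2])       # student CLS logits (toy)
loss = dino_loss(g_t, g_s)
```

A student that disagrees with the teacher (e.g. pass `-g_s`) incurs a much larger loss than the agreeing student above.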

Why does this work? The teacher (with EMA weights) provides a slowly-evolving, stable target. Because its weights change gradually, it does not chase the student's rapid updates — it acts like a smoothed consensus of where the student has been over recent training steps. The student tries to match this stable target. Because the student sees only a local crop while the teacher sees the full image, the student must learn to infer global structure from local patterns. If the teacher sees a beach scene and confidently activates "sand", "water", and "sky" features, the student — seeing only a close-up of sand with a few shells — must learn that this local pattern implies the broader context. Over many training steps, this forces the student to build representations that capture rich spatial relationships: how parts relate to wholes, how textures relate to objects, how local details relate to global scene structure.

There is also a crucial mechanism to prevent collapse. Without any safeguard, the teacher could simply output a uniform distribution for every image (the trivial solution where both teacher and student agree on a meaningless constant). DINOv2 prevents this by centring the teacher's outputs: it subtracts a running mean of the teacher's logits before applying the softmax. This forces the teacher to produce different outputs for different images, because the mean is subtracted away and only the relative differences survive.
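A sketch of the centring update (the momentum value and helper name are illustrative; DINO-style centring keeps an EMA of the teacher's batch-mean logits):

```python
import numpy as np

def update_center(center, teacher_logits, m=0.9):
    """EMA of the teacher's mean logits across the batch."""
    return m * center + (1 - m) * teacher_logits.mean(axis=0)

# If the teacher collapses to a constant output, centring cancels it:
logits = np.full((4, 5), 7.0)             # identical logits for every image
center = update_center(np.zeros(5), logits, m=0.0)
assert np.allclose(logits - center, 0.0)  # only relative differences survive
```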

💡 DINOv2 features tend to excel at spatial tasks: depth estimation, semantic segmentation, object detection — tasks where CLIP features are often weaker (Oquab et al., 2024). This is because CLIP optimises for global image-text alignment (what is in the image), while DINOv2 optimises for local-to-global visual consistency (where things are and how they relate spatially). When you need to know that the table is in front of the chair and the cup is on the table, DINOv2's features tend to encode those spatial relationships more faithfully than CLIP's.

When to Use Which

We now have three vision encoders in our toolkit — CLIP, SigLIP, and DINOv2 — each optimised for different things. Understanding when to use which is important for building effective vision-language systems.

  • CLIP / SigLIP: best when you need vision-language alignment — zero-shot classification, text-image retrieval, or as the visual backbone for a VLM. Both produce embeddings that live in a shared space with text, so you can directly compare images against language descriptions. SigLIP is generally preferred over CLIP for new projects: it achieves comparable or better accuracy with simpler distributed training, and the sigmoid loss avoids the engineering complexity of all-gathering embeddings for softmax normalisation.
  • DINOv2: best when you need rich spatial and geometric features — depth estimation, semantic segmentation, object localisation, pose estimation — and do not need text alignment. DINOv2 embeddings carry fine-grained spatial information because the self-distillation objective forces the model to understand how local patches relate to the global scene. However, DINOv2 features do not live in a shared space with text, so you cannot directly use them for zero-shot classification with language prompts.
  • Both together: some models use SigLIP for semantic understanding AND DINOv2 for spatial features, concatenating both encoder outputs before feeding them to the downstream model. This dual-encoder approach captures complementary information: SigLIP tells the model what is in the image (semantic content aligned with language), while DINOv2 tells the model where things are and how they relate spatially (geometric structure). For example, OpenVLA (covered in the VLA track) concatenates SigLIP and DINOv2 features to give its robotic policy both the semantic understanding needed to interpret language commands ("pick up the red cup") and the spatial precision needed to execute physical actions (reaching to the correct location).

The intuition for why these features are complementary comes directly from the training objectives. CLIP and SigLIP are trained to answer the question "does this image match this text?" — a global, semantic question. The loss rewards collapsing an entire image into a single vector that captures its meaning. DINOv2 is trained to answer "can you predict the full scene from this local crop?" — a spatial, structural question. The loss rewards preserving the geometric relationships between parts of the image. Neither objective alone captures everything, but together they cover both the "what" and the "where".
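Mechanically, the dual-encoder fusion is just concatenation along the channel dimension. A sketch with illustrative shapes — 256 patch tokens, and widths of 1152 and 1024, which are typical ViT widths rather than any model's confirmed configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
siglip_feats = rng.normal(size=(256, 1152))   # semantic: "what is here"
dinov2_feats = rng.normal(size=(256, 1024))   # spatial: "where it is"

# Concatenate per patch so each token carries both kinds of information.
fused = np.concatenate([siglip_feats, dinov2_feats], axis=-1)
assert fused.shape == (256, 2176)
```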

Quiz

Test your understanding of SigLIP's sigmoid loss, DINOv2's self-distillation, and when to use each encoder.

  • What is the key engineering problem with CLIP's softmax-based loss that SigLIP solves?
  • In SigLIP's loss, why can each $(i, j)$ pair be computed independently?
  • How does DINOv2 learn visual features without any text supervision?
  • Why do some VLMs (like OpenVLA) use both SigLIP and DINOv2 encoders rather than just one?