The Modality Gap

An image classifier trained on ImageNet can look at a photo and label it "golden retriever" — but it cannot answer "what is the dog doing?" or "is this a good pet for an apartment?". It sees pixels, matches patterns, and outputs one of a thousand pre-defined labels. Meanwhile, a language model can discuss dogs eloquently — breeds, temperaments, training tips — but it has never seen a single pixel. It knows the word "fluffy" as a token that tends to appear near "fur" and "soft", but it has no concept of what fluffy actually looks like. These two worlds — vision and language — developed independently, each powerful in its own domain but completely blind to the other.

The reason they're blind to each other is that images and text live in fundamentally different representational spaces. An image is a 3D tensor of RGB pixel values — a 224×224 photo, for instance, is a grid of 150,528 numbers between 0 and 255. Text, on the other hand, is a sequence of discrete vocabulary indices — each word or subword maps to an integer in a fixed dictionary. There's no natural bridge between these formats. A model trained only on images has no concept of "fluffy" as a word; a model trained only on text has no concept of "fluffy" as a texture you can see in a photo.
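The mismatch is easy to see in a few lines of code. Below is a minimal sketch (the tiny vocabulary is invented for illustration) contrasting the two representations:

```python
import numpy as np

# An RGB image: a 3D array of pixel intensities in [0, 255].
image = np.zeros((224, 224, 3), dtype=np.uint8)
print(image.size)  # 224 * 224 * 3 = 150,528 numbers

# Text: a sequence of discrete vocabulary indices.
vocab = {"a": 0, "fluffy": 1, "golden": 2, "retriever": 3}
text = [vocab[word] for word in "a fluffy golden retriever".split()]
print(text)  # [0, 1, 2, 3]
```

A grid of continuous pixel values and a short list of dictionary indices have no shared operations — you cannot meaningfully ask how "close" one is to the other. That is the gap a shared embedding space closes.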

Vision-Language Models (VLMs) bridge this gap by learning a shared embedding space where images and text can be compared directly. The key insight is deceptively simple: if the text description "a golden retriever playing fetch in a park" and a photo of exactly that scene both map to nearby points in the same vector space, we unlock an entirely new class of capabilities. The image doesn't need a label — it has a position in a space where language already lives, and that position tells us what the image means.

The model that proved this idea works at scale is CLIP (Radford et al., 2021), which trained an image encoder and a text encoder jointly on 400 million image-text pairs scraped from the internet. We'll cover CLIP in depth in article 2. For now, the important takeaway is that shared embedding spaces are not a theoretical curiosity — they're the foundation of every practical VLM today.

What Does Alignment Unlock?

Once images and text share the same embedding space, a single similarity metric — typically cosine similarity — works in both directions. This unlocks capabilities that are impractical or infeasible with separate image and text models:

  • Zero-shot classification: no labelled training data needed. To classify an image, you encode it and compare its embedding against text descriptions of every candidate class ("a photo of a cat", "a photo of a dog", "a photo of a car"). The class whose text embedding is closest to the image embedding wins. An ImageNet classifier needs 1.2 million labelled images and can only recognise the categories it was trained on; a VLM with zero-shot classification needs zero labelled images and can handle any category you can describe in words.
  • Cross-modal search: search a database of millions of images using a text query like "sunset over ocean with a sailboat", or go the other direction — given an image, find text descriptions that match it. The same similarity metric works both ways because images and text occupy the same space. This is the technology behind image search in products like Google Photos, Unsplash, and many stock photo platforms.
  • Visual question answering (VQA): given an image and a natural language question ("How many people are in this photo?", "What colour is the car on the left?"), a VLM can reason about both modalities together to produce an answer. This goes far beyond classification — it requires understanding spatial relationships, counting, reading text in images, and more. We cover VQA architectures in depth in article 6.
  • Guiding image generation: models like DALL-E and Stable Diffusion use CLIP-like text encoders to condition image generation on text prompts. The text embedding tells the image generator what to create. Without a shared embedding space that captures the meaning of both text and images, text-to-image generation as we know it wouldn't work.
💡 The shared embedding space is the foundational idea behind every VLM in this track. Articles 2-4 cover how to build it; articles 5-6 cover how to connect it to large language models.
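To make the mechanics concrete, here is a minimal sketch of zero-shot classification. The three-dimensional vectors are hand-crafted stand-ins for what a real VLM's image and text encoders would produce — real embeddings have hundreds of dimensions — but the decision rule is exactly the one described above: pick the caption whose embedding is closest to the image's.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-crafted embeddings standing in for real encoder outputs.
image_embedding = np.array([0.9, 0.1, 0.0])  # a photo of a dog
class_embeddings = {
    "a photo of a cat": np.array([0.1, 0.9, 0.0]),
    "a photo of a dog": np.array([1.0, 0.0, 0.1]),
    "a photo of a car": np.array([0.0, 0.1, 0.9]),
}

# Zero-shot classification: the caption nearest the image wins.
scores = {caption: cosine_similarity(image_embedding, emb)
          for caption, emb in class_embeddings.items()}
prediction = max(scores, key=scores.get)
print(prediction)  # "a photo of a dog"
```

Note that nothing here is specific to these three classes — swapping in any other set of captions changes the classifier without any retraining.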

Why Not Just Train More Classifiers?

The naive approach to computer vision is: for every new task, collect labelled data and train a supervised model. Want to detect 1,000 object categories? Collect and label training examples for each. Want to add "is this photo safe for work?" — collect more data, train another classifier. Need to distinguish dog breeds for a pet adoption app? More labels, another model.

This approach doesn't scale, for several compounding reasons:

  • Labelling cost: human annotation is slow and expensive. ImageNet — the dataset that powered a decade of computer vision research — took years and millions of dollars to annotate with its 14 million labels. Every new task demands a similar investment, and the labels need to be high-quality or the classifier learns the wrong patterns.
  • Closed-world assumption: a classifier only knows the categories it was trained on. If you trained a pet classifier on cats, dogs, and hamsters, and a user uploads a photo of a pangolin, the model has no option but to misclassify it as whichever trained category it vaguely resembles. It cannot say "I don't know this animal" — the concept of a pangolin simply doesn't exist in its label space.
  • No compositionality: "red car" and "blue car" become separate classes rather than compositions of colour + object. Want to recognise "red car at night" and "blue car in rain"? Those become two more separate classes. The label space grows combinatorially with every attribute you want to distinguish, and training data must be collected for each combination.

VLMs sidestep all three problems. They learn from image-text pairs scraped from the web — hundreds of millions of them, with the alt-text and captions as free supervision that never needed a human annotator. They handle open-vocabulary concepts: any text description works as a "class", including descriptions of objects the model never explicitly saw during training, because the text encoder generalises from its language understanding. And they compose naturally, because text is compositional by nature — the phrase "red car at night" is just a sequence of tokens, not a new category to be registered.
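As a small illustration of that compositionality, candidate "classes" can be generated as plain strings. Each prompt below would simply be fed to the text encoder and compared against the image — no new category is ever registered, and the four combinations cost nothing beyond string formatting:

```python
# Compose open-vocabulary "classes" from attributes instead of
# collecting training data for every colour x condition combination.
colours = ["red", "blue"]
conditions = ["at night", "in rain"]
prompts = [f"a photo of a {colour} car {condition}"
           for colour in colours
           for condition in conditions]
print(prompts)
# ['a photo of a red car at night', 'a photo of a red car in rain',
#  'a photo of a blue car at night', 'a photo of a blue car in rain']
```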

What's Ahead in This Track

This track builds up the full VLM stack, from low-level image-text alignment to conversational AI that can see. Here's the roadmap:

  • Article 2 — CLIP: a deep dive into CLIP, the model that proved contrastive pre-training on image-text pairs could match supervised classifiers trained on millions of labelled examples. We'll cover the contrastive loss, the dual-encoder architecture, and why it learns such transferable representations.
  • Article 3 — Vision Transformers: Vision Transformers (ViT), the architecture that turns images into sequences of tokens that transformers can process — the visual backbone inside most modern VLMs.
  • Article 4 — SigLIP & DINOv2: two important improvements: SigLIP, which scales contrastive learning to larger batches by replacing the softmax with a sigmoid loss, and DINOv2, which learns powerful visual features without any text supervision at all.
  • Article 5 — Multimodal Fusion: the fusion problem — how do you connect a vision encoder to a large language model so the LLM can "see"? We'll cover projection layers, cross-attention, and the architectural choices that determine how visual and textual information are combined.
  • Article 6 — Visual Instruction Tuning: the training recipe that teaches an LLM to hold open-ended conversations about images, answer visual questions, and follow complex instructions that reference visual content.

We'll start with the model that launched the field.

Quiz

Test your understanding of the motivation behind Vision-Language Models.

What is the core problem that VLMs address?

What does a shared embedding space enable that separate image and text models cannot do?

Why is zero-shot classification a significant advantage over traditional classifiers?

What is the "closed-world" limitation of traditional classifiers that VLMs overcome?