The Problem: Images Are Not Sequences
Transformers were designed for sequences of tokens. Text is naturally sequential — one word follows another, and the order carries meaning. Images, however, are 2D grids of pixels, typically represented as $H \times W \times 3$ tensors (height, width, and three RGB colour channels). A standard $224 \times 224$ image contains $224 \times 224 \times 3 = 150{,}528$ raw pixel values. If we tried to treat each pixel as an individual token, we would end up with a sequence of $224 \times 224 = 50{,}176$ positions — far too long for self-attention, which scales quadratically with sequence length ($O(n^2)$). At 50K tokens, the attention matrix alone would have over 2.5 billion entries per layer per head.
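The arithmetic above can be checked in a few lines (a back-of-the-envelope sketch, not part of any model):

```python
# Token counts and attention-matrix sizes for a 224x224 RGB image
# if every pixel were treated as a token.
H, W, C = 224, 224, 3

raw_values = H * W * C          # 150,528 raw pixel values
pixel_tokens = H * W            # 50,176 tokens (one per pixel)
pixel_attn = pixel_tokens ** 2  # attention entries per layer per head

print(raw_values)    # 150528
print(pixel_tokens)  # 50176
print(pixel_attn)    # 2517630976, i.e. ~2.5 billion
```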
Convolutional neural networks (CNNs) solved computer vision by exploiting spatial locality: each convolutional layer looks at a small patch of the input (typically $3 \times 3$ or $5 \times 5$ pixels), and deeper layers combine information from larger regions, gradually building up from local edges and textures to global understanding of objects and scenes. This hierarchical approach works well, but CNNs don't easily integrate with the transformer-based language models that sit at the heart of vision-language models. If we want a unified architecture where images and text can interact through the same attention mechanism, we need a way to turn images into token sequences that transformers can process natively.
The Vision Transformer (ViT) (Dosovitskiy et al., 2021) provides an elegant solution: split the image into fixed-size patches, treat each patch as a "word", and feed the resulting sequence into a standard transformer encoder. No convolutions, no pooling layers, no architectural changes to the transformer itself — just a new way of tokenising the input.
Patch Embedding: Splitting Images into Tokens
The core idea behind ViT is to divide the image into a grid of non-overlapping square patches of size $P \times P$ pixels, flatten each patch into a vector, and linearly project it into the transformer's embedding dimension. This converts a 2D image into a 1D sequence of patch tokens that a transformer can process exactly as it processes word tokens in text.
Given an image of size $H \times W \times C$ (height, width, and channels — typically $C = 3$ for RGB) and a patch size $P$, the procedure is:
- Step 1: The image is split into $N_p = \frac{H}{P} \times \frac{W}{P}$ non-overlapping patches. Each patch covers a $P \times P$ region of the original image.
- Step 2: Each patch is a $P \times P \times C$ tensor, which we flatten into a single vector $\mathbf{x}_p^i \in \mathbb{R}^{P^2 C}$. This just concatenates all the pixel values of the patch into one long vector.
- Step 3: A learnable linear projection $\mathbf{E} \in \mathbb{R}^{(P^2 C) \times D}$ maps each flattened patch to the model's hidden dimension $D$, producing the patch's initial embedding.
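The three steps above can be sketched in NumPy. This is a minimal illustration with ViT-B/16 shapes; the random projection matrix stands in for the learned weights $\mathbf{E}$:

```python
import numpy as np

# ViT-B/16 shapes: 224x224 RGB image, patch size 16, hidden dim 768.
H, W, C, P, D = 224, 224, 3, 16, 768
rng = np.random.default_rng(0)

image = rng.random((H, W, C))

# Step 1: split into an (H/P) x (W/P) grid of non-overlapping P x P patches.
grid_h, grid_w = H // P, W // P
patches = image.reshape(grid_h, P, grid_w, P, C).transpose(0, 2, 1, 3, 4)
# patches now has shape (grid_h, grid_w, P, P, C) = (14, 14, 16, 16, 3)

# Step 2: flatten each patch into a vector of P^2 * C pixel values.
x = patches.reshape(grid_h * grid_w, P * P * C)  # (196, 768)

# Step 3: linear projection into the hidden dimension D
# (random stand-in for the learned matrix E).
E = rng.standard_normal((P * P * C, D))
z0 = x @ E                                       # (196, 768) patch tokens

print(z0.shape)  # (196, 768)
```

Note that for this configuration the flattened-patch dimension $P^2 C = 768$ happens to equal $D = 768$, so `x` and `z0` share a shape; with other patch sizes or hidden dimensions they would differ.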
The full patch embedding for patch $i$ is:

$$\mathbf{z}_0^i = \mathbf{x}_p^i \mathbf{E} + \mathbf{e}_{\text{pos}}^i$$
Let's break down every component of this formula to understand what each piece does and why it's there.
$\mathbf{x}_p^i$ is the $i$-th flattened patch — a vector of $P^2 C$ raw pixel values. For a $16 \times 16$ RGB patch, this is $16 \times 16 \times 3 = 768$ values. These are just the red, green, and blue intensities for every pixel in the patch, laid out end-to-end in a single vector. At this stage, the representation is high-dimensional but carries no learned features — it's raw sensory data.
$\mathbf{E}$ is the projection matrix that maps from pixel space ($P^2 C$ dimensions) into the transformer's hidden dimension $D$. This is a learnable weight matrix — during training, the model learns what features to extract from raw patches. For ViT-Base with $P = 16$ and $D = 768$, this matrix is $768 \times 768$ — coincidentally square in this particular configuration, but the dimensions generally differ. With $P = 14$ and $D = 1024$ (ViT-Large), the matrix would be $588 \times 1024$. Mathematically, this linear projection is equivalent to a single convolutional layer with kernel size $P \times P$ and stride $P$, applied across the entire image. But conceptually, it serves as the tokeniser: just as a text tokeniser maps character sequences into embedding vectors, $\mathbf{E}$ maps pixel patches into token embeddings.
$\mathbf{e}_{\text{pos}}^i$ is a learned position embedding that tells the transformer where patch $i$ sits in the original image. Without it, the transformer would treat the patch sequence as an unordered set — self-attention is permutation-invariant by design, so shuffling the patches would produce the same outputs (up to the same permutation). The position embedding breaks this symmetry, encoding spatial location. We explain how this works in the next section.
The result $\mathbf{z}_0^i$ is a $D$-dimensional vector — the initial "token" representation for patch $i$, ready to enter the transformer layers. The subscript $0$ indicates this is the input to layer 0 (before any transformer processing). After passing through $L$ transformer layers, this token will carry rich contextual information about not just its own patch, but all other patches it attended to.
Let's ground this with a concrete example. A $224 \times 224$ RGB image with patch size $P = 16$ yields $(224 / 16) \times (224 / 16) = 14 \times 14 = 196$ patches. Each patch is a $16 \times 16 \times 3 = 768$-dimensional vector of raw pixel values. After projection by $\mathbf{E}$, we have 196 tokens of dimension $D = 768$. Compare that to the original pixel-as-token approach: 196 tokens vs. 50,176 pixels. Self-attention on 196 tokens requires a $196 \times 196 = 38{,}416$-entry attention matrix — trivially small compared to the $50{,}176 \times 50{,}176 \approx 2.5$ billion entries we'd need for pixel-level attention.
Position Embeddings: Telling Patches Where They Are
Standard self-attention is permutation-invariant: if you shuffle the input tokens, the output changes only by the same permutation. This is a mathematical property of the dot-product attention mechanism — the query-key similarity scores don't depend on the order of the tokens, only on their content. For text, positional encoding breaks this symmetry because word order matters ("the dog bit the man" and "the man bit the dog" have very different meanings). For images, spatial position matters just as much — a patch in the top-left corner of an image carries different information than one in the bottom-right, even if the pixel values happen to be similar.
ViT uses learned 1D position embeddings : one learnable vector $\mathbf{e}_{\text{pos}}^i \in \mathbb{R}^D$ per patch position $i \in \{1, \ldots, N_p\}$. The patches are numbered in raster order (left-to-right, top-to-bottom, the same order you read a page), and each position index gets its own learned embedding that is added to the corresponding patch embedding. For a $14 \times 14$ grid of patches, position 1 is the top-left patch, position 14 is the top-right, position 15 is the first patch in the second row, and position 196 is the bottom-right.
Why 1D instead of 2D? It might seem natural to encode the row and column of each patch separately — after all, images have two spatial dimensions. The original ViT paper tested explicit 2D positional encodings (where each patch gets a row embedding and a column embedding that are concatenated or added) and found that they perform comparably to the simpler 1D learned embeddings. The model learns the 2D structure implicitly from the data. During training, the position embeddings for spatially adjacent patches naturally end up similar to each other — the model discovers the grid layout on its own, without us having to encode it explicitly.
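A small sketch of the mechanics, with random vectors standing in for the learned embedding table. Positions follow raster order over the $14 \times 14$ grid (0-based indices here, whereas the text counts from 1):

```python
import numpy as np

Np, D = 196, 768
rng = np.random.default_rng(0)

patch_tokens = rng.standard_normal((Np, D))  # output of the patch projection
e_pos = rng.standard_normal((Np, D))         # one learned vector per position

z0 = patch_tokens + e_pos                    # position-aware input tokens

# Raster-order indexing: 0-based row r, column c maps to index r * 14 + c.
def raster_index(row, col, grid_w=14):
    return row * grid_w + col

print(raster_index(0, 0))    # 0   -> top-left patch
print(raster_index(0, 13))   # 13  -> top-right patch
print(raster_index(1, 0))    # 14  -> first patch of the second row
print(raster_index(13, 13))  # 195 -> bottom-right patch
```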
The CLS Token
ViT prepends one extra learnable token — called $[\text{CLS}]$ (borrowing the name from BERT's classification token) — to the sequence of patch tokens. This token has no corresponding image patch. The full input sequence to the transformer encoder is:

$$\mathbf{Z}_0 = [\mathbf{z}_0^{\text{CLS}};\; \mathbf{z}_0^1;\; \mathbf{z}_0^2;\; \ldots;\; \mathbf{z}_0^{N_p}]$$
The total sequence length is $N_p + 1$: the 196 patch tokens plus the one CLS token, giving 197 tokens for a $224 \times 224$ image with $P = 16$. The CLS token is initialised randomly and participates fully in self-attention — it can attend to every patch token, and every patch token can attend to it. Through the transformer's layers, it learns to aggregate information from all patches into a single global representation.
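In code, prepending the CLS token is a single concatenation (random initial values stand in for the learned parameter):

```python
import numpy as np

Np, D = 196, 768
rng = np.random.default_rng(0)

patch_tokens = rng.standard_normal((Np, D))
cls_token = rng.standard_normal((1, D))  # learnable; randomly initialised

# Prepend CLS so it occupies position 0 of the transformer input.
Z0 = np.concatenate([cls_token, patch_tokens], axis=0)
print(Z0.shape)  # (197, 768) -- Np + 1 tokens
```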
After the final transformer layer $L$, the CLS token's output serves as the global image representation:

$$\mathbf{v}_I = \text{LayerNorm}(\mathbf{z}_L^{\text{CLS}})$$
Here $\mathbf{z}_L^{\text{CLS}}$ is the CLS token's representation after all $L$ transformer layers have processed it, and LayerNorm normalises the vector to stabilise its scale before it is used downstream.
Why do we need this? In many applications — image classification, contrastive learning with text in CLIP, similarity search — we need a single fixed-size vector to represent the entire image. The CLS token acts as a learnable "summary": since self-attention lets it attend to every patch, it can learn to extract and combine the most relevant information from across the image into one $D$-dimensional vector. In the first few layers, the CLS token might attend broadly to all patches; in later layers, it tends to focus on the most informative regions.
An alternative to the CLS token is global average pooling (GAP): simply take the mean of all patch token outputs after the final layer: $\mathbf{v}_I = \frac{1}{N_p} \sum_{i=1}^{N_p} \mathbf{z}_L^i$. Some ViT variants use GAP instead of the CLS token and achieve comparable results. The tradeoff is that GAP gives equal weight to all patches, while the CLS token can learn to weight patches differently through attention — potentially useful when some patches (e.g., those containing the main object) are more important than others (e.g., background patches).
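The two pooling strategies differ by one line. A sketch on random final-layer outputs, with the CLS token at position 0:

```python
import numpy as np

Np, D = 196, 768
rng = np.random.default_rng(0)

# Final-layer outputs: [CLS] token followed by 196 patch tokens.
zL = rng.standard_normal((1 + Np, D))

v_cls = zL[0]                # CLS pooling: take the first token's output
v_gap = zL[1:].mean(axis=0)  # GAP: uniform mean over the patch tokens only

print(v_cls.shape, v_gap.shape)  # (768,) (768,)
```

Both yield a single $D$-dimensional vector; GAP fixes the weights to $1/N_p$, while CLS pooling lets attention learn them.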
ViT Configurations and Scale
The original ViT paper introduced several model configurations at different scales, following a naming convention of ViT-{size}/{patch size}. The three standard configurations are:
- ViT-B/16 (Base, patch 16): 12 transformer layers, hidden dimension $D = 768$, 12 attention heads, 86M parameters, producing 196 patches for a $224 \times 224$ image.
- ViT-L/14 (Large, patch 14): 24 transformer layers, hidden dimension $D = 1024$, 16 attention heads, 307M parameters, producing 256 patches for a $224 \times 224$ image.
- ViT-H/14 (Huge, patch 14): 32 transformer layers, hidden dimension $D = 1280$, 16 attention heads, 632M parameters, producing 256 patches for a $224 \times 224$ image.
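The token counts in the list above follow directly from the patch sizes, and the relative attention cost discussed below follows from the token counts. A quick sketch (parameter counts copied from the text; everything else computed):

```python
# Standard ViT configurations; params_m values are from the text above.
configs = {
    "ViT-B/16": {"layers": 12, "dim": 768,  "heads": 12, "patch": 16, "params_m": 86},
    "ViT-L/14": {"layers": 24, "dim": 1024, "heads": 16, "patch": 14, "params_m": 307},
    "ViT-H/14": {"layers": 32, "dim": 1280, "heads": 16, "patch": 14, "params_m": 632},
}

tokens = {name: (224 // cfg["patch"]) ** 2 for name, cfg in configs.items()}
print(tokens)  # {'ViT-B/16': 196, 'ViT-L/14': 256, 'ViT-H/14': 256}

# Per-layer attention cost scales with the square of the sequence length:
ratio = 256 ** 2 / 196 ** 2
print(round(ratio, 2))  # 1.71, i.e. ~70% more expensive per layer
```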
The naming convention encodes the two most important design choices: model size (Base, Large, or Huge) and patch size (which determines the number of tokens). CLIP's best-performing image encoder is ViT-L/14, and many subsequent VLMs — including SigLIP and OpenVLA — use it or a close variant as their visual backbone.
Larger models with smaller patches tend to perform better, but the cost grows steeply. ViT-H/14 has roughly 7 times the parameters of ViT-B/16 (632M vs. 86M), and it processes 30% more tokens per image (256 vs. 196). But parameters are only part of the compute story. The attention cost scales quadratically with sequence length: $256^2 / 196^2 \approx 1.7$, so attention alone is 70% more expensive per layer — and ViT-H/14 has 2.7 times more layers (32 vs. 12) than ViT-B/16. Altogether, ViT-H/14 is considerably more expensive per image, which is why the choice of ViT variant is always a tradeoff between accuracy and computational budget.
How ViT Fits into VLMs
In CLIP (covered in the previous article), the ViT acts as the image encoder: each image passes through the full ViT, and the CLS token output becomes $\mathbf{v}_I$, the image embedding used for contrastive learning against text embeddings. Only the single CLS vector is used — all the spatial information encoded in the individual patch tokens is discarded after the ViT, compressed into one $D$-dimensional summary.
In generative VLMs — models like LLaVA, BLIP-2, and Flamingo (which we will cover in later articles in this track) — the approach is different and more expressive. Instead of using just the CLS token, these models keep all patch token outputs $\mathbf{z}_L^1, \mathbf{z}_L^2, \ldots, \mathbf{z}_L^{N_p}$ from the ViT and feed them into the language model as visual context tokens. The language model then attends to these patch tokens alongside text tokens, enabling it to reason about specific spatial regions of the image.
This preserves spatial information that the CLS token might compress away. For example, if a user asks "What colour is the object in the top-left corner?", the language model can attend specifically to the patch tokens from that region. The tradeoff is a longer input sequence for the LLM to process: 196 extra tokens (for ViT-B/16) or 256 (for ViT-L/14) are prepended to the text tokens, increasing both memory usage and compute. Some models use a projection layer or a cross-attention module to compress the patch tokens before feeding them to the LLM, reducing the sequence length while trying to preserve the most important visual information.
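The simplest version of the projection layer mentioned above is a single linear map from the ViT's hidden dimension to the LLM's embedding dimension (the LLaVA-style design). A hypothetical sketch — the dimensions and random weights are illustrative, not from any specific released model:

```python
import numpy as np

# Assumed dimensions: 196 patch tokens, ViT dim 768, LLM embedding dim 4096.
Np, D_vit, D_llm = 196, 768, 4096
rng = np.random.default_rng(0)

patch_tokens = rng.standard_normal((Np, D_vit))  # final-layer ViT outputs
W_proj = rng.standard_normal((D_vit, D_llm))     # random stand-in for trained weights

# Project every patch token into the LLM's embedding space; these become
# the visual context tokens prepended to the text tokens.
visual_context = patch_tokens @ W_proj
print(visual_context.shape)  # (196, 4096)
```

Cross-attention modules (as in BLIP-2's Q-Former) go further by also reducing $N_p$ to a smaller fixed number of query tokens, not just changing the dimension.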
Quiz
Test your understanding of the Vision Transformer architecture, from patch embedding to the CLS token.
Why can't we simply treat each pixel of a 224×224 image as a separate token in a transformer?
In the patch embedding formula $\mathbf{z}_0^i = \mathbf{x}_p^i \mathbf{E} + \mathbf{e}_{\text{pos}}^i$, what role does the projection matrix $\mathbf{E}$ play?
Why does ViT need position embeddings added to the patch tokens?
How does the CLS token produce a global image representation?