The Fusion Problem

We now have strong vision encoders (ViT, SigLIP, DINOv2) that produce visual token sequences, and powerful large language models (LLMs) that process text token sequences. The question is: how do we feed visual information into the LLM so it can reason about images alongside text?

This is the fusion problem, and it is not as simple as concatenating pixels to words. The vision encoder outputs live in a different embedding space than the LLM's token embeddings. A ViT-L/14 produces 256 tokens of dimension 1024, while a Llama-2 7B expects tokens of dimension 4096. The two sets of vectors don't share a coordinate system — a direction that means "golden retriever" in the vision encoder's space has no reason to correspond to the same direction in the LLM's space. We need a bridge: a module that translates visual features into a form the LLM can process as if they were ordinary text tokens.

Three dominant approaches have emerged, each making a different tradeoff between simplicity, information preservation, and computational cost. We will walk through all three in detail, examine the formulas that define them, and then compare their strengths and weaknesses so you can understand when each one tends to be the right choice.

Linear Projection (LLaVA)

LLaVA (Liu et al., 2023) takes the simplest possible approach: a single linear layer (or a small MLP) maps each visual token from the vision encoder's dimension to the LLM's dimension. No compression, no learned queries, no architectural surgery on the LLM — just a matrix multiplication that re-expresses each visual token in the LLM's coordinate system:

$$\mathbf{h}_v^i = \mathbf{W} \mathbf{g}_v^i + \mathbf{b}$$

Let's break down every component of this formula to understand what each piece does and why it's there.

$\mathbf{g}_v^i \in \mathbb{R}^{d_v}$ : the $i$-th output token from the vision encoder. For ViT-L/14, $d_v = 1024$, and $i$ ranges from 1 to $N_v = 256$ (the number of patch tokens the vision encoder produces). Each of these vectors encodes rich visual information about the corresponding image patch — edges, textures, objects, spatial relationships — but in the vision encoder's own coordinate system, which the LLM cannot natively interpret.

$\mathbf{W} \in \mathbb{R}^{d_{\text{LLM}} \times d_v}$ : a learnable weight matrix that maps from the vision dimension to the LLM dimension. For ViT-L/14 projecting into Llama-2 7B, this is a $4096 \times 1024$ matrix — roughly 4.2 million parameters. During training, the model learns how to rotate and scale the visual features so they land in the right region of the LLM's embedding space. Think of it as a translation dictionary between two languages: the vision encoder "speaks" 1024-dimensional vectors, the LLM "speaks" 4096-dimensional vectors, and $\mathbf{W}$ learns the mapping between them.

$\mathbf{b} \in \mathbb{R}^{d_{\text{LLM}}}$ : a bias vector. It shifts the projected vectors by a constant offset, allowing the model to centre the visual tokens at the right location in the LLM's embedding space even if the vision encoder and LLM have different baseline activation levels.

$\mathbf{h}_v^i \in \mathbb{R}^{d_{\text{LLM}}}$ : the projected visual token, now in the LLM's embedding space. It has the same dimensionality as a text token embedding, so the LLM can treat it exactly like a word — attending to it, building context from it, and generating text conditioned on it, all through the standard self-attention mechanism.

The projected visual tokens are simply concatenated with the text tokens to form the LLM's input sequence:

$$[\mathbf{h}_v^1, \ldots, \mathbf{h}_v^{N_v}, \mathbf{t}_1, \ldots, \mathbf{t}_m]$$

where $N_v$ is the number of visual tokens (e.g., 256 from ViT-L/14) and $m$ is the number of text tokens. The LLM processes this combined sequence with its standard self-attention, meaning every text token can attend to every visual token and vice versa. No special cross-attention mechanism is needed — the visual tokens are first-class citizens in the LLM's input.
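As a concrete sketch, here is the projection and concatenation in NumPy, using the shapes quoted above (ViT-L/14 into Llama-2 7B). The weights are random stand-ins for what training would learn; this illustrates only the shapes and the data flow:

```python
import numpy as np

# Illustrative shapes from the text: ViT-L/14 -> Llama-2 7B
N_v, d_v = 256, 1024      # visual tokens and vision-encoder dimension
m, d_llm = 512, 4096      # text tokens and LLM embedding dimension

rng = np.random.default_rng(0)
g_v = rng.standard_normal((N_v, d_v))         # vision encoder outputs
W = rng.standard_normal((d_llm, d_v)) * 0.02  # learnable projection matrix
b = np.zeros(d_llm)                           # learnable bias

# h_v^i = W g_v^i + b, applied to all visual tokens at once
h_v = g_v @ W.T + b                           # shape (256, 4096)

# Concatenate with (placeholder) text token embeddings to form the LLM input
t = rng.standard_normal((m, d_llm))
llm_input = np.concatenate([h_v, t], axis=0)  # shape (768, 4096)
```

After the projection, the visual tokens are shape-compatible with text embeddings, which is exactly what lets the LLM treat them as ordinary sequence positions.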

The advantages of this approach are compelling:

  • Minimal new parameters: the projection adds only $d_v \times d_{\text{LLM}} + d_{\text{LLM}}$ parameters — roughly 4 million for ViT-L projecting into Llama-2. That is a negligible fraction of the LLM's billions of parameters.
  • Full information preservation: every single patch token from the vision encoder is passed to the LLM. No visual information is compressed or discarded, so fine-grained spatial details (the position of a small object, the text on a sign) are all available for the LLM to reason about.
  • Simplicity: the entire fusion mechanism is a single matrix multiplication per token. It is trivial to implement, fast to train, and easy to debug.

The disadvantage is equally clear: the LLM must process all $N_v$ visual tokens (e.g., 256 from ViT-L/14), which adds directly to the sequence length. Since self-attention cost scales quadratically with sequence length, adding 256 visual tokens to a 512-token text prompt means the LLM is now processing 768 tokens — a $2.25\times$ increase in attention cost. For high-resolution images (which produce more patches), multiple images in a conversation, or video frames, this cost escalates quickly.
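The quadratic scaling is worth making concrete. A small calculation, using the 256-tokens-per-image and 512-token-prompt figures from the text, shows how quickly the relative attention cost grows as images are added:

```python
# Self-attention cost grows with the square of sequence length.
# Figures from the text: 256 patch tokens per image, 512-token prompt.
TEXT_LEN, TOKENS_PER_IMAGE = 512, 256

def relative_attention_cost(n_images):
    seq_len = TOKENS_PER_IMAGE * n_images + TEXT_LEN
    return (seq_len / TEXT_LEN) ** 2  # cost relative to a text-only prompt

# One image: 768 tokens -> 2.25x; two images -> 4x; four images -> 9x.
costs = {n: relative_attention_cost(n) for n in (1, 2, 4)}
```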

💡 LLaVA-1.5 upgraded the linear projection to a two-layer MLP with GELU activation: $\mathbf{h}_v^i = \mathbf{W}_2 \, \text{GELU}(\mathbf{W}_1 \mathbf{g}_v^i + \mathbf{b}_1) + \mathbf{b}_2$. This small change improved performance noticeably, suggesting the linear bridge was slightly too restrictive for aligning the two embedding spaces. The nonlinearity lets the projection learn a more flexible mapping — it can, for instance, suppress certain visual features that are irrelevant to language tasks while amplifying others.
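The upgraded projector can be sketched in a few lines. This is an illustrative NumPy version of the formula above, with random stand-in weights and the common tanh approximation of GELU:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, widely used in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(g_v, W1, b1, W2, b2):
    # LLaVA-1.5-style projector: h = W2 GELU(W1 g + b1) + b2
    return gelu(g_v @ W1.T + b1) @ W2.T + b2

rng = np.random.default_rng(0)
d_v, d_llm = 1024, 4096
W1 = rng.standard_normal((d_llm, d_v)) * 0.02
b1 = np.zeros(d_llm)
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

g_v = rng.standard_normal((256, d_v))            # 256 patch tokens from ViT-L/14
h_v = mlp_projector(g_v, W1, b1, W2, b2)         # shape (256, 4096)
```

Note that the hidden width here is simply $d_{\text{LLM}}$ for illustration; the actual choice is an implementation detail.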

Q-Former (BLIP-2)

BLIP-2 (Li et al., 2023) addresses the token count problem head-on. Instead of passing all $N_v$ visual tokens to the LLM, it introduces a learned attention mechanism called the Q-Former (Querying Transformer) that compresses the visual information into a fixed, much smaller set of tokens.

The core idea: introduce $N_q$ learnable query tokens where $N_q \ll N_v$ (typically $N_q = 32$). These queries cross-attend to the full set of visual features and extract the most relevant information, distilling hundreds of visual tokens down to a compact summary:

$$\mathbf{Q}_{\text{out}} = \text{CrossAttention}(\mathbf{Q}, \mathbf{K}_v, \mathbf{V}_v)$$

Let's unpack each component of this equation to understand the compression mechanism.

$\mathbf{Q} \in \mathbb{R}^{N_q \times D_q}$ : the learnable query tokens. These are randomly initialised and trained end-to-end with the rest of the model. Think of them as "questions" the model learns to ask about the image. Over the course of training, different queries tend to specialise — one might learn to attend to object identities ("what things are in this image?"), another to spatial layout ("where are they relative to each other?"), another to text or fine-grained details. With $N_q = 32$, the model has 32 such learned questions it can ask of any image. The dimension $D_q$ is the internal dimension of the Q-Former module, typically 768.

$\mathbf{K}_v, \mathbf{V}_v \in \mathbb{R}^{N_v \times D_v}$ : the keys and values derived from the vision encoder's output tokens via linear projections. These are the visual features the queries will attend to. $N_v$ is the number of visual tokens (e.g., 256 for ViT-L/14) and $D_v$ is the projected dimension. The keys determine which visual tokens are relevant to each query (via the attention scores), and the values carry the actual information that gets extracted.

$\mathbf{Q}_{\text{out}} \in \mathbb{R}^{N_q \times D_q}$ : the output — $N_q$ tokens that now carry compressed visual information. Each output query is a weighted combination of the visual value vectors, where the weights are determined by the attention scores between that query and all $N_v$ visual keys. The result is 32 tokens that together summarise the entire image.

To appreciate the compression ratio: with $N_q = 32$ and $N_v = 256$, this is an 8$\times$ compression. The 32 output tokens are then projected into the LLM's dimension (via a linear layer, just like in LLaVA) and concatenated with text tokens. But now the LLM only processes 32 visual tokens instead of 256 — a dramatic reduction in the attention cost. For a 512-token text prompt, the sequence goes from $256 + 512 = 768$ tokens (LLaVA-style) to $32 + 512 = 544$ tokens, saving roughly half the attention computation.
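The compression step can be sketched as a single cross-attention operation in NumPy. This is a single-head, one-layer illustration (the real Q-Former is a multi-layer, multi-head transformer, and its queries also self-attend); the weights are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    # Each query forms a distribution over the N_v visual tokens and
    # extracts a weighted combination of their values.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # shape (N_q, N_v)
    weights = softmax(scores, axis=-1)
    return weights @ V                        # shape (N_q, D)

rng = np.random.default_rng(0)
N_q, N_v, D = 32, 256, 768                    # 32 queries compress 256 tokens
Q = rng.standard_normal((N_q, D))             # learnable query tokens
vis = rng.standard_normal((N_v, D))           # visual features (already at width D)
W_k = rng.standard_normal((D, D)) * 0.02      # key projection
W_v = rng.standard_normal((D, D)) * 0.02      # value projection

Q_out = cross_attention(Q, vis @ W_k, vis @ W_v)  # shape (32, 768)
```

However many visual tokens come in, exactly $N_q$ tokens come out — which is the property that makes the downstream cost predictable.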

The advantages of Q-Former:

  • Fixed, controllable number of visual tokens: regardless of image resolution or the number of patches the vision encoder produces, the Q-Former always outputs exactly $N_q$ tokens. This makes computational costs predictable and manageable.
  • Learned information selection: the cross-attention mechanism lets the model learn what visual information is most important to extract. Queries that attend to irrelevant patches (e.g., blank background regions) will receive low attention scores, so they naturally focus on the informative parts of the image.

The disadvantage is that the compression can be lossy. Squeezing 256 tokens into 32 creates an information bottleneck — some details that were present in the original visual tokens may not survive the compression. This tends to hurt on tasks requiring fine-grained spatial precision, such as counting small objects in a cluttered scene, reading text embedded in images, or answering questions about the relative positions of objects. The 32 queries may not have enough capacity to preserve all the spatial detail that 256 patch tokens carry.

Perceiver Resampler & Gated Cross-Attention (Flamingo)

Flamingo (Alayrac et al., 2022) takes a fundamentally different approach to fusion. Instead of modifying the visual tokens before they enter the LLM (as LLaVA and BLIP-2 do), Flamingo inserts cross-attention layers directly inside the LLM, allowing visual information to influence text processing at every stage of the network.

The architecture has two key components:

Perceiver Resampler: similar to the Q-Former, a set of learnable queries cross-attend to the visual tokens, compressing them into a fixed number of "visual summary" tokens. This handles the token count issue — we don't want the LLM to process hundreds of raw visual tokens at every layer. The resampled tokens are compact representations of the image that can be efficiently consumed by the cross-attention layers.

Gated cross-attention layers: new attention layers interleaved between the LLM's existing self-attention layers. At each of these inserted layers, the LLM's hidden states attend to the visual summary tokens, allowing the text representations to absorb visual information at multiple depths of the network:

$$\mathbf{h}_{\text{out}} = \mathbf{h}_{\text{in}} + \alpha \cdot \text{CrossAttention}(\mathbf{h}_{\text{in}}, \mathbf{K}_{\text{vis}}, \mathbf{V}_{\text{vis}})$$

Let's examine each component of this equation carefully.

$\mathbf{h}_{\text{in}}$ : the LLM's hidden states at a given layer — these are the text representations mid-computation, already carrying information from previous self-attention layers. At layer 1, these are close to the raw token embeddings; at deeper layers, they carry increasingly abstract linguistic representations.

$\mathbf{K}_{\text{vis}}, \mathbf{V}_{\text{vis}}$ : keys and values derived from the visual summary tokens (the output of the Perceiver Resampler). The text hidden states query these visual keys and values through standard cross-attention, retrieving visual information relevant to the current text context. Different layers of the LLM may attend to different aspects of the visual information — early layers might pick up low-level features (colours, shapes), while deeper layers access high-level semantics (object identities, relationships).

$\alpha$ : a learned gating scalar, initialised to zero. This is the most critical design choice in the entire Flamingo architecture, and it deserves careful explanation.

Why does the gate matter so much? The LLM being used (e.g., Chinchilla 70B in the original Flamingo paper) was pre-trained purely on text. Its weights encode billions of tokens' worth of language understanding — grammar, facts, reasoning patterns. If we insert randomly initialised cross-attention layers and start training immediately, the random cross-attention outputs would be added to the LLM's hidden states, injecting noise into every layer. This would catastrophically disrupt the LLM's pre-trained text representations on the very first training step, potentially destroying the language capabilities we want to preserve.

By initialising $\alpha = 0$, the cross-attention term $\alpha \cdot \text{CrossAttention}(\ldots)$ is exactly zero at the start of training. The equation reduces to $\mathbf{h}_{\text{out}} = \mathbf{h}_{\text{in}}$ — a pure identity function. The LLM behaves exactly as it did before the cross-attention layers were inserted, producing the same outputs for the same text inputs. As training progresses, $\alpha$ gradually increases from zero, and the model smoothly learns to incorporate visual information at a pace that doesn't destabilise its language capabilities. The gate acts as a safety valve: the model opens it only as fast as it can productively use the visual signal.
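The identity-at-initialisation property is easy to verify in a toy NumPy version of the gated update. This is a single-head sketch with random stand-in tensors (Flamingo's actual layers use tanh gating and multi-head attention):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(h_in, K_vis, V_vis, alpha):
    # h_out = h_in + alpha * CrossAttention(h_in, K_vis, V_vis)
    scores = h_in @ K_vis.T / np.sqrt(h_in.shape[-1])
    attn = softmax(scores, axis=-1) @ V_vis
    return h_in + alpha * attn

rng = np.random.default_rng(0)
h_in = rng.standard_normal((16, 512))    # text hidden states mid-network
K_vis = rng.standard_normal((64, 512))   # visual summary keys
V_vis = rng.standard_normal((64, 512))   # visual summary values

# At initialisation (alpha = 0) the layer is an exact identity:
h_out = gated_cross_attention(h_in, K_vis, V_vis, alpha=0.0)
assert np.array_equal(h_out, h_in)
```

Once $\alpha$ moves away from zero during training, the same layer begins blending visual information into the text stream.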

The advantages of the Flamingo approach:

  • Deep fusion: visual information can influence every layer of the LLM, not just the input. In LLaVA and BLIP-2, visual tokens enter at the bottom of the LLM and interact with text only through the LLM's standard self-attention. In Flamingo, dedicated cross-attention layers at multiple depths give the model more opportunities to integrate vision and language — the LLM can refine its text representations using visual context at every stage of processing.
  • Preservation of pre-trained capabilities: the gating mechanism ensures the LLM's language abilities are not degraded during multimodal training. The original LLM weights can even be kept frozen, with only the cross-attention layers and gates being trained.
  • Natural handling of interleaved inputs: Flamingo can process conversations with multiple images interspersed with text (e.g., "Look at this image: [image1]. Now compare it with this one: [image2]. What changed?"). Each image's visual tokens are available at the cross-attention layers, and the LLM can selectively attend to the relevant image at each point in the text.

The disadvantages are the flip side of this power:

  • Architectural complexity: new cross-attention layers and gating parameters must be inserted at every (or every other) LLM layer. For a 32-layer LLM, this could mean 16 new cross-attention modules, each with its own query, key, value projections and gating scalar.
  • Training and fine-tuning difficulty: modifying the LLM's internals means you need access to the model's architecture, not just its input interface. This makes Flamingo-style fusion harder to apply to closed-source or API-only LLMs.

Comparing Fusion Approaches

Now that we have examined all three approaches in detail, let's place them side by side to highlight the tradeoffs.

  • Token count passed to LLM: Linear projection (LLaVA): $N_v$ — all visual tokens (e.g., 256 for ViT-L/14). Q-Former (BLIP-2): $N_q$ — a fixed, smaller number (e.g., 32). Flamingo: $N_q$ — also a fixed number via the Perceiver Resampler.
  • New parameters: Linear projection: very few, roughly 4M for ViT-L projecting into Llama-2. Q-Former: moderate, approximately 100M for the Q-Former module itself. Flamingo: substantial, as cross-attention parameters are added at every LLM layer.
  • Information preservation: Linear projection: highest — no compression occurs, every patch token reaches the LLM. Q-Former: moderate — the learned compression can be lossy, and fine-grained spatial details may not survive the bottleneck. Flamingo: moderate — similar compression via the Perceiver Resampler, but the deep cross-attention fusion may partially compensate by allowing the model to access visual information at multiple layers.
  • Architectural complexity: Linear projection: minimal — a single matrix multiplication. Q-Former: moderate — a standalone transformer module that sits between the vision encoder and the LLM. Flamingo: high — requires modifying the LLM's internal architecture by inserting new layers.

In practice, the linear projection approach (LLaVA-style) has become the most widely adopted for its simplicity and surprisingly strong performance. When the vision encoder is good and the LLM is powerful, a simple linear bridge often suffices — the LLM can learn to interpret the projected visual tokens through its own self-attention layers without needing a complex fusion mechanism. The Q-Former approach tends to be favoured when token efficiency is critical — for example, when processing many images in a single context (video understanding, multi-image comparisons) and the quadratic attention cost of hundreds of visual tokens per image becomes prohibitive. Flamingo-style gated cross-attention is typically used in production systems where deep vision-language interaction is worth the additional architectural cost, particularly for few-shot learning and interleaved image-text conversations.

Quiz

Test your understanding of multimodal fusion approaches, from linear projections to gated cross-attention.

What problem does the Q-Former in BLIP-2 solve compared to LLaVA's linear projection?

Why does Flamingo initialise its gating parameter $\alpha$ to zero?

What is the main disadvantage of the linear projection approach used in LLaVA?

In BLIP-2's Q-Former, what are the learnable query tokens?