From Alignment to Conversation

Articles 2–5 covered how to build a shared vision-language space (CLIP, SigLIP) and how to connect vision encoders to LLMs (linear projection, Q-Former, gated cross-attention). But connecting the modules isn't enough — the LLM needs to learn how to use visual information in practice: to answer questions about images, describe what it sees, reason about visual content, and follow instructions that involve both text and images.

This is the problem visual instruction tuning solves. The term comes from LLaVA (Liu et al., 2023), which showed that a two-stage training recipe — first align, then instruct — can turn a pre-trained LLM into a capable multimodal assistant. The idea is deceptively simple: take a frozen vision encoder, attach it to a frozen LLM through a small projection layer, and then train this system in two careful stages that progressively unlock its multimodal capabilities.

Stage 1: Feature Alignment

In the first stage, only the projection layer (the bridge between vision encoder and LLM, covered in article 5) is trained. The vision encoder and LLM weights are both frozen. The goal is to teach this thin bridge to translate visual features into the token space that the LLM already understands.

The training data is simple: image-caption pairs (e.g., from CC3M or LAION). Given an image $I$ and its caption $c = (c_1, \ldots, c_T)$, the model learns to generate the caption token by token, conditioned on the visual tokens:

$$\mathcal{L}_{\text{align}} = -\sum_{t=1}^{T} \log p_\theta(c_t \mid \mathbf{h}_v^1, \ldots, \mathbf{h}_v^{N_v}, c_1, \ldots, c_{t-1})$$

Let's unpack every component of this loss to understand what it's doing and why.

$\mathbf{h}_v^1, \ldots, \mathbf{h}_v^{N_v}$: the projected visual tokens produced by the projection layer. These are the visual features from the frozen vision encoder (e.g., ViT patch tokens), mapped through the learnable projection into the LLM's embedding dimension. They form the "visual context" that the LLM conditions on — the model sees these tokens as if they were part of the text prompt, except they carry visual information about the image. The number of visual tokens $N_v$ depends on the vision encoder (e.g., 256 for ViT-L/14 with a $224 \times 224$ image).
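As a concrete illustration of the shapes involved, here is a toy projection in plain Python (the dimensions and values are made up for illustration; the real layer is a learned matrix, not an identity-like one):

```python
# Illustrative shapes only: a toy linear projection from vision-encoder
# features to the LLM embedding space. The real layer is learned in Stage 1.
N_V = (224 // 14) ** 2          # patch tokens for ViT-L/14 at 224x224: 16*16 = 256

def project(visual_feats, W, b):
    """Map N_v x d_v patch features to N_v x d_llm visual tokens."""
    d_v, d_llm = len(W), len(b)
    return [
        [sum(f[i] * W[i][j] for i in range(d_v)) + b[j] for j in range(d_llm)]
        for f in visual_feats
    ]

# Toy example: 2 patch features of dim 3 projected into an LLM dim of 2.
feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0]
tokens = project(feats, W, b)   # -> [[2.0, 1.0], [0.0, 1.0]]
```

The LLM then consumes `tokens` exactly as it would consume text embeddings; nothing in the transformer distinguishes them from word tokens.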

$c_1, \ldots, c_{t-1}$: the caption tokens generated so far. During training, these are teacher-forced — the model receives the ground-truth previous tokens rather than its own predictions. This is standard practice for autoregressive training because it prevents errors from compounding: if the model generated a wrong token at position 3, feeding that mistake back would shift the entire context for positions 4, 5, 6, and so on, making the training signal noisy and unreliable.

$p_\theta(c_t \mid \cdot)$: the LLM's predicted probability for the next caption token $c_t$, conditioned on both the visual tokens and all preceding caption tokens. The parameters $\theta$ here refer specifically to the projection layer's weights (since the LLM and vision encoder are frozen). This is the model's answer to the question: "given this image and the caption so far, what word comes next?"

The sum over $t$ from 1 to $T$: standard autoregressive language modelling loss — we compute the negative log-probability of each ground-truth caption token and sum them up. Minimising this loss means maximising the probability the model assigns to the correct caption, token by token. This is the same cross-entropy loss used to pre-train the LLM itself on text; the only difference is that visual tokens now appear in the conditioning context.
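Putting the pieces together, the loss is just a sum of per-token negative log-probabilities. A minimal sketch in plain Python, with made-up token probabilities standing in for the model's outputs:

```python
import math

def alignment_loss(step_probs):
    """L_align: negative log-likelihood of the ground-truth caption.

    step_probs[t] stands for p_theta(c_t | visual tokens, c_1..c_{t-1}),
    the probability the frozen LLM assigns to each ground-truth caption
    token, conditioned on the projected visual tokens. Values here are
    illustrative, not real model outputs.
    """
    return -sum(math.log(p) for p in step_probs)

# A confident model (high probabilities on the true tokens) incurs a lower loss.
confident = alignment_loss([0.9, 0.8, 0.9])
uncertain = alignment_loss([0.2, 0.3, 0.1])
assert confident < uncertain
```

Gradients of this scalar flow back through the visual tokens into the projection layer, which is the only component being updated in Stage 1.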

Since only the projection layer is trained, this stage is fast (typically a few hours on 8 GPUs) and uses relatively little data (∼600K image-caption pairs in LLaVA). Its purpose is narrow: teach the projection layer to map visual features into a form the frozen LLM can make sense of — aligning the two embedding spaces without changing either the vision encoder or the language model.

💡 Think of Stage 1 as teaching the projection layer to 'translate' from vision-encoder-speak to LLM-speak. The LLM already knows language; it just needs visual tokens that it can interpret as if they were text tokens carrying visual information.

Stage 2: Visual Instruction Tuning

In the second stage, both the projection layer and the LLM are fine-tuned (the vision encoder typically remains frozen). The training data shifts from simple image-caption pairs to multimodal conversations: an image paired with multi-turn question-answer exchanges about that image. This is where the model learns to be a visual assistant — not just describing what it sees, but answering specific questions, following complex instructions, and reasoning about visual content.
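The division of labour between the two stages can be summarised as a small table of which modules receive gradients (a sketch of the LLaVA recipe; the module names are ours):

```python
# Which modules are updated in each stage of the LLaVA-style recipe.
TRAINABLE = {
    "stage1_alignment":   {"vision_encoder": False, "projection": True, "llm": False},
    "stage2_instruction": {"vision_encoder": False, "projection": True, "llm": True},
}

def trainable_modules(stage):
    """Return the names of the modules fine-tuned in a given stage."""
    return [name for name, on in TRAINABLE[stage].items() if on]
```

For example, `trainable_modules("stage2_instruction")` returns `["projection", "llm"]`: the vision encoder stays frozen throughout, while the LLM joins the training only once the projection already produces sensible visual tokens.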

The loss is the same autoregressive cross-entropy, but now applied to the instruction-following responses:

$$\mathcal{L}_{\text{instruct}} = -\sum_{t \in \mathcal{A}} \log p_\theta(x_t \mid \mathbf{h}_v, x_1, \ldots, x_{t-1})$$

Let's break down each component and see how this differs from the Stage 1 loss.

$\mathcal{A}$: the set of token positions corresponding to the assistant's responses. This is the key difference from Stage 1. We do not compute loss on the user's questions or the image tokens — only on what the model should generate. Why? Because the user's questions are given inputs, not targets. Backpropagating through them would waste gradient signal on reconstructing text the model didn't produce and shouldn't need to predict. By masking the loss to assistant tokens only, every gradient update directly improves the model's response quality.

$\mathbf{h}_v$: shorthand for the full set of projected visual tokens $\mathbf{h}_v^1, \ldots, \mathbf{h}_v^{N_v}$ (the same projected tokens from Stage 1, but now the projection layer continues to be updated alongside the LLM).

$x_1, \ldots, x_{t-1}$: the full conversation so far — system prompt, visual tokens, user question, and any previous turns of dialogue. In a multi-turn conversation, this context grows with each turn, so the model can refer back to earlier exchanges when generating its current response.
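The masking over $\mathcal{A}$ can be sketched in a few lines of plain Python (roles and probabilities here are illustrative, not real model outputs):

```python
import math

def instruct_loss(sequence):
    """L_instruct: cross-entropy masked to assistant-response positions.

    sequence: list of (role, p) pairs, where p stands for the model's
    probability for the ground-truth token at that position. Only tokens
    with role "assistant" contribute to the loss.
    """
    return -sum(math.log(p) for role, p in sequence if role == "assistant")

conv = [
    ("system", 0.9), ("user", 0.5), ("user", 0.4),   # no loss: given inputs
    ("assistant", 0.8), ("assistant", 0.7),          # loss: the model's response
]
loss = instruct_loss(conv)   # only the two assistant tokens contribute
```

In practice this masking is usually implemented by setting the labels of non-assistant positions to an ignore index in the cross-entropy loss, which achieves the same effect.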

This is exactly the same instruction-tuning recipe used for text-only LLMs (like the step from Llama 2 to Llama-2-Chat), extended to include visual context. The model learns to follow instructions that involve images: "Describe what's happening in this photo", "What text appears on the sign?", "Compare the two objects in the image", and so on.

Why two stages instead of training everything end-to-end from the start? Stage 1 solves a simpler problem (caption generation) with frozen LLM weights, so the projection layer can learn the basic vision-to-language mapping without disrupting the LLM's pre-trained capabilities. Stage 2 then fine-tunes the full model on the harder task (following diverse instructions) now that the projection layer already provides reasonable visual tokens. Training everything end-to-end from scratch tends to work less well, because the LLM receives meaningless visual tokens early in training and can learn to ignore them — a form of modality laziness where the language model falls back on its text-only priors and treats the visual input as noise. The two-stage recipe avoids this by ensuring the visual tokens are already informative before the LLM starts adapting to them.

Generating Visual Instruction Data

The training data for Stage 2 is critical — the model's conversational capabilities are bounded by the quality and diversity of its instruction data. A model trained on short, factual QA pairs will produce short, factual answers; a model trained on rich, multi-step reasoning will learn to reason. LLaVA's key insight was using GPT-4 (text-only, at the time) to generate this data synthetically.

The pipeline works as follows:

  • Step 1: Start with COCO images and their human-written captions plus bounding box annotations. Each image comes with a natural language description and coordinates for the objects present in the scene.
  • Step 2: Feed the caption and bounding box coordinates (as text) to GPT-4. The model never sees the actual image — it only receives a textual description of what the image contains and where objects are located.
  • Step 3: Ask GPT-4 to generate diverse question-answer pairs about the image: detailed descriptions, multi-step reasoning, visual comparisons, spatial reasoning questions ("Is the cat to the left or right of the laptop?"), and creative interpretations.
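The steps above can be sketched as a small prompt-building function (the layout is illustrative, not LLaVA's exact prompt format):

```python
def build_context(caption, boxes):
    """Render an image as text for a text-only LLM: caption plus object boxes.

    boxes: list of (label, x1, y1, x2, y2) with normalised coordinates.
    The exact format is illustrative; LLaVA's real prompts differ in detail.
    """
    lines = [f"Caption: {caption}", "Objects:"]
    for label, x1, y1, x2, y2 in boxes:
        lines.append(f"  {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]")
    return "\n".join(lines)

ctx = build_context(
    "A cat sleeping next to a laptop on a desk.",
    [("cat", 0.10, 0.40, 0.45, 0.90), ("laptop", 0.50, 0.35, 0.95, 0.85)],
)
# This text, never the pixels, is what GPT-4 receives to write QA pairs.
```

Because the box coordinates are in the prompt, GPT-4 can answer spatial questions ("Is the cat to the left of the laptop?") purely from the numbers, without ever seeing the image.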

This produces 158K multimodal instruction-following examples — enough to transform the model from a simple captioner into a visual assistant that can handle open-ended questions. The approach is clever because it leverages GPT-4's strong instruction-following capabilities and diverse language generation without GPT-4 ever seeing the actual image. It works entirely from text descriptions of the image content, which means the pipeline can run at scale without needing access to a vision-capable model.

📌 A limitation of this approach: the instruction data quality is bounded by the caption quality. If the caption misses something in the image (e.g., a person in the background, a subtle texture, or the mood of a scene), GPT-4 cannot generate questions about it. Later work like ShareGPT4V uses GPT-4V (which can see images directly) to generate higher-quality instruction data, capturing visual details that captions alone miss.

Evaluation: How Good Are Visual Assistants?

After instruction tuning, we need to measure how well the model actually performs as a visual assistant. Several standard benchmarks test different aspects of multimodal understanding:

  • VQAv2: open-ended visual question answering on natural images (e.g., "What colour is the bus?", "How many people are in the photo?"). Tests basic visual understanding and grounding — can the model correctly perceive and report simple facts about what's in the image?
  • GQA: compositional questions requiring multi-step reasoning about spatial relationships (e.g., "Is the woman who is wearing glasses standing to the left of the man?"). The model must parse complex linguistic structure and ground each part in the image.
  • TextVQA: questions about text appearing in images (e.g., reading signs, labels, documents). Tests OCR-like capabilities — can the model not just recognise objects but also read and interpret written text within the visual scene?
  • POPE: probes for object hallucination — the model is asked whether a specific object is in the image ("Is there a dog in the image?"). A model that frequently says "yes" to non-existent objects scores poorly. This directly tests a central failure mode of VLMs: does the model report what it actually sees, or does it hallucinate objects that are not present?
  • MMBench / MM-Vet: comprehensive benchmarks covering multiple visual reasoning skills — spatial understanding, attribute recognition, logical reasoning, and more — in a single evaluation suite.
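POPE-style scoring boils down to yes/no accuracy plus tracking how often the model answers "yes" (a sketch; the helper name and metric layout are ours):

```python
def pope_scores(preds, golds):
    """Accuracy and yes-ratio over POPE-style 'Is there an X?' questions.

    A hallucinating model answers 'yes' to absent objects, which pushes
    yes_ratio above the gold rate while accuracy drops.
    """
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    yes_ratio = preds.count("yes") / len(preds)
    return acc, yes_ratio

golds = ["yes", "no", "no", "yes", "no", "no"]
hallucinating = ["yes", "yes", "yes", "yes", "yes", "no"]
acc, yr = pope_scores(hallucinating, golds)   # acc = 0.5, yes_ratio ~ 0.83
```

The yes-ratio matters because a model could reach decent accuracy on a balanced split while still showing a strong "yes" bias, which is exactly the hallucination pattern POPE is designed to expose.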

LLaVA-1.5 (Liu et al., 2023) — using a ViT-L/14 vision encoder, a two-layer MLP projection, and Vicuna-13B as the LLM — achieved competitive results across these benchmarks, despite being far simpler than alternatives like BLIP-2 or InstructBLIP that use Q-Former architectures with many more trainable parameters and more complex training procedures. This result helped establish the "simple projection + good instruction data" recipe as the default approach for building VLMs: the quality of the instruction data matters at least as much as the complexity of the architecture.

From VLMs to VLAs

We now have models that can see and speak — they take images and text as input and generate text responses. They can describe scenes, answer questions, read text in images, and follow multimodal instructions. The natural next question is: what if the model could also act?

Vision-Language-Action models (VLAs), covered in the next track, extend VLMs by adding a third output modality: motor actions. Instead of generating text tokens, the model generates action tokens that control a robot — move the arm left, close the gripper, rotate the wrist. The vision encoder and language backbone from a VLM become the perceptual and reasoning foundation, and an action head is added on top to map the model's internal representations to physical movements.

The VLM track has built up the core components that make this possible:

  • Contrastive pre-training (CLIP, SigLIP) for learning a shared vision-language embedding space.
  • Vision Transformers for converting images into token sequences that transformers can process.
  • Fusion architectures (linear projection, Q-Former, gated cross-attention) for connecting vision encoders to LLMs.
  • Instruction tuning for teaching models to follow multimodal commands — the same capability a robot needs when given a natural language instruction like "pick up the red block".

The VLA track will show how these components are repurposed for robotic control — how a model trained to answer "What is in this image?" can be adapted to answer "What action should I take given this image?"

Quiz

Test your understanding of visual instruction tuning — the two-stage training recipe, instruction data generation, and evaluation.

Why does LLaVA's training have two stages instead of training everything end-to-end from the start?

In Stage 2's loss function, why is the loss computed only over assistant response tokens (the set $\mathcal{A}$) rather than the entire sequence?

How did LLaVA generate visual instruction data without a vision-capable model?

What does the POPE benchmark specifically evaluate?