The Embodied AI Challenge
Large Language Models can write poetry, summarise legal briefs, and pass medical exams — but they cannot pick up a cup of coffee. The moment an AI system must physically interact with the world, it faces a fundamentally different problem: converting high-level understanding into precise, real-time motor commands on hardware with noisy sensors and imperfect actuators.
Traditional robotic manipulation pipelines decompose this into rigid stages — perception (detect the cup), planning (compute a collision-free path), and control (send joint torques). Each stage is a separate, hand-engineered module. This works well in structured environments like factory assembly lines, but it is brittle: change the lighting, swap the cup for a bowl, or rephrase the instruction, and the pipeline breaks.
Vision-Language-Action models (VLAs) aim to replace this fragile pipeline with a single neural network that takes raw pixels and a natural-language instruction and directly outputs motor actions.
From VLMs to VLAs
Vision-Language Models (VLMs) like CLIP [1], LLaVA [2], and PaLI [3] already fuse visual and linguistic understanding. VLAs build directly on top of this by adding a third modality: actions.
A VLA has three core components:
- Vision encoder: Converts raw camera images into visual tokens. Common architectures include ViT [4], SigLIP [5], and DINOv2 [6]. Some VLAs use two encoders — one for semantic understanding and another for spatial/geometric features.
- Language backbone: A pre-trained LLM (e.g., Llama 2 [7], PaLM, Gemma) that processes visual tokens concatenated with the language instruction.
- Action head: The component that produces motor commands. This is where VLAs diverge from VLMs — it can be as simple as decoding action tokens from the LLM vocabulary, or as complex as a diffusion model generating continuous action trajectories.
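To make the three-component pipeline concrete, here is a minimal sketch of the data flow using random stand-in functions. The shapes, dimensions, and function names are illustrative assumptions, not any specific model's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Image -> sequence of visual tokens (here: 16 tokens of dim 64)."""
    return rng.normal(size=(16, 64))  # stand-in for a ViT-style encoder

def language_backbone(visual_tokens: np.ndarray, instruction: str) -> np.ndarray:
    """Fuse visual tokens with the instruction into a single hidden state."""
    text_feature = rng.normal(size=64)  # stand-in for a text embedding
    return visual_tokens.mean(axis=0) + text_feature

def action_head(hidden: np.ndarray) -> np.ndarray:
    """Project the fused hidden state to a 7-DoF action vector."""
    W = rng.normal(size=(7, 64)) / 8.0  # untrained projection, for shape only
    return W @ hidden

image = np.zeros((224, 224, 3), dtype=np.uint8)  # one camera frame
action = action_head(language_backbone(vision_encoder(image), "pick up the cup"))
assert action.shape == (7,)
```

In a real VLA the first two stages are initialised from a pre-trained VLM and only the mapping to actions is learned from robot data.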
The vision encoder and language backbone can be initialised from a pre-trained VLM, inheriting its visual and linguistic understanding. The model then only needs to learn the mapping from that understanding to motor actions.
Formally, a VLA learns a policy $\pi$ mapping an observation $o_t$ and a language instruction $\ell$ to an action $a_t$:

$$a_t = \pi(o_t, \ell)$$
where $a_t \in \mathbb{R}^d$ is typically a $d$-dimensional vector of end-effector deltas ($\Delta x, \Delta y, \Delta z$, rotation, gripper). For a 7-DoF robot arm, $d = 7$.
Key Datasets
Language models benefit from trillions of tokens scraped from the web. Robot data must be collected in the physical world — every trajectory requires a real (or simulated) robot, making data collection orders of magnitude more expensive.
- Open X-Embodiment (OXE) [8]: A collaborative dataset from 20+ research institutions aggregating ~970K robot trajectories across 22 embodiments (arms, hands, mobile manipulators). The primary training source for OpenVLA and RT-2-X.
- RT-1 Dataset [9]: ~130K episodes of table-top manipulation (picking, placing, opening drawers) collected by Google on a fleet of Everyday Robots, annotated with language instructions.
- Bridge V2 [10]: ~60K demonstrations on a WidowX arm across diverse kitchen environments with crowd-sourced language annotations.
- DROID [11]: ~76K episodes across 564 unique scenes with a Franka Emika Panda arm, with wrist and external cameras.
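Datasets like these are typically stored as per-episode records pairing synchronized camera frames, a language annotation, and the executed actions. The following dataclass is an illustrative schema only, assuming 7-DoF actions, and is not the actual on-disk format these datasets use:

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Step:
    """One timestep of a demonstration (hypothetical field names)."""
    image: np.ndarray       # (H, W, 3) camera frame
    instruction: str        # language annotation, e.g. "open the drawer"
    action: np.ndarray      # 7-DoF end-effector delta + gripper command

@dataclass
class Trajectory:
    """One complete demonstration episode."""
    steps: list[Step] = field(default_factory=list)

traj = Trajectory([
    Step(np.zeros((224, 224, 3), np.uint8), "open the drawer", np.zeros(7))
    for _ in range(3)
])
assert len(traj.steps) == 3
```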
The Action Representation Problem
The most consequential design decision in a VLA is how to represent actions. Robot actions are inherently continuous — a joint velocity of 0.347 rad/s is different from 0.348 rad/s. But LLMs produce discrete tokens. There are two main approaches:
- Discrete tokenization (autoregressive): Bin each continuous action dimension into $K$ discrete buckets (e.g., $K = 256$) and treat each bin as a token. The LLM generates action tokens one at a time. Used by RT-2 [12] and OpenVLA [13]. Pros: reuses LLM architecture directly. Cons: loses precision, assumes uni-modal distributions, sequential decoding is slow.
- Continuous generation (diffusion/flow): A separate action head generates continuous actions via a denoising process (Diffusion Policy [14]) or a learned velocity field (flow matching [15]). Pros: captures multi-modal distributions, preserves full precision. Cons: requires architectural modifications.
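The discrete-tokenization scheme can be sketched in a few lines. The bin count, normalized action range, and bin-center decoding below are common choices but are assumptions rather than any specific model's exact recipe:

```python
import numpy as np

K = 256               # number of bins per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(a: np.ndarray) -> np.ndarray:
    """Map continuous actions in [LOW, HIGH] to integer bin indices 0..K-1."""
    clipped = np.clip(a, LOW, HIGH)
    bins = ((clipped - LOW) / (HIGH - LOW) * K).astype(int)
    return np.minimum(bins, K - 1)  # the upper boundary maps to the last bin

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map bin indices back to the center of each bin."""
    return LOW + (tokens + 0.5) * (HIGH - LOW) / K

a = np.array([0.347, -0.5, 0.0, 0.9, -0.2, 0.1, 1.0])  # one 7-DoF action
a_hat = detokenize(tokenize(a))
# Round-trip error is bounded by half a bin width: (HIGH - LOW) / (2 * K)
assert np.all(np.abs(a - a_hat) <= (HIGH - LOW) / (2 * K) + 1e-9)
```

This makes the precision loss explicit: with $K = 256$ over a range of 2, no single action dimension can be reproduced more finely than about 0.004, so values like 0.347 and 0.348 rad/s collapse into the same token.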
Consider "pick up the mug." There may be several equally valid grasps: from the left, from the right, or from above. A discrete autoregressive model must commit to a single grasp as soon as it samples the first action token, whereas a diffusion-based model can represent the full distribution of valid grasps. This distinction is critical for dexterous tasks.
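A quick numerical illustration of the uni-modal failure mode mentioned above: when demonstrations contain two symmetric grasp angles, a model trained with a mean-squared-error objective regresses toward their average, which matches neither demonstration. The angles below are hypothetical:

```python
import numpy as np

# Two equally valid grasp yaw angles (radians) for the same mug:
# half the demonstrations approach from the left, half from the right.
demos = np.array([-1.2, -1.2, 1.2, 1.2])

# The MSE-optimal point prediction is the mean of the demonstrations,
# which falls between the two modes.
mse_prediction = demos.mean()  # 0.0

# The averaged action is far from every demonstrated grasp:
assert np.all(np.abs(mse_prediction - demos) > 1.0)
```

A model that represents the full action distribution (e.g., via diffusion) can instead sample either mode, avoiding this averaging artifact.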
The following articles cover both families: the autoregressive approach (RT-2 and OpenVLA), the continuous approach (Diffusion Policy and π₀ [16]), and a final comparison.
Quiz
Test your understanding of VLA fundamentals.
What is the main limitation of traditional robotic manipulation pipelines?
What does a VLA's policy π map from and to?
Approximately how many robot trajectories does the Open X-Embodiment dataset contain?
Which component fundamentally differentiates a VLA from a VLM?
Why might a discrete autoregressive action model struggle with a task that has multiple valid solutions (e.g., grasping a mug from different angles)?