The Embodied AI Challenge
Large Language Models can write poetry, summarise legal briefs, and pass medical exams — but they cannot pick up a cup of coffee. The moment an AI system must physically interact with the world, it faces a fundamentally different problem: converting high-level understanding into precise, real-time motor commands on hardware with noisy sensors and imperfect actuators.
Traditional robotic manipulation pipelines decompose this into rigid stages — perception (detect the cup), planning (compute a collision-free path), and control (send joint torques). Each stage is a separate, hand-engineered module. This works well in structured environments like factory assembly lines, but it is brittle: change the lighting, swap the cup for a bowl, or rephrase the instruction, and the pipeline breaks.
Vision-Language-Action models (VLAs) aim to replace this fragile pipeline with a single neural network that takes raw pixels and a natural-language instruction and directly outputs motor actions.
From VLMs to VLAs
Vision-Language Models (VLMs) like CLIP [1], LLaVA [2], and PaLI [3] already fuse visual and linguistic understanding. VLAs build directly on top of this by adding a third modality: actions.
A VLA has three core components:
- Vision encoder: Converts raw camera images into visual tokens. Common architectures include ViT [4], SigLIP [5], and DINOv2 [6]. Some VLAs use two encoders — one for semantic understanding and another for spatial/geometric features.
- Language backbone: A pre-trained LLM (e.g., Llama 2 [7], PaLM, Gemma) that processes visual tokens concatenated with the language instruction.
- Action head: The component that produces motor commands. This is where VLAs diverge from VLMs — it can be as simple as decoding action tokens from the LLM vocabulary, or as complex as a diffusion model generating continuous action trajectories.
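To make the three-component pipeline concrete, here is a minimal sketch of the data flow using random stand-in functions. The shapes, dimensions, and function names are illustrative assumptions, not any specific model's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image: np.ndarray) -> np.ndarray:
    """Image -> sequence of visual tokens (here: 16 tokens of dim 64)."""
    return rng.normal(size=(16, 64))  # stand-in for a ViT-style encoder

def language_backbone(visual_tokens: np.ndarray, instruction: str) -> np.ndarray:
    """Fuse visual tokens with the instruction into a single hidden state."""
    text_feature = rng.normal(size=64)  # stand-in for a text embedding
    return visual_tokens.mean(axis=0) + text_feature

def action_head(hidden: np.ndarray) -> np.ndarray:
    """Project the fused hidden state to a 7-DoF action vector."""
    W = rng.normal(size=(7, 64)) / 8.0  # untrained projection, for shape only
    return W @ hidden

image = np.zeros((224, 224, 3), dtype=np.uint8)  # one camera frame
action = action_head(language_backbone(vision_encoder(image), "pick up the cup"))
assert action.shape == (7,)
```

In a real VLA the first two stages are initialised from a pre-trained VLM and only the mapping to actions is learned from robot data.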
The vision encoder and language backbone can be initialised from a pre-trained VLM, inheriting its visual and linguistic understanding. The model then only needs to learn the mapping from that understanding to motor actions.
Formally, a VLA learns a policy $\pi$ mapping an observation $o_t$ and a language instruction $\ell$ to an action $a_t$:

$$a_t = \pi(o_t, \ell)$$
where $a_t \in \mathbb{R}^d$ is typically a $d$-dimensional vector of end-effector deltas ($\Delta x, \Delta y, \Delta z$, rotation, gripper). For a 7-DoF robot arm, $d = 7$.
Key Datasets
Language models benefit from trillions of tokens scraped from the web. Robot data must be collected in the physical world — every trajectory requires a real (or simulated) robot, making data collection orders of magnitude more expensive.
- Open X-Embodiment (OXE) [8]: A collaborative dataset from 20+ research institutions aggregating ~970K robot trajectories across 22 embodiments (arms, hands, mobile manipulators). The primary training source for OpenVLA and RT-2-X.
- RT-1 Dataset [9]: ~130K episodes of table-top manipulation (picking, placing, opening drawers) collected by Google on a fleet of Everyday Robots, annotated with language instructions.
- Bridge V2 [10]: ~60K demonstrations on a WidowX arm across diverse kitchen environments with crowd-sourced language annotations.
- DROID [11]: ~76K episodes across 564 unique scenes with a Franka Emika Panda arm, with wrist and external cameras.
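Datasets like these are typically stored as per-episode records pairing synchronized camera frames, a language annotation, and the executed actions. The following dataclass is an illustrative schema only, assuming 7-DoF actions, and is not the actual on-disk format these datasets use:

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Step:
    """One timestep of a demonstration (hypothetical field names)."""
    image: np.ndarray       # (H, W, 3) camera frame
    instruction: str        # language annotation, e.g. "open the drawer"
    action: np.ndarray      # 7-DoF end-effector delta + gripper command

@dataclass
class Trajectory:
    """One complete demonstration episode."""
    steps: list[Step] = field(default_factory=list)

traj = Trajectory([
    Step(np.zeros((224, 224, 3), np.uint8), "open the drawer", np.zeros(7))
    for _ in range(3)
])
assert len(traj.steps) == 3
```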
The Action Representation Problem
The most consequential design decision in a VLA is how to represent actions. Robot actions are inherently continuous — a joint velocity of 0.347 rad/s is different from 0.348 rad/s. But LLMs produce discrete tokens. There are two main approaches:
- Discrete tokenization (autoregressive): Bin each continuous action dimension into $K$ discrete buckets (e.g., $K = 256$) and treat each bin as a token. The LLM generates action tokens one at a time. Used by RT-2 [12] and OpenVLA [13]. Pros: reuses LLM architecture directly. Cons: loses precision, assumes uni-modal distributions, sequential decoding is slow.
- Continuous generation (diffusion/flow): A separate action head generates continuous actions via a denoising process (Diffusion Policy [14]) or a learned velocity field (flow matching [15]). Pros: captures multi-modal distributions, preserves full precision. Cons: requires architectural modifications.
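The discrete-tokenization scheme can be sketched in a few lines. The bin count, normalized action range, and bin-center decoding below are common choices but are assumptions rather than any specific model's exact recipe:

```python
import numpy as np

K = 256               # number of bins per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(a: np.ndarray) -> np.ndarray:
    """Map continuous actions in [LOW, HIGH] to integer bin indices 0..K-1."""
    clipped = np.clip(a, LOW, HIGH)
    bins = ((clipped - LOW) / (HIGH - LOW) * K).astype(int)
    return np.minimum(bins, K - 1)  # the upper boundary maps to the last bin

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map bin indices back to the center of each bin."""
    return LOW + (tokens + 0.5) * (HIGH - LOW) / K

a = np.array([0.347, -0.5, 0.0, 0.9, -0.2, 0.1, 1.0])  # one 7-DoF action
a_hat = detokenize(tokenize(a))
# Round-trip error is bounded by half a bin width: (HIGH - LOW) / (2 * K)
assert np.all(np.abs(a - a_hat) <= (HIGH - LOW) / (2 * K) + 1e-9)
```

This makes the precision loss explicit: with $K = 256$ over a range of 2, no single action dimension can be reproduced more finely than about 0.004, so values like 0.347 and 0.348 rad/s collapse into the same token.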
Consider "pick up the mug." There may be several equally valid grasps: from the left, from the right, or from above. A discrete autoregressive model must commit to a single grasp as soon as it samples the first action token, whereas a diffusion-based model can represent the full distribution of valid grasps. This distinction is critical for dexterous tasks.
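A quick numerical illustration of the uni-modal failure mode mentioned above: when demonstrations contain two symmetric grasp angles, a model trained with a mean-squared-error objective regresses toward their average, which matches neither demonstration. The angles below are hypothetical:

```python
import numpy as np

# Two equally valid grasp yaw angles (radians) for the same mug:
# half the demonstrations approach from the left, half from the right.
demos = np.array([-1.2, -1.2, 1.2, 1.2])

# The MSE-optimal point prediction is the mean of the demonstrations,
# which falls between the two modes.
mse_prediction = demos.mean()  # 0.0

# The averaged action is far from every demonstrated grasp:
assert np.all(np.abs(mse_prediction - demos) > 1.0)
```

A model that represents the full action distribution (e.g., via diffusion) can instead sample either mode, avoiding this averaging artifact.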
The following articles cover both families: the autoregressive approach (RT-2 and OpenVLA), the continuous approach (Diffusion Policy and π₀ [16]), and a final comparison.
Quiz
Test your understanding of VLA fundamentals.
What is the main limitation of traditional robotic manipulation pipelines?
What does a VLA's policy π map from and to?
Approximately how many robot trajectories does the Open X-Embodiment dataset contain?
Which component fundamentally differentiates a VLA from a VLM?
Why might a discrete autoregressive action model struggle with a task that has multiple valid solutions (e.g., grasping a mug from different angles)?