The Embodied AI Challenge

Large Language Models can write poetry, summarise legal briefs, and pass medical exams — but they cannot pick up a cup of coffee. The moment an AI system must physically interact with the world, it faces a fundamentally different problem: converting high-level understanding into precise, real-time motor commands on hardware with noisy sensors and imperfect actuators.

Traditional robotic manipulation pipelines decompose this into rigid stages — perception (detect the cup), planning (compute a collision-free path), and control (send joint torques). Each stage is a separate, hand-engineered module. This works well in structured environments like factory assembly lines, but it is brittle: change the lighting, swap the cup for a bowl, or rephrase the instruction, and the pipeline breaks.

Vision-Language-Action models (VLAs) aim to replace this fragile pipeline with a single neural network that takes raw pixels and a natural-language instruction and directly outputs motor actions.

💡 Think of a VLA as a foundation model for robotics: just as GPT generalises across text tasks, a VLA aims to generalise across robot tasks, embodiments, and environments.

From VLMs to VLAs

Vision-Language Models (VLMs) like CLIP [1], LLaVA [2], and PaLI [3] already fuse visual and linguistic understanding. VLAs build directly on top of this by adding a third modality: actions.

A VLA has three core components:

  • Vision encoder: Converts raw camera images into visual tokens. Common architectures include ViT [4], SigLIP [5], and DINOv2 [6]. Some VLAs use two encoders — one for semantic understanding and another for spatial/geometric features.
  • Language backbone: A pre-trained LLM (e.g., Llama 2 [7], PaLM, Gemma) that processes visual tokens concatenated with the language instruction.
  • Action head: The component that produces motor commands. This is where VLAs diverge from VLMs — it can be as simple as decoding action tokens from the LLM vocabulary, or as complex as a diffusion model generating continuous action trajectories.

The vision encoder and language backbone can be initialised from a pre-trained VLM, inheriting its visual and linguistic understanding. The model then only needs to learn the mapping from that understanding to motor actions.
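To make the three-component structure concrete, here is a minimal NumPy sketch of a single forward pass. All shapes, widths, and the mean-pooling action head are illustrative assumptions for this sketch, not any particular model's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64          # shared token embedding width (illustrative)
N_VIS = 256     # visual tokens from the vision encoder (e.g., a 16x16 ViT patch grid)
N_LANG = 12     # tokens in the language instruction
ACTION_DIM = 7  # end-effector deltas + rotation + gripper

def vision_encoder(image):
    # Stand-in for a ViT/SigLIP-style encoder: image -> (N_VIS, D) visual tokens.
    return rng.standard_normal((N_VIS, D))

def language_backbone(tokens):
    # Stand-in for the LLM: contextualises the concatenated token sequence.
    return tokens @ rng.standard_normal((D, D)) / np.sqrt(D)

def action_head(hidden):
    # Simplest possible head: pool the sequence, project to one action vector.
    pooled = hidden.mean(axis=0)
    return pooled @ rng.standard_normal((D, ACTION_DIM)) / np.sqrt(D)

image = rng.standard_normal((224, 224, 3))
instruction = rng.standard_normal((N_LANG, D))  # pretend already-embedded text

visual_tokens = vision_encoder(image)
sequence = np.concatenate([visual_tokens, instruction], axis=0)  # (268, D)
action = action_head(language_backbone(sequence))
print(action.shape)  # (7,)
```

The key structural point is the concatenation step: visual tokens and language tokens share one sequence, so the backbone attends across both modalities before the action head reads out a motor command.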

Formally, a VLA learns a policy $\pi$ mapping an observation $o_t$ and language instruction $\ell$ to an action $a_t$:

$$a_t = \pi(o_t, \ell)$$

where $a_t \in \mathbb{R}^d$ is typically a $d$-dimensional vector of end-effector deltas ($\Delta x, \Delta y, \Delta z$, rotation, gripper). For a 7-DoF robot arm, $d = 7$.
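For a 7-DoF arm, one such action vector could look like the following. The specific values and the gripper convention are illustrative assumptions; units and sign conventions vary across datasets:

```python
import numpy as np

# One illustrative 7-D action: end-effector position deltas (metres),
# rotation deltas (radians), and a gripper command.
a_t = np.array([
    0.02,   # Δx
    -0.01,  # Δy
    0.00,   # Δz
    0.00,   # Δroll
    0.10,   # Δpitch
    0.00,   # Δyaw
    1.00,   # gripper (here 1 = close, 0 = open; conventions differ per dataset)
])
print(a_t.shape)  # (7,)
```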

Key Datasets

Language models benefit from trillions of tokens scraped from the web. Robot data must be collected in the physical world — every trajectory requires a real (or simulated) robot, making data collection orders of magnitude more expensive.

  • Open X-Embodiment (OXE) [8]: A collaborative dataset from 20+ research institutions aggregating ~970K robot trajectories across 22 embodiments (arms, hands, mobile manipulators). The primary training source for OpenVLA and RT-2-X.
  • RT-1 Dataset [9]: ~130K episodes of table-top manipulation (picking, placing, opening drawers) collected by Google on a fleet of Everyday Robots, annotated with language instructions.
  • Bridge V2 [10]: ~60K demonstrations on a WidowX arm across diverse kitchen environments with crowd-sourced language annotations.
  • DROID [11]: ~76K episodes across 564 unique scenes with a Franka Emika Panda arm, with wrist and external cameras.

📌 For context: GPT-3 was trained on ~500 billion tokens. Open X-Embodiment, the largest robot dataset, contains ~970K trajectories. The total "tokens" of robot experience are many orders of magnitude smaller than what language models see. This data scarcity is the central bottleneck in robot learning.

The Action Representation Problem

The most consequential design decision in a VLA is how to represent actions. Robot actions are inherently continuous — a joint velocity of 0.347 rad/s is different from 0.348 rad/s. But LLMs produce discrete tokens. There are two main approaches:

  • Discrete tokenization (autoregressive): Bin each continuous action dimension into $K$ discrete buckets (e.g., $K = 256$) and treat each bin as a token. The LLM generates action tokens one at a time. Used by RT-2 [12] and OpenVLA [13]. Pros: reuses LLM architecture directly. Cons: loses precision, assumes uni-modal distributions, sequential decoding is slow.
  • Continuous generation (diffusion/flow): A separate action head generates continuous actions via a denoising process (Diffusion Policy [14]) or a learned velocity field (flow matching [15]). Pros: captures multi-modal distributions, preserves full precision. Cons: requires architectural modifications.
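A minimal sketch of the discrete-tokenization round trip, assuming actions are normalised to $[-1, 1]$ and binned uniformly into $K = 256$ buckets (uniform binning is a simplification; real systems may bin according to per-dimension statistics):

```python
import numpy as np

K = 256                # bins per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalised action range per dimension

def tokenize(action):
    """Map each continuous action dimension to one of K bin indices (tokens)."""
    clipped = np.clip(action, LOW, HIGH)
    return ((clipped - LOW) / (HIGH - LOW) * (K - 1)).round().astype(int)

def detokenize(tokens):
    """Map bin indices back to the continuous values at the bin centres."""
    return tokens / (K - 1) * (HIGH - LOW) + LOW

a = np.array([0.347, -0.12, 0.0, 0.5, -0.9, 0.25, 1.0])
tokens = tokenize(a)
a_rec = detokenize(tokens)

# Quantisation error is bounded by half a bin width: (HIGH - LOW) / (K - 1) / 2
print(np.max(np.abs(a - a_rec)))
```

Note the precision loss mentioned above: with a bin width of about 0.008, the values 0.347 and 0.348 fall into the same bin and become indistinguishable after tokenization.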

Consider "pick up the mug." There may be multiple valid grasps — from the left, right, or above. A discrete autoregressive model must commit to a single grasp from the very first action token it emits. A diffusion-based model can represent the full distribution of valid grasps. This distinction is critical for dexterous tasks.
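The uni-modal limitation can be seen in a toy calculation. Suppose a dataset contains two equally valid demonstrations (the coordinates below are made up for illustration); a model that regresses a single action under a squared-error loss is pulled toward their mean, which matches neither demonstration:

```python
import numpy as np

# Two equally valid grasp approaches for "pick up the mug" (illustrative):
grasp_left = np.array([-0.10, 0.0, 0.05])   # approach from the left
grasp_right = np.array([0.10, 0.0, 0.05])   # approach from the right

# The minimiser of mean squared error over both demonstrations is their mean:
mean_action = (grasp_left + grasp_right) / 2
print(mean_action)  # aims between the two grasps, matching neither
```

A generative action head that models the full distribution can instead sample either grasp, which is why multi-modality matters for manipulation.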

The following articles cover both families: the autoregressive approach (RT-2 and OpenVLA), the continuous approach (Diffusion Policy and π₀ [16]), and a final comparison.

Quiz

Test your understanding of VLA fundamentals.

What is the main limitation of traditional robotic manipulation pipelines?

What does a VLA's policy π map from and to?

Approximately how many robot trajectories does the Open X-Embodiment dataset contain?

Which component fundamentally differentiates a VLA from a VLM?

Why might a discrete autoregressive action model struggle with a task that has multiple valid solutions (e.g., grasping a mug from different angles)?