The Embodied AI Challenge
Large Language Models can write poetry, summarise legal briefs, and pass medical exams — but they cannot pick up a cup of coffee. The moment an AI system must physically interact with the world, it faces a fundamentally different problem: converting high-level understanding into precise, real-time motor commands — on hardware with noisy sensors and imperfect actuators.
Traditional robotic manipulation pipelines decompose this into rigid stages — perception (detect the cup), planning (compute a collision-free path), and control (send joint torques). Each stage is a separate, hand-engineered module. This works well in structured environments like factory assembly lines, but it tends to be brittle: change the lighting, swap the cup for a bowl, or rephrase the instruction, and the pipeline often breaks.
Vision-Language-Action models (VLAs) aim to replace this fragile pipeline with a single neural network that takes raw pixels and a natural-language instruction and directly outputs motor actions.
From VLMs to VLAs
Vision-Language Models (VLMs) like CLIP (Radford et al., 2021), LLaVA (Liu et al., 2023), and PaLI (Chen et al., 2022) already fuse visual and linguistic understanding. VLAs build directly on top of this by adding a third modality: actions.
A VLA has three core components:
- Vision encoder: Converts raw camera images into visual tokens. Common architectures include ViT (Dosovitskiy et al., 2021), SigLIP (Zhai et al., 2023), and DINOv2 (Oquab et al., 2024). Some VLAs use two encoders — one for semantic understanding and another for spatial/geometric features.
- Language backbone: A pre-trained LLM (e.g., Llama 2 (Touvron et al., 2023), PaLM, Gemma) that processes visual tokens concatenated with the language instruction.
- Action head: The component that produces motor commands. This is where VLAs diverge from VLMs — it can be as simple as decoding action tokens from the LLM vocabulary, or as complex as a diffusion model generating continuous action trajectories.
The vision encoder and language backbone can be initialised from a pre-trained VLM, inheriting its visual and linguistic understanding. The model then only needs to learn the mapping from that understanding to motor actions.
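To make these pieces concrete, here is a minimal PyTorch-style sketch of how they fit together. The class name, module names, and shapes are illustrative assumptions rather than any specific model's implementation, and the action head is deliberately reduced to a single linear layer (real VLAs use token decoding or diffusion heads, as discussed below):

```python
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    """Minimal sketch of the three VLA components. Names and shapes are illustrative."""

    def __init__(self, vision_encoder, language_backbone, hidden_dim=4096, action_dim=7):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a ViT/SigLIP from a pre-trained VLM
        self.language_backbone = language_backbone  # e.g. a pre-trained LLM (Llama 2, Gemma, ...)
        self.action_head = nn.Linear(hidden_dim, action_dim)  # simplest possible action head

    def forward(self, image, instruction_embeddings):
        # image -> visual tokens, shape (batch, n_image_tokens, hidden_dim)
        visual_tokens = self.vision_encoder(image)
        # concatenate visual tokens with the embedded language instruction
        tokens = torch.cat([visual_tokens, instruction_embeddings], dim=1)
        features = self.language_backbone(tokens)   # (batch, n_tokens, hidden_dim)
        # read the action off the final token position: a_t ∈ ℝ⁷
        return self.action_head(features[:, -1])
```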
Formally, a VLA learns a policy $\pi$ mapping an observation $o_t$ and language instruction $\ell$ to an action $a_t$:

$$a_t = \pi(o_t, \ell)$$
where $a_t \in \mathbb{R}^d$ is typically a $d$-dimensional vector of end-effector deltas ($\Delta x, \Delta y, \Delta z$, rotation, gripper). For a 7-DoF robot arm, $d = 7$. But what does that actually mean in practice? The next section breaks it down.
What Does a Robot Action Look Like?
To move a robot arm, we need to tell it exactly how to change its position in 3D space. The most common representation is end-effector control: instead of specifying individual joint angles (which vary between robot models), we specify how the tip of the arm — the end effector, typically a gripper — should move. This makes the action space consistent across different robot hardware.
A standard 7-DoF (7 degrees of freedom) action vector decomposes into three groups:
# A single action vector for a 7-DoF robot arm
# ┌────────────────────────────────────────────────────────────────┐
# │  Action vector: a_t ∈ ℝ⁷                                       │
# ├────────────────────────────────────────────────────────────────┤
# │                                                                │
# │  POSITION (where to move)                       3 dimensions   │
# │  ├── Δx : move left/right          e.g. +0.02 m (2 cm right)   │
# │  ├── Δy : move forward/backward    e.g. -0.01 m (1 cm back)    │
# │  └── Δz : move up/down             e.g. +0.05 m (5 cm up)      │
# │                                                                │
# │  ROTATION (how to tilt/turn)                    3 dimensions   │
# │  ├── Δroll  : rotate around x      e.g. +0.0 rad (no change)   │
# │  ├── Δpitch : rotate around y      e.g. -0.1 rad (tilt down)   │
# │  └── Δyaw   : rotate around z      e.g. +0.0 rad (no change)   │
# │                                                                │
# │  GRIPPER (open or close)                        1 dimension    │
# │  └── grip : gripper state          e.g. +1.0 (close gripper)   │
# │                                                                │
# └────────────────────────────────────────────────────────────────┘
#
# Example: "pick up the mug" might produce this sequence of actions:
#
# Step 1: Move above the mug   [+0.05, +0.10, +0.00, 0, 0, 0, -1]   (approach, open gripper)
# Step 2: Lower to the mug     [+0.00, +0.00, -0.08, 0, 0, 0, -1]   (descend, still open)
# Step 3: Close gripper        [+0.00, +0.00, +0.00, 0, 0, 0, +1]   (grasp)
# Step 4: Lift                 [+0.00, +0.00, +0.12, 0, 0, 0, +1]   (lift, gripper closed)
Each value is a delta — a change relative to the current position, not an absolute coordinate. This is important: the model doesn't need to know where the arm is in the room, only how much to move it from wherever it currently is. Deltas are typically small (a few centimetres per step), and the robot executes actions at 5–10 Hz, so smooth motion emerges from many small steps.
The gripper dimension is often binary in practice (open or closed), but it's represented as a continuous value so it fits the same framework as the position and rotation dimensions. Values near $-1$ typically mean "open" and values near $+1$ mean "close" (conventions vary between datasets, which is one of the challenges covered in article 3).
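As a rough sketch of how such a delta is consumed at runtime, the helper below applies one action to the current end-effector pose. The function, the clipping threshold, and the robot/camera calls in the comments are hypothetical, and the Euler-angle update is a simplification of proper rotation composition:

```python
import numpy as np

def apply_action(pose, action, max_step=0.05):
    """Apply one 7-D delta action to an end-effector pose [x, y, z, roll, pitch, yaw, grip].

    Illustrative only: real controllers compose rotations properly (quaternions /
    rotation matrices) and enforce workspace and joint limits.
    """
    new_pose = pose.copy()
    new_pose[:3] += np.clip(action[:3], -max_step, max_step)  # position deltas, metres
    new_pose[3:6] += action[3:6]                               # naive Euler update, radians
    new_pose[6] = 1.0 if action[6] > 0 else -1.0               # binarise the gripper command
    return new_pose

# Simplified control loop at ~5 Hz (robot/camera/policy APIs are placeholders):
# pose = robot.get_end_effector_pose()
# for _ in range(max_steps):
#     action = policy(camera.read(), "pick up the mug")
#     pose = apply_action(pose, action)
#     robot.move_to(pose)
#     time.sleep(0.2)
```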
Key Datasets
Language models benefit from trillions of tokens scraped from the web. Robot data must be collected in the physical world — every trajectory requires a real (or simulated) robot, making data collection orders of magnitude more expensive.
- Open X-Embodiment (OXE) (Open X-Embodiment et al., 2024): A collaborative dataset from 20+ research institutions aggregating ~970K robot trajectories across 22 embodiments (arms, hands, mobile manipulators). The primary training source for OpenVLA and RT-2-X.
- RT-1 Dataset (Brohan et al., 2022): ~130K episodes of table-top manipulation (picking, placing, opening drawers) collected by Google on a fleet of Everyday Robots, annotated with language instructions.
- Bridge V2 (Walke et al., 2023): ~60K demonstrations on a WidowX arm across diverse kitchen environments with crowd-sourced language annotations.
- DROID (Khazatsky et al., 2024): ~76K episodes across 564 unique scenes collected with a Franka Emika Panda arm, recorded from both wrist-mounted and external cameras.
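Concretely, each trajectory (or episode) in these datasets is a language-annotated sequence of observation–action steps. The structure below is a simplified schematic with invented contents, not the exact schema of any of the datasets above (most ship in RLDS/TFDS format with dataset-specific keys):

```python
# One schematic episode: a few dozen steps of synchronised observations and actions.
episode = {
    "language_instruction": "put the spoon in the drying rack",
    "steps": [
        {
            "observation": {
                "image": "<H x W x 3 RGB frame>",           # external and/or wrist camera
                "state": "<proprioception, e.g. joint angles>",
            },
            "action": [0.01, -0.02, 0.00, 0.0, 0.0, 0.1, -1.0],  # 7-DoF delta, as above
        },
        # ... one entry per control step, typically at 5-10 Hz ...
    ],
}
```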
The Action Representation Problem
The most consequential design decision in a VLA is how to represent actions. Robot actions are inherently continuous — a joint velocity of 0.347 rad/s is different from 0.348 rad/s. But LLMs produce discrete tokens. There are two main approaches:
- Discrete tokenization (autoregressive): Bin each continuous action dimension into $K$ discrete buckets (e.g., $K = 256$) and treat each bin as a token (a minimal binning sketch follows this list). The LLM generates action tokens one at a time. Used by RT-2 (Brohan et al., 2023) and OpenVLA (Kim et al., 2024). Pros: reuses LLM architecture directly. Cons: loses precision, assumes uni-modal distributions, sequential decoding can be slow.
- Continuous generation (diffusion/flow): A separate action head generates continuous actions via a denoising process (Diffusion Policy (Chi et al., 2024)) or a learned velocity field (flow matching (Lipman et al., 2023)). Pros: captures multi-modal distributions, preserves full precision. Cons: requires architectural modifications.
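As promised above, here is a minimal sketch of the binning step. The $[-1, 1]$ normalisation range, the bin count, and the round-trip below are illustrative assumptions, not the exact scheme used by RT-2 or OpenVLA:

```python
import numpy as np

K = 256                  # number of bins per action dimension
LOW, HIGH = -1.0, 1.0    # assume each action dimension is normalised to [-1, 1]

def actions_to_tokens(action):
    """Map a continuous action vector to one integer token ID per dimension."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (K - 1)).astype(int)

def tokens_to_actions(tokens):
    """Invert the binning; precision is limited by the bin width (2/255 ≈ 0.008)."""
    return tokens / (K - 1) * (HIGH - LOW) + LOW

a = np.array([0.02, -0.01, 0.05, 0.0, -0.1, 0.0, 1.0])   # a 7-DoF action
tokens = actions_to_tokens(a)          # -> [130, 126, 134, 128, 115, 128, 255]
recovered = tokens_to_actions(tokens)  # close to `a`, but with quantisation error
```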
Consider "pick up the mug." There may be multiple valid grasps — from the left, right, or above. A discrete autoregressive model tends to commit to a single grasp strategy from the first token. A diffusion-based model can represent the full distribution of valid grasps. This distinction becomes particularly important for dexterous tasks.
The following articles cover both families: the autoregressive approach (RT-2 and OpenVLA), the continuous approach (Diffusion Policy and π₀ (Black et al., 2024)), and a final comparison.
Quiz
Test your understanding of VLA fundamentals.
- What is the main limitation of traditional robotic manipulation pipelines?
- What does a VLA's policy π map from and to?
- Approximately how many robot trajectories does the Open X-Embodiment dataset contain?
- Which component fundamentally differentiates a VLA from a VLM?
- Why might a discrete autoregressive action model struggle with a task that has multiple valid solutions (e.g., grasping a mug from different angles)?