Architecture

OpenVLA [1] is the first open-source, 7-billion-parameter VLA trained to be a generalist robot policy. While RT-2 demonstrated the VLA concept, its weights were never released and it required Google-scale infrastructure. OpenVLA democratises the approach: its code, weights, and training recipe are fully open.

The model is built on the Prismatic VLM [2] framework, which features a carefully designed dual-encoder vision system:

  • SigLIP (ViT-SO400M) [3]: A contrastive vision-language encoder trained with a sigmoid loss. SigLIP excels at semantic understanding — recognising objects, understanding scenes, and grounding language to visual elements.
  • DINOv2 (ViT-L) [4]: A self-supervised vision encoder trained with a self-distillation objective (no language). DINOv2 captures fine-grained spatial and geometric features — edges, textures, depth cues, and object boundaries.

Why two encoders? Robotic manipulation requires both understanding what an object is (semantic — SigLIP) and where exactly it is and what it looks like geometrically (spatial — DINOv2). Ablations in the Prismatic VLM paper show that this combination outperforms either encoder alone on grounded visual reasoning tasks.

The visual features from both encoders are concatenated and projected through an MLP into the token embedding space of Llama 2 7B [5]. For a 224×224 input image, each encoder produces 256 patch tokens (a 16×16 grid of 14×14-pixel patches), giving 512 visual tokens total after concatenation.

The full input sequence to the transformer looks like:

$$[\text{SigLIP}_1, \ldots, \text{SigLIP}_{256}, \text{DINOv2}_1, \ldots, \text{DINOv2}_{256}, \ell_1, \ldots, \ell_m, a_1, \ldots, a_7]$$

where $\ell_1, \ldots, \ell_m$ are the language instruction tokens and $a_1, \ldots, a_7$ are the action tokens to be generated.
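As a concrete sketch, the sequence assembly can be illustrated with placeholder embeddings (the `build_input_sequence` helper and shapes are illustrative, not OpenVLA's actual code):

```python
import numpy as np

# Hypothetical sketch of assembling the transformer input sequence.
# Assumes 256 patch tokens per encoder, already projected into the
# LM embedding space, and Llama 2 7B's hidden size of 4096.
HIDDEN = 4096

def build_input_sequence(siglip_tokens, dinov2_tokens, lang_embeds):
    """Concatenate both encoders' visual tokens with the language-instruction
    embeddings. The 7 action tokens are generated autoregressively afterwards,
    so they are not part of the input sequence."""
    assert siglip_tokens.shape == (256, HIDDEN)
    assert dinov2_tokens.shape == (256, HIDDEN)
    return np.concatenate([siglip_tokens, dinov2_tokens, lang_embeds], axis=0)

seq = build_input_sequence(
    np.zeros((256, HIDDEN)),
    np.zeros((256, HIDDEN)),
    np.zeros((12, HIDDEN)),  # e.g. a 12-token instruction
)
print(seq.shape)  # (524, 4096): 512 visual tokens + 12 language tokens
```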

Training on Open X-Embodiment

OpenVLA is trained on the Open X-Embodiment (OXE) dataset [6] — specifically, the same data mixture used by RT-2-X — comprising approximately 970K robot trajectories across 22 robot embodiments. This includes single-arm manipulators (WidowX, Google Robot, xArm), bi-manual setups, and mobile manipulators.

The training recipe involves two stages:

  • Stage 1 — VLM pre-training: The Prismatic VLM is first trained on vision-language tasks (image captioning, visual question answering) to build a strong visual-linguistic foundation. The vision encoders remain frozen; only the projection MLP and the language model are updated.
  • Stage 2 — Robot fine-tuning: The full model (including vision encoders) is fine-tuned on OXE robot data. 256 tokens are reserved for actions — in practice by overwriting the 256 least-used tokens in the Llama tokenizer's vocabulary rather than enlarging it. The learning rate is lower than in stage 1, and the next-token prediction loss is computed over action tokens only.
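The two-stage trainability schedule can be summarised in a small sketch (module names here are hypothetical labels, not OpenVLA's actual identifiers):

```python
# Hypothetical sketch of which modules are trainable in each stage.
# Module names are illustrative placeholders.
MODULES = ["siglip_encoder", "dinov2_encoder", "projector_mlp", "llama_lm"]

def trainable_modules(stage):
    if stage == 1:    # VLM pre-training: vision encoders stay frozen
        return ["projector_mlp", "llama_lm"]
    elif stage == 2:  # robot fine-tuning: the full model is unfrozen
        return list(MODULES)
    raise ValueError(f"unknown stage: {stage}")

print(trainable_modules(1))  # ['projector_mlp', 'llama_lm']
print(trainable_modules(2))  # all four modules
```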

A key challenge with OXE is action space heterogeneity : different robots have different numbers of joints, different action scales, and different control modes (position, velocity, delta). OpenVLA normalises all actions to a common 7-dimensional format (6 DoF end-effector deltas + gripper) and scales each dimension to $[-1, 1]$ based on dataset statistics.

📌 Action space normalisation is nontrivial. The same "move 1 unit right" means completely different physical motions on a WidowX (small desktop arm) vs a Google Robot (large mobile manipulator). Getting this normalisation wrong can cause the model to learn conflicting action semantics.
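A minimal sketch of the per-dimension rescaling, assuming simple min/max dataset statistics (the released implementation uses robust, quantile-based statistics to limit the influence of outliers):

```python
import numpy as np

# Sketch of per-dimension action normalisation to [-1, 1].
# stats_min / stats_max are per-dataset, per-dimension statistics, so the
# same normalised value maps to very different physical motions on a
# WidowX vs a Google Robot.
def normalise_actions(actions, stats_min, stats_max):
    actions = np.asarray(actions, dtype=float)
    scaled = 2.0 * (actions - stats_min) / (stats_max - stats_min) - 1.0
    return np.clip(scaled, -1.0, 1.0)  # clamp outliers to the valid range

# Example: a 0.05 m delta on a dimension whose dataset range is [-0.1, 0.1] m.
a = normalise_actions([0.05], stats_min=np.array([-0.1]), stats_max=np.array([0.1]))
print(a)  # [0.5]
```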

Action Prediction Pipeline

At inference time, OpenVLA predicts actions autoregressively. Given a camera image $o_t$ and a language instruction $\ell$, the model generates 7 action tokens sequentially:

$$a_t^{(i)} \sim p_\theta\bigl(\cdot \;\big|\; a_t^{(1)}, \ldots, a_t^{(i-1)},\, o_t,\, \ell\bigr), \quad i = 1, \ldots, 7$$

Each token $a_t^{(i)}$ is an integer in $\{0, 1, \ldots, 255\}$ representing a bin in the corresponding action dimension. The 7 dimensions are:

  • $a^{(1)}, a^{(2)}, a^{(3)}$: End-effector position deltas ($\Delta x, \Delta y, \Delta z$)
  • $a^{(4)}, a^{(5)}, a^{(6)}$: End-effector rotation deltas (roll, pitch, yaw)
  • $a^{(7)}$: Gripper state (open/close — effectively binary, but still tokenised into 256 bins)
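The autoregressive decoding loop itself can be sketched with a stub in place of the real transformer (`policy_logits` is a hypothetical stand-in that would return logits over the 256 action bins given the context so far):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_logits(image, instruction, prev_tokens):
    # Stub: random logits over the 256 action bins. The real model
    # conditions on the image, instruction, and previous action tokens.
    return rng.standard_normal(256)

def predict_action_tokens(image, instruction):
    tokens = []
    for _ in range(7):  # one token per action dimension
        logits = policy_logits(image, instruction, tokens)
        tokens.append(int(np.argmax(logits)))  # greedy decoding
    return tokens

tokens = predict_action_tokens(image=None, instruction="pick up the cup")
print(len(tokens))  # 7 bin indices, one per action dimension
```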

After generating all 7 tokens, each bin index is de-tokenised back to a continuous value using the bin-centre formula from the previous article. The resulting 7-dimensional action vector is sent to the robot's controller.
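A sketch of this de-tokenisation step, assuming 256 uniform bins over $[-1, 1]$ and the same dataset statistics used during normalisation (`detokenise` is an illustrative helper, not OpenVLA's API):

```python
import numpy as np

N_BINS = 256

def detokenise(bin_indices, stats_min, stats_max):
    """Map each bin index in {0, ..., 255} to the centre of its bin in
    [-1, 1], then rescale to physical units with the dataset statistics."""
    idx = np.asarray(bin_indices, dtype=float)
    centres = -1.0 + (idx + 0.5) * (2.0 / N_BINS)  # bin centres in [-1, 1]
    return (centres + 1.0) / 2.0 * (stats_max - stats_min) + stats_min

# Bin 128 sits just above the midpoint of the [-0.1, 0.1] m range:
print(detokenise([128], stats_min=np.array([-0.1]), stats_max=np.array([0.1])))
```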

The overall loss during training is:

$$\mathcal{L}_{\text{OpenVLA}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{7} \log p_\theta\bigl(a_{t,n}^{(i)*} \;\big|\; a_{t,n}^{(1)*}, \ldots, a_{t,n}^{(i-1)*},\, o_{t,n},\, \ell_n\bigr)$$

where $N$ is the batch size and $a_{t,n}^{(i)*}$ denotes the ground-truth bin for dimension $i$ in sample $n$. Note that the loss is computed only over action tokens , not over image or language tokens — the model is not trained to reconstruct its inputs.
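The action-only masking can be illustrated as follows (a simplified single-sample sketch; the full loss also averages over the batch of $N$ samples as in the formula above):

```python
import numpy as np

def action_token_loss(logits, targets, action_mask):
    """logits: (T, 256) per-position bin logits; targets: (T,) ground-truth
    bin indices; action_mask: (T,) True only at the 7 action-token positions.
    Image and language positions are simply excluded from the sum."""
    logp = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    token_nll = -logp[np.arange(len(targets)), targets]
    return token_nll[action_mask].sum() / action_mask.sum()

T = 10                                # e.g. 3 language tokens + 7 action tokens
logits = np.zeros((T, 256))           # uniform logits, for illustration
targets = np.zeros(T, dtype=int)
mask = np.zeros(T, dtype=bool)
mask[3:] = True                       # loss only over the 7 action positions
print(round(float(action_token_loss(logits, targets, mask)), 4))  # log(256) ≈ 5.5452
```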

Fine-Tuning for New Robots

A generalist policy trained on OXE provides a strong starting point, but real-world deployment typically requires adaptation to a specific robot, environment, and task set. OpenVLA supports efficient fine-tuning via Low-Rank Adaptation (LoRA) [7].

LoRA freezes the pre-trained weights $W_0$ and adds low-rank update matrices $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with rank $r \ll d$. The effective weight becomes:

$$W = W_0 + \Delta W = W_0 + BA$$

With $r = 32$ (the default for OpenVLA), only ~1.4% of the total parameters are trainable, drastically reducing GPU memory requirements. Fine-tuning can be done on a single A100 GPU with as few as 20-50 demonstrations in the target domain.
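A minimal sketch of one LoRA-adapted linear layer: $W_0$ stays frozen and only the low-rank factors $A$ and $B$ are trained. The per-layer trainable fraction shown here ($2r/d$) differs from the overall ~1.4% figure because LoRA is applied only to selected weight matrices of the full model:

```python
import numpy as np

d, r = 4096, 32  # hidden size of Llama 2 7B; OpenVLA's default LoRA rank
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))        # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable, rank-r factor
B = np.zeros((d, r))                    # trainable; zero-init so ΔW = 0 at start

def lora_forward(x):
    # Equivalent to x @ (W0 + B @ A).T, but never materialises the full ΔW.
    return x @ W0.T + (x @ A.T) @ B.T

frac = (A.size + B.size) / W0.size      # trainable fraction for this layer
print(f"trainable fraction per layer: {frac:.4f}")  # 2*r/d = 0.0156
```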

OpenVLA was benchmarked against RT-2-X (the closest comparable model) on two robot platforms:

  • Google Robot: On table-top manipulation tasks, OpenVLA (7B) achieved comparable success rates to RT-2-X (55B) despite being 8× smaller.
  • WidowX: On Bridge V2 tasks, fine-tuned OpenVLA significantly outperformed RT-2-X, likely because the smaller model can be more precisely adapted with LoRA to the target domain.

💡 The fact that a 7B model matches or beats a 55B model on robotic tasks is striking. It suggests that for physical manipulation, the quality and relevance of training data may matter more than raw model scale — at least at current data volumes.

Quiz

Test your understanding of OpenVLA's architecture and training.

Why does OpenVLA use two separate vision encoders (SigLIP and DINOv2)?

How does OpenVLA handle the fact that different robots have different action spaces?

What does OpenVLA's training loss compute over?

What percentage of OpenVLA's parameters are trainable when using LoRA with rank r = 32?

What does the finding that OpenVLA (7B) matches RT-2-X (55B) suggest about current VLA scaling?