Architecture

OpenVLA [1] is the first open-source, 7-billion-parameter VLA trained to be a generalist robot policy. While RT-2 demonstrated the VLA concept, its weights were never released and it required Google-scale infrastructure. OpenVLA democratises the approach: its code, weights, and training recipe are fully open.

The model is built on the Prismatic VLM [2] framework, which features a carefully designed dual-encoder vision system:

  • SigLIP (ViT-SO400M) [3]: A contrastive vision-language encoder trained with a sigmoid loss. SigLIP excels at semantic understanding — recognising objects, understanding scenes, and grounding language to visual elements.
  • DINOv2 (ViT-L) [4]: A self-supervised vision encoder trained with a self-distillation objective (no language). DINOv2 captures fine-grained spatial and geometric features — edges, textures, depth cues, and object boundaries.

Why two encoders? Robotic manipulation requires both understanding what an object is (semantic — SigLIP) and where exactly it is and what it looks like geometrically (spatial — DINOv2). Ablations in the Prismatic VLM paper show that this combination outperforms either encoder alone on grounded visual reasoning tasks.

The visual features from both encoders are concatenated and projected through an MLP into the token embedding space of Llama 2 7B [5]. For a 224×224 input image, each encoder produces 256 patch tokens (a 16×16 grid of 14×14-pixel patches), giving 512 visual tokens total after concatenation.

The full input sequence to the transformer looks like:

$$[\text{SigLIP}_1, \ldots, \text{SigLIP}_{256}, \text{DINOv2}_1, \ldots, \text{DINOv2}_{256}, \ell_1, \ldots, \ell_m, a_1, \ldots, a_7]$$

where $\ell_1, \ldots, \ell_m$ are the language instruction tokens and $a_1, \ldots, a_7$ are the action tokens to be generated.
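As a concrete sketch, the sequence assembly can be illustrated with placeholder embeddings (the `build_input_sequence` helper and shapes are illustrative, not OpenVLA's actual code):

```python
import numpy as np

# Hypothetical sketch of assembling the transformer input sequence.
# Assumes 256 patch tokens per encoder, already projected into the
# LM embedding space, and Llama 2 7B's hidden size of 4096.
HIDDEN = 4096

def build_input_sequence(siglip_tokens, dinov2_tokens, lang_embeds):
    """Concatenate both encoders' visual tokens with the language-instruction
    embeddings. The 7 action tokens are generated autoregressively afterwards,
    so they are not part of the input sequence."""
    assert siglip_tokens.shape == (256, HIDDEN)
    assert dinov2_tokens.shape == (256, HIDDEN)
    return np.concatenate([siglip_tokens, dinov2_tokens, lang_embeds], axis=0)

seq = build_input_sequence(
    np.zeros((256, HIDDEN)),
    np.zeros((256, HIDDEN)),
    np.zeros((12, HIDDEN)),  # e.g. a 12-token instruction
)
print(seq.shape)  # (524, 4096): 512 visual tokens + 12 language tokens
```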

Training on Open X-Embodiment

OpenVLA is trained on the Open X-Embodiment (OXE) dataset [6] — specifically, the same data mixture used by RT-2-X — comprising approximately 970K robot trajectories across 22 robot embodiments. This includes single-arm manipulators (WidowX, Google Robot, xArm), bi-manual setups, and mobile manipulators.

The training recipe involves two stages:

  • Stage 1 — VLM pre-training: The Prismatic VLM is first trained on vision-language tasks (image captioning, visual question answering) to build a strong visual-linguistic foundation. The vision encoders remain frozen; only the projection MLP and the language model are updated.
  • Stage 2 — Robot fine-tuning: The full model (including vision encoders) is fine-tuned on OXE robot data. 256 tokens are reserved for actions — in practice by overwriting the 256 least-used tokens in the Llama tokenizer's vocabulary rather than enlarging it. The learning rate is lower than in stage 1, and the next-token prediction loss is computed over action tokens only.
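The two-stage trainability schedule can be summarised in a small sketch (module names here are hypothetical labels, not OpenVLA's actual identifiers):

```python
# Hypothetical sketch of which modules are trainable in each stage.
# Module names are illustrative placeholders.
MODULES = ["siglip_encoder", "dinov2_encoder", "projector_mlp", "llama_lm"]

def trainable_modules(stage):
    if stage == 1:    # VLM pre-training: vision encoders stay frozen
        return ["projector_mlp", "llama_lm"]
    elif stage == 2:  # robot fine-tuning: the full model is unfrozen
        return list(MODULES)
    raise ValueError(f"unknown stage: {stage}")

print(trainable_modules(1))  # ['projector_mlp', 'llama_lm']
print(trainable_modules(2))  # all four modules
```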

A key challenge with OXE is action space heterogeneity : different robots have different numbers of joints, different action scales, and different control modes (position, velocity, delta). OpenVLA normalises all actions to a common 7-dimensional format (6 DoF end-effector deltas + gripper) and scales each dimension to $[-1, 1]$ based on dataset statistics.

📌 Action space normalisation is nontrivial. The same "move 1 unit right" means completely different physical motions on a WidowX (small desktop arm) vs a Google Robot (large mobile manipulator). Getting this normalisation wrong can cause the model to learn conflicting action semantics.
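A minimal sketch of the per-dimension rescaling, assuming simple min/max dataset statistics (the released implementation uses robust, quantile-based statistics to limit the influence of outliers):

```python
import numpy as np

# Sketch of per-dimension action normalisation to [-1, 1].
# stats_min / stats_max are per-dataset, per-dimension statistics, so the
# same normalised value maps to very different physical motions on a
# WidowX vs a Google Robot.
def normalise_actions(actions, stats_min, stats_max):
    actions = np.asarray(actions, dtype=float)
    scaled = 2.0 * (actions - stats_min) / (stats_max - stats_min) - 1.0
    return np.clip(scaled, -1.0, 1.0)  # clamp outliers to the valid range

# Example: a 0.05 m delta on a dimension whose dataset range is [-0.1, 0.1] m.
a = normalise_actions([0.05], stats_min=np.array([-0.1]), stats_max=np.array([0.1]))
print(a)  # [0.5]
```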

Action Prediction Pipeline

At inference time, OpenVLA predicts actions autoregressively. Given a camera image $o_t$ and a language instruction $\ell$, the model generates 7 action tokens sequentially:

$$a_t^{(i)} \sim p_\theta\bigl(\cdot \;\big|\; a_t^{(1)}, \ldots, a_t^{(i-1)},\, o_t,\, \ell\bigr), \quad i = 1, \ldots, 7$$

Each token $a_t^{(i)}$ is an integer in $\{0, 1, \ldots, 255\}$ representing a bin in the corresponding action dimension. The 7 dimensions are:

  • $a^{(1)}, a^{(2)}, a^{(3)}$: End-effector position deltas ($\Delta x, \Delta y, \Delta z$)
  • $a^{(4)}, a^{(5)}, a^{(6)}$: End-effector rotation deltas (roll, pitch, yaw)
  • $a^{(7)}$: Gripper state (open/close — effectively binary, but still tokenised into 256 bins)
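The autoregressive decoding loop itself can be sketched with a stub in place of the real transformer (`policy_logits` is a hypothetical stand-in that would return logits over the 256 action bins given the context so far):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_logits(image, instruction, prev_tokens):
    # Stub: random logits over the 256 action bins. The real model
    # conditions on the image, instruction, and previous action tokens.
    return rng.standard_normal(256)

def predict_action_tokens(image, instruction):
    tokens = []
    for _ in range(7):  # one token per action dimension
        logits = policy_logits(image, instruction, tokens)
        tokens.append(int(np.argmax(logits)))  # greedy decoding
    return tokens

tokens = predict_action_tokens(image=None, instruction="pick up the cup")
print(len(tokens))  # 7 bin indices, one per action dimension
```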

After generating all 7 tokens, each bin index is de-tokenised back to a continuous value using the bin-centre formula from the previous article. The resulting 7-dimensional action vector is sent to the robot's controller.
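A sketch of this de-tokenisation step, assuming 256 uniform bins over $[-1, 1]$ and the same dataset statistics used during normalisation (`detokenise` is an illustrative helper, not OpenVLA's API):

```python
import numpy as np

N_BINS = 256

def detokenise(bin_indices, stats_min, stats_max):
    """Map each bin index in {0, ..., 255} to the centre of its bin in
    [-1, 1], then rescale to physical units with the dataset statistics."""
    idx = np.asarray(bin_indices, dtype=float)
    centres = -1.0 + (idx + 0.5) * (2.0 / N_BINS)  # bin centres in [-1, 1]
    return (centres + 1.0) / 2.0 * (stats_max - stats_min) + stats_min

# Bin 128 sits just above the midpoint of the [-0.1, 0.1] m range:
print(detokenise([128], stats_min=np.array([-0.1]), stats_max=np.array([0.1])))
```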

The overall loss during training is:

$$\mathcal{L}_{\text{OpenVLA}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{7} \log p_\theta\bigl(a_{t,n}^{(i)*} \;\big|\; a_{t,n}^{(1)*}, \ldots, a_{t,n}^{(i-1)*},\, o_{t,n},\, \ell_n\bigr)$$

where $N$ is the batch size and $a_{t,n}^{(i)*}$ denotes the ground-truth bin for dimension $i$ in sample $n$. Note that the loss is computed only over action tokens , not over image or language tokens — the model is not trained to reconstruct its inputs.
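The action-only masking can be illustrated as follows (a simplified single-sample sketch; the full loss also averages over the batch of $N$ samples as in the formula above):

```python
import numpy as np

def action_token_loss(logits, targets, action_mask):
    """logits: (T, 256) per-position bin logits; targets: (T,) ground-truth
    bin indices; action_mask: (T,) True only at the 7 action-token positions.
    Image and language positions are simply excluded from the sum."""
    logp = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    token_nll = -logp[np.arange(len(targets)), targets]
    return token_nll[action_mask].sum() / action_mask.sum()

T = 10                                # e.g. 3 language tokens + 7 action tokens
logits = np.zeros((T, 256))           # uniform logits, for illustration
targets = np.zeros(T, dtype=int)
mask = np.zeros(T, dtype=bool)
mask[3:] = True                       # loss only over the 7 action positions
print(round(float(action_token_loss(logits, targets, mask)), 4))  # log(256) ≈ 5.5452
```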

Fine-Tuning for New Robots

A generalist policy trained on OXE provides a strong starting point, but real-world deployment typically requires adaptation to a specific robot, environment, and task set. OpenVLA supports efficient fine-tuning via Low-Rank Adaptation (LoRA) [7].

LoRA freezes the pre-trained weights $W_0$ and adds low-rank update matrices $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with rank $r \ll d$. The effective weight becomes:

$$W = W_0 + \Delta W = W_0 + BA$$

With $r = 32$ (the default for OpenVLA), only ~1.4% of the total parameters are trainable, drastically reducing GPU memory requirements. Fine-tuning can be done on a single A100 GPU with as few as 20-50 demonstrations in the target domain.
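A minimal sketch of one LoRA-adapted linear layer: $W_0$ stays frozen and only the low-rank factors $A$ and $B$ are trained. The per-layer trainable fraction shown here ($2r/d$) differs from the overall ~1.4% figure because LoRA is applied only to selected weight matrices of the full model:

```python
import numpy as np

d, r = 4096, 32  # hidden size of Llama 2 7B; OpenVLA's default LoRA rank
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))        # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable, rank-r factor
B = np.zeros((d, r))                    # trainable; zero-init so ΔW = 0 at start

def lora_forward(x):
    # Equivalent to x @ (W0 + B @ A).T, but never materialises the full ΔW.
    return x @ W0.T + (x @ A.T) @ B.T

frac = (A.size + B.size) / W0.size      # trainable fraction for this layer
print(f"trainable fraction per layer: {frac:.4f}")  # 2*r/d = 0.0156
```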

OpenVLA was benchmarked against RT-2-X (the closest comparable model) on two robot platforms:

  • Google Robot: On table-top manipulation tasks, OpenVLA (7B) achieved comparable success rates to RT-2-X (55B) despite being 8× smaller.
  • WidowX: On Bridge V2 tasks, fine-tuned OpenVLA significantly outperformed RT-2-X, likely because the smaller model can be more precisely adapted with LoRA to the target domain.

💡 The fact that a 7B model matches or beats a 55B model on robotic tasks is striking. It suggests that for physical manipulation, the quality and relevance of training data may matter more than raw model scale — at least at current data volumes.

Quiz

Test your understanding of OpenVLA's architecture and training.

Why does OpenVLA use two separate vision encoders (SigLIP and DINOv2)?

How does OpenVLA handle the fact that different robots have different action spaces?

What does OpenVLA's training loss compute over?

What percentage of OpenVLA's parameters are trainable when using LoRA with rank r = 32?

What does the finding that OpenVLA (7B) matches RT-2-X (55B) suggest about current VLA scaling?