Octo — A Hybrid Approach
While RT-2 and OpenVLA repurpose VLMs as action generators, and Diffusion Policy/π₀ use dedicated denoising heads, Octo [1] takes a different path: it is a transformer-based generalist policy designed from the ground up for multi-robot, multi-task learning — without relying on a pre-trained VLM at all.
Octo's key insight is to tokenize everything: images, language instructions, proprioceptive state, and actions are all converted into sequences of tokens and processed by a single transformer. This uniform representation allows the model to handle heterogeneous observation and action spaces across different robots.
The architecture has three stages:
- Observation tokenizer: Images are encoded with a ViT [2], language instructions with a pre-trained language model, and proprioceptive state through a linear layer. All are concatenated into a single token sequence.
- Transformer backbone: A standard transformer processes the combined token sequence using bidirectional attention (unlike GPT-style causal attention). Additionally, a set of readout tokens is appended — learnable tokens that attend to the full observation sequence and aggregate the information needed for action prediction.
- Action head: The readout token representations are passed to an action head. Octo supports two heads: a diffusion head (for multi-modal tasks) and a simple MLP head (for unimodal tasks, with lower compute cost).
Octo was trained on 800K trajectories from OXE and is fully open-source. Its "tokenize everything" approach makes it the most flexible architecture — adding a new sensor modality requires only defining a new tokenizer, not changing the backbone.
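The "tokenize everything" idea can be sketched in a few lines. This is a toy illustration with hypothetical names and stand-in tokenizers — the real model uses a ViT, a language-model encoder, and a learned linear projection — but it shows why adding a modality only means adding a tokenizer:

```python
# Toy sketch of Octo's "tokenize everything" interface (hypothetical
# names; each tokenizer here stands in for a learned encoder).
def tokenize_image(patches):
    return [("img", p) for p in patches]

def tokenize_language(text):
    return [("lang", w) for w in text.split()]

def tokenize_proprio(state):
    return [("proprio", x) for x in state]

NUM_READOUT = 4  # learnable readout tokens appended after the observations

def build_sequence(patches, text, state):
    tokens = (tokenize_image(patches)
              + tokenize_language(text)
              + tokenize_proprio(state))
    # Readout tokens attend to everything above; the action head reads
    # only their final representations.
    return tokens + [("readout", i) for i in range(NUM_READOUT)]

seq = build_sequence([0.1, 0.2], "pick up the cup", [0.0, 1.0])
readout = seq[-NUM_READOUT:]  # the only inputs the action head sees
```

Supporting a new sensor would mean writing one more `tokenize_*` function and concatenating its output — the backbone and action head are untouched.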
Autoregressive vs Diffusion vs Flow
We have now seen three distinct paradigms for action generation in VLAs. Let's compare them systematically.
Action representation:
- Autoregressive (RT-2 [3], OpenVLA [4]): Discrete bins ($K = 256$ per dimension). Actions are generated as tokens, one dimension at a time, inheriting the LLM vocabulary and generation machinery directly.
- Diffusion (Diffusion Policy [5], Octo [1]): Continuous vectors. A noise prediction network iteratively denoises random noise into clean action chunks. Requires a dedicated action head.
- Flow (π₀ [6]): Continuous vectors. A velocity prediction network transports noise to data along learned paths, integrating an ODE in a few deterministic steps rather than running a long stochastic denoising chain.
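The autoregressive discretisation can be made concrete. A minimal sketch, assuming uniform binning over a normalised action range of $[-1, 1]$ (the exact binning scheme varies by model):

```python
K = 256  # bins per action dimension, as in RT-2 / OpenVLA

def discretize(a, lo=-1.0, hi=1.0):
    """Map a continuous action value to a bin index in [0, K-1]."""
    idx = int((a - lo) / (hi - lo) * K)
    return min(max(idx, 0), K - 1)  # clamp out-of-range values

def undiscretize(idx, lo=-1.0, hi=1.0):
    """Map a bin index back to the centre of its bin."""
    return lo + (idx + 0.5) * (hi - lo) / K

# Round-tripping loses at most half a bin width (~0.004 for this range).
err = abs(undiscretize(discretize(0.3)) - 0.3)
```

The price of this representation is quantisation error and, more importantly, having to decode one token per action dimension at inference time.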
Multi-modal handling:
- Autoregressive: Struggles with multi-modal distributions. Mode averaging is a known failure case.
- Diffusion: Naturally handles multi-modal distributions — the stochastic reverse process can sample from different modes. This is a core strength.
- Flow: Also handles multi-modality well. The learned velocity field can diverge into different modes depending on the initial noise sample.
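Mode averaging is easiest to see with a toy example (hypothetical numbers): two equally valid grasp angles, say approaching a mug handle from either side.

```python
import random

# Two valid grasp angles; any deterministic regressor trained with MSE
# converges to their mean -- 0.0 here, a pose that hits neither grasp.
valid_grasps = [-1.0, +1.0]
mse_prediction = sum(valid_grasps) / len(valid_grasps)

# A sampling-based policy (diffusion/flow) instead draws noise and maps
# it to one of the modes, so every sample is a valid grasp.
random.seed(0)
noise = random.gauss(0.0, 1.0)
sampled = min(valid_grasps, key=lambda g: abs(g - noise))
```

The `min(..., key=...)` line is only a stand-in for the learned denoising/flow map; the point is that the stochastic input selects a mode instead of averaging over them.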
Inference speed:
- Autoregressive: $d$ forward passes per action (typically $d = 7$). Each pass is a full LLM forward pass (~50-200 ms for 7B models). This is the bottleneck for real-time control.
- Diffusion: 20-100 denoising steps per action chunk (~2-5 ms each). With $N = 20$ steps and $H = 16$: ~40-100 ms for 16 actions.
- Flow: 5-10 ODE steps per action chunk. With $N = 10$ and $H = 16$: ~20-50 ms for 16 actions. The fastest option.
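The per-action arithmetic behind these numbers, as a quick sanity check (latency figures are the illustrative ones from the text, not measurements):

```python
def autoregressive_ms(d=7, ms_per_pass=50.0):
    # One full LLM forward pass per action dimension, one action per decode.
    return d * ms_per_pass

def chunked_ms(steps, ms_per_step=2.5, H=16):
    # Diffusion/flow denoise a whole chunk of H actions at once, so the
    # cost is amortised across the chunk.
    return steps * ms_per_step / H

ar = autoregressive_ms()     # 350 ms per action
diff = chunked_ms(steps=20)  # amortised per-action cost, diffusion
flow = chunked_ms(steps=10)  # amortised per-action cost, flow
```

Chunking is doing most of the work here: even diffusion's larger step count is divided by $H$, which is why both chunked methods beat per-dimension autoregressive decoding.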
Pre-training leverage:
- Autoregressive: Maximum reuse of VLM pre-training. No architectural changes needed.
- Diffusion: Moderate. The VLM backbone provides conditioning features, but the denoiser must be trained from scratch.
- Flow: Similar to diffusion, but π₀ shows that the action expert can be trained alongside the VLM backbone with shared representations.
Scaling Laws for Robot Learning
In language modelling, scaling laws [7] predict that performance improves as a power law with model size, dataset size, and compute. Does the same hold for VLAs?
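For concreteness, one common parameterisation of these laws writes the loss as a sum of power-law terms in parameter count $N$ and dataset size $D$ (standard notation, not taken from this text; whether the same functional form holds for robot data is exactly the open question):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $E$ is the irreducible loss and $A$, $B$, $\alpha$, $\beta$ are constants fitted to training runs.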
Early evidence suggests a qualified yes:
- More data helps — with caveats: RT-2-X [8] showed that training on OXE (22 embodiments) improved performance compared to single-robot training. But the benefit saturates quickly for any single task; most gains come from improved generalisation to new tasks and objects.
- Model scale has diminishing returns (so far): OpenVLA (7B) matches RT-2-X (55B) on most benchmarks. Current robot datasets may be too small to benefit from larger models.
- VLM pre-training is a strong prior: Models initialised from VLMs consistently outperform those trained from scratch on robot data alone. Web knowledge (object recognition, spatial reasoning, language understanding) transfers powerfully to robotics.
- Cross-embodiment transfer is real but limited: A model trained on WidowX data can help performance on Google Robot tasks, but the transfer is weaker than within-embodiment scaling.
The VLA field is roughly where language modelling was in 2019-2020: we have shown that the approach works, but we are far from saturating the scaling curve because data collection remains orders of magnitude more expensive than text scraping.
Open Challenges
Despite remarkable recent progress, VLAs face several fundamental challenges:
- Data scarcity: The largest robot dataset (OXE, ~1M trajectories) is tiny compared to language or vision datasets. Simulation can help, but the sim-to-real gap — differences between simulated and real physics, rendering, and sensor noise — means simulated data does not transfer perfectly.
- Long-horizon planning: Current VLAs excel at short-horizon tasks (reach, grasp, place) but struggle with multi-step plans ("make a sandwich"). Hierarchical approaches are an active research direction (see SayCan [9] and Inner Monologue [10]).
- Dexterous manipulation: Most VLAs target parallel-jaw grippers with 7 DoF. Dexterous hands (20+ DoF) have exponentially larger action spaces, requiring either much more data or much better inductive biases.
- Safety and robustness: A language model that hallucinates produces wrong text. A robot that hallucinates produces dangerous physical motion. Ensuring that VLAs fail gracefully, respect workspace boundaries, and avoid harmful actions is critical for deployment outside controlled lab environments.
- Real-time constraints: Manipulation typically requires 10-30 Hz control. A 7B-parameter model emitting 7 autoregressive action tokens needs ~350 ms per action on an A100 GPU — an order of magnitude over the ~33 ms budget at 30 Hz. Smaller models, action chunking, and flow-based methods help, but deploying VLAs on edge hardware remains challenging.
- Evaluation: There is no "ImageNet for robotics" — no standard benchmark where all methods are compared fairly. Different papers evaluate on different robots, tasks, and success criteria. SIMPLER [11] aims to standardise simulation-based evaluation.
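The real-time constraint reduces to a simple budget check (a sketch; latency numbers are the illustrative ones used throughout this section):

```python
def control_budget_ms(target_hz):
    # Time available per control step at the target frequency.
    return 1000.0 / target_hz

def meets_rate(latency_ms, target_hz):
    return latency_ms <= control_budget_ms(target_hz)

# 7 autoregressive tokens at ~50 ms per forward pass: ~350 ms/action.
ar_ok = meets_rate(7 * 50.0, 30)
# A ~25 ms flow chunk amortised over 16 actions: ~1.6 ms/action.
flow_ok = meets_rate(10 * 2.5 / 16, 30)
```

At 30 Hz the budget is ~33 ms per step, so the autoregressive configuration misses it by roughly 10x while the amortised flow policy clears it easily.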
Despite these challenges, the trajectory is clear: VLAs are converging toward a unified architecture where a single model perceives, reasons, and acts. The debate is no longer whether foundation models can control robots, but how to scale them efficiently. Flow matching VLAs like π₀ represent the current frontier, but the field is moving fast — the next breakthrough may well redefine the architectural landscape entirely.
Quiz
Test your understanding of VLA architectures and the challenges ahead.
What is the role of "readout tokens" in Octo's architecture?
Which action prediction paradigm achieves the fastest inference for real-time robot control?
What does the fact that OpenVLA (7B) matches RT-2-X (55B) suggest about VLA scaling?
Which action prediction approach best handles multi-modal action distributions (e.g., multiple valid grasps)?
What is the "sim-to-real gap" and why does it matter for VLA training?