Octo — A Hybrid Approach
While RT-2 and OpenVLA repurpose VLMs as action generators, and Diffusion Policy/π₀ use dedicated denoising heads, Octo [1] takes a different path: it is a transformer-based generalist policy designed from the ground up for multi-robot, multi-task learning — without relying on a pre-trained VLM at all.
Octo's key insight is to tokenize everything: images, language instructions, proprioceptive state, and actions are all converted into sequences of tokens and processed by a single transformer. This uniform representation allows the model to handle heterogeneous observation and action spaces across different robots.
The architecture has three stages:
- Observation tokenizer: Images are encoded with a ViT [2], language instructions with a pre-trained language model, and proprioceptive state through a linear layer. All are concatenated into a single token sequence.
- Transformer backbone: A standard transformer processes the combined token sequence using bidirectional attention (unlike GPT-style causal attention). Additionally, a set of readout tokens is appended — learnable tokens that attend to the full observation sequence and aggregate the information needed for action prediction.
- Action head: The readout token representations are passed to an action head. Octo supports two heads: a diffusion head (for multi-modal tasks) and a simple MLP head (for unimodal tasks, with lower compute cost).
Octo was trained on 800K trajectories from OXE and is fully open-source. Its "tokenize everything" approach makes it the most flexible architecture — adding a new sensor modality requires only defining a new tokenizer, not changing the backbone.
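The "tokenize everything" idea can be sketched in a few lines. This is a toy illustration with hypothetical names and stand-in tokenizers — the real model uses a ViT, a language-model encoder, and a learned linear projection — but it shows why adding a modality only means adding a tokenizer:

```python
# Toy sketch of Octo's "tokenize everything" interface (hypothetical
# names; each tokenizer here stands in for a learned encoder).
def tokenize_image(patches):
    return [("img", p) for p in patches]

def tokenize_language(text):
    return [("lang", w) for w in text.split()]

def tokenize_proprio(state):
    return [("proprio", x) for x in state]

NUM_READOUT = 4  # learnable readout tokens appended after the observations

def build_sequence(patches, text, state):
    tokens = (tokenize_image(patches)
              + tokenize_language(text)
              + tokenize_proprio(state))
    # Readout tokens attend to everything above; the action head reads
    # only their final representations.
    return tokens + [("readout", i) for i in range(NUM_READOUT)]

seq = build_sequence([0.1, 0.2], "pick up the cup", [0.0, 1.0])
readout = seq[-NUM_READOUT:]  # the only inputs the action head sees
```

Supporting a new sensor would mean writing one more `tokenize_*` function and concatenating its output — the backbone and action head are untouched.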
Autoregressive vs Diffusion vs Flow
We have now seen three distinct paradigms for action generation in VLAs. Let's compare them systematically.
Action representation:
- Autoregressive (RT-2 [3], OpenVLA [4]): Discrete bins ($K = 256$ per dimension). Actions are generated as tokens, one dimension at a time, inheriting the LLM vocabulary and generation machinery directly.
- Diffusion (Diffusion Policy [5], Octo [1]): Continuous vectors. A noise prediction network iteratively denoises random noise into clean action chunks. Requires a dedicated action head.
- Flow (π₀ [6]): Continuous vectors. A velocity prediction network transports noise to data along learned paths, integrating an ODE in a few deterministic steps rather than running a long stochastic denoising chain.
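The autoregressive discretisation can be made concrete. A minimal sketch, assuming uniform binning over a normalised action range of $[-1, 1]$ (the exact binning scheme varies by model):

```python
K = 256  # bins per action dimension, as in RT-2 / OpenVLA

def discretize(a, lo=-1.0, hi=1.0):
    """Map a continuous action value to a bin index in [0, K-1]."""
    idx = int((a - lo) / (hi - lo) * K)
    return min(max(idx, 0), K - 1)  # clamp out-of-range values

def undiscretize(idx, lo=-1.0, hi=1.0):
    """Map a bin index back to the centre of its bin."""
    return lo + (idx + 0.5) * (hi - lo) / K

# Round-tripping loses at most half a bin width (~0.004 for this range).
err = abs(undiscretize(discretize(0.3)) - 0.3)
```

The price of this representation is quantisation error and, more importantly, having to decode one token per action dimension at inference time.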
Multi-modal handling:
- Autoregressive: Struggles with multi-modal distributions. Mode averaging is a known failure case.
- Diffusion: Naturally handles multi-modal distributions — the stochastic reverse process can sample from different modes. This is a core strength.
- Flow: Also handles multi-modality well. The learned velocity field can diverge into different modes depending on the initial noise sample.
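Mode averaging is easiest to see with a toy example (hypothetical numbers): two equally valid grasp angles, say approaching a mug handle from either side.

```python
import random

# Two valid grasp angles; any deterministic regressor trained with MSE
# converges to their mean -- 0.0 here, a pose that hits neither grasp.
valid_grasps = [-1.0, +1.0]
mse_prediction = sum(valid_grasps) / len(valid_grasps)

# A sampling-based policy (diffusion/flow) instead draws noise and maps
# it to one of the modes, so every sample is a valid grasp.
random.seed(0)
noise = random.gauss(0.0, 1.0)
sampled = min(valid_grasps, key=lambda g: abs(g - noise))
```

The `min(..., key=...)` line is only a stand-in for the learned denoising/flow map; the point is that the stochastic input selects a mode instead of averaging over them.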
Inference speed:
- Autoregressive: $d$ forward passes per action (typically $d = 7$). Each pass is a full LLM forward pass (~50-200 ms for 7B models). This is the bottleneck for real-time control.
- Diffusion: 20-100 denoising steps per action chunk (~2-5 ms each). With $N = 20$ steps and $H = 16$: ~40-100 ms for 16 actions.
- Flow: 5-10 ODE steps per action chunk. With $N = 10$ and $H = 16$: ~20-50 ms for 16 actions. The fastest option.
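The per-action arithmetic behind these numbers, as a quick sanity check (latency figures are the illustrative ones from the text, not measurements):

```python
def autoregressive_ms(d=7, ms_per_pass=50.0):
    # One full LLM forward pass per action dimension, one action per decode.
    return d * ms_per_pass

def chunked_ms(steps, ms_per_step=2.5, H=16):
    # Diffusion/flow denoise a whole chunk of H actions at once, so the
    # cost is amortised across the chunk.
    return steps * ms_per_step / H

ar = autoregressive_ms()     # 350 ms per action
diff = chunked_ms(steps=20)  # amortised per-action cost, diffusion
flow = chunked_ms(steps=10)  # amortised per-action cost, flow
```

Chunking is doing most of the work here: even diffusion's larger step count is divided by $H$, which is why both chunked methods beat per-dimension autoregressive decoding.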
Pre-training leverage:
- Autoregressive: Maximum reuse of VLM pre-training. No architectural changes needed.
- Diffusion: Moderate. The VLM backbone provides conditioning features, but the denoiser must be trained from scratch.
- Flow: Similar to diffusion, but π₀ shows that the action expert can be trained alongside the VLM backbone with shared representations.
Scaling Laws for Robot Learning
In language modelling, scaling laws [7] predict that performance improves as a power law with model size, dataset size, and compute. Does the same hold for VLAs?
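For concreteness, one common parameterisation of these laws writes the loss as a sum of power-law terms in parameter count $N$ and dataset size $D$ (standard notation, not taken from this text; whether the same functional form holds for robot data is exactly the open question):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $E$ is the irreducible loss and $A$, $B$, $\alpha$, $\beta$ are constants fitted to training runs.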
Early evidence suggests a qualified yes:
- More data helps — with caveats: RT-2-X [8] showed that training on OXE (22 embodiments) improved performance compared to single-robot training. But the benefit saturates quickly for any single task; most gains come from improved generalisation to new tasks and objects.
- Model scale has diminishing returns (so far): OpenVLA (7B) matches RT-2-X (55B) on most benchmarks. Current robot datasets may be too small to benefit from larger models.
- VLM pre-training is a strong prior: Models initialised from VLMs consistently outperform those trained from scratch on robot data alone. Web knowledge (object recognition, spatial reasoning, language understanding) transfers powerfully to robotics.
- Cross-embodiment transfer is real but limited: A model trained on WidowX data can help performance on Google Robot tasks, but the transfer is weaker than within-embodiment scaling.
The VLA field is roughly where language modelling was in 2019-2020: we have shown that the approach works, but we are far from saturating the scaling curve because data collection remains orders of magnitude more expensive than text scraping.
Open Challenges
Despite remarkable recent progress, VLAs face several fundamental challenges:
- Data scarcity: The largest robot dataset (OXE, ~1M trajectories) is tiny compared to language or vision datasets. Simulation can help, but the sim-to-real gap — differences between simulated and real physics, rendering, and sensor noise — means simulated data does not transfer perfectly.
- Long-horizon planning: Current VLAs excel at short-horizon tasks (reach, grasp, place) but struggle with multi-step plans ("make a sandwich"). Hierarchical approaches are an active research direction (see SayCan [9] and Inner Monologue [10]).
- Dexterous manipulation: Most VLAs target parallel-jaw grippers with 7 DoF. Dexterous hands (20+ DoF) have exponentially larger action spaces, requiring either much more data or much better inductive biases.
- Safety and robustness: A language model that hallucinates produces wrong text. A robot that hallucinates produces dangerous physical motion. Ensuring that VLAs fail gracefully, respect workspace boundaries, and avoid harmful actions is critical for deployment outside controlled lab environments.
- Real-time constraints: Manipulation typically requires 10-30 Hz control. A 7B-parameter model emitting 7 autoregressive action tokens needs ~350 ms per action on an A100 GPU — an order of magnitude over the ~33 ms budget at 30 Hz. Smaller models, action chunking, and flow-based methods help, but deploying VLAs on edge hardware remains challenging.
- Evaluation: There is no "ImageNet for robotics" — no standard benchmark where all methods are compared fairly. Different papers evaluate on different robots, tasks, and success criteria. SIMPLER [11] aims to standardise simulation-based evaluation.
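The real-time constraint reduces to a simple budget check (a sketch; latency numbers are the illustrative ones used throughout this section):

```python
def control_budget_ms(target_hz):
    # Time available per control step at the target frequency.
    return 1000.0 / target_hz

def meets_rate(latency_ms, target_hz):
    return latency_ms <= control_budget_ms(target_hz)

# 7 autoregressive tokens at ~50 ms per forward pass: ~350 ms/action.
ar_ok = meets_rate(7 * 50.0, 30)
# A ~25 ms flow chunk amortised over 16 actions: ~1.6 ms/action.
flow_ok = meets_rate(10 * 2.5 / 16, 30)
```

At 30 Hz the budget is ~33 ms per step, so the autoregressive configuration misses it by roughly 10x while the amortised flow policy clears it easily.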
Despite these challenges, the trajectory is clear: VLAs are converging toward a unified architecture where a single model perceives, reasons, and acts. The debate is no longer whether foundation models can control robots, but how to scale them efficiently. Flow matching VLAs like π₀ represent the current frontier, but the field is moving fast — the next breakthrough may well redefine the architectural landscape entirely.
Quiz
Test your understanding of VLA architectures and the challenges ahead.
What is the role of "readout tokens" in Octo's architecture?
Which action prediction paradigm achieves the fastest inference for real-time robot control?
What does the fact that OpenVLA (7B) matches RT-2-X (55B) suggest about VLA scaling?
Which action prediction approach best handles multi-modal action distributions (e.g., multiple valid grasps)?
What is the "sim-to-real gap" and why does it matter for VLA training?