Why Tokenize Actions?
Large Language Models are token machines. They take in tokens, process them through transformer layers, and output a probability distribution over the next token. This autoregressive machinery is extraordinarily well-tested — billions of dollars of compute have gone into making it work reliably.
The core idea behind action tokenization is deceptively simple: if we can turn robot actions into tokens, we can reuse the entire LLM stack without modification. The vision encoder produces image tokens, the user provides language tokens, and the model generates action tokens — all flowing through the same transformer architecture.
This is attractive for several reasons. First, we inherit the LLM's pre-trained world knowledge and reasoning capabilities. Second, we avoid designing a custom action decoder. Third, the entire system can be trained end-to-end with the standard next-token prediction loss — cross-entropy over the vocabulary.
But actions are continuous (e.g., move the arm 0.0347 metres to the right), while tokens are discrete (integer indices into a vocabulary). We need a way to bridge this gap.
Uniform Binning
The simplest and most widely used tokenization strategy is uniform binning: divide each action dimension's range into $K$ equally-spaced bins and assign each bin a unique token index.
Given a continuous action value $a$ in dimension $i$ with known bounds $[a_{\min}, a_{\max}]$, the bin index is:

$$b = \operatorname{clip}\left(\left\lfloor K \cdot \frac{a - a_{\min}}{a_{\max} - a_{\min}} \right\rfloor,\; 0,\; K - 1\right)$$

(the clip handles the edge case $a = a_{\max}$, which would otherwise map to bin $K$).
Intuitively, this maps the continuous range $[a_{\min}, a_{\max}]$ onto the integers $\{0, 1, \ldots, K-1\}$. With $K = 256$ (the most common choice), each bin covers a range of $\frac{a_{\max} - a_{\min}}{256}$.
To convert back from a bin index to a continuous value (de-tokenization), we use the bin centre:

$$\hat{a} = a_{\min} + \frac{b + 0.5}{K}\,(a_{\max} - a_{\min})$$
The $+0.5$ centres the reconstructed value within its bin rather than placing it at the bin's left edge, minimising the maximum reconstruction error.
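A minimal sketch of both directions in plain Python (the function names and default bounds are illustrative, not taken from any RT-2 codebase):

```python
def tokenize(a, a_min=-1.0, a_max=1.0, K=256):
    """Map a continuous action value to a bin index in {0, ..., K-1}."""
    frac = (a - a_min) / (a_max - a_min)      # normalise to [0, 1]
    return min(max(int(frac * K), 0), K - 1)  # floor, then clip the a = a_max edge case

def detokenize(b, a_min=-1.0, a_max=1.0, K=256):
    """Map a bin index back to the continuous centre of its bin."""
    return a_min + (b + 0.5) * (a_max - a_min) / K
```

Round-tripping any value through `tokenize` and `detokenize` changes it by at most half a bin width.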
The maximum quantization error — the worst-case difference between the true action and its de-tokenized reconstruction — is bounded by half a bin width:

$$|a - \hat{a}| \le \frac{a_{\max} - a_{\min}}{2K}$$
For a typical action range of $[-1, 1]$ with $K = 256$, this gives a maximum error of $\frac{2}{512} \approx 0.0039$. For coarse tasks like pick-and-place, this precision is usually sufficient. For fine manipulation (e.g., inserting a USB plug), it may not be.
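The bound is easy to verify empirically; the sketch below sweeps the $[-1, 1]$ range and checks that no reconstruction error exceeds half a bin width (variable names are illustrative):

```python
K, a_min, a_max = 256, -1.0, 1.0
width = (a_max - a_min) / K  # bin width = 2/256
bound = width / 2            # half a bin width ≈ 0.0039

worst = 0.0
for i in range(100_001):
    a = a_min + (a_max - a_min) * i / 100_000  # sweep the action range
    b = min(int((a - a_min) / width), K - 1)   # bin index (clipped at a_max)
    a_hat = a_min + (b + 0.5) * width          # bin-centre reconstruction
    worst = max(worst, abs(a - a_hat))
```

The worst case occurs at bin edges, where the error equals exactly half a bin width.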
RT-2 — The First VLA
RT-2 (Robotics Transformer 2) [1] was the first model to demonstrate that a large VLM could be directly co-fine-tuned to output robot actions. Published by Google DeepMind in 2023, its key finding: web-scale visual-language pre-training transfers meaningfully to robotic control, enabling capabilities like semantic reasoning ("move the coke can to the Taylor Swift picture") that purely robotic models could not achieve.
RT-2 builds on two VLM backbones:
- PaLM-E (12B) [2] : A multimodal model that integrates ViT visual features directly into PaLM's embedding space.
- PaLI-X (55B) [3] : A vision-language model with a ViT-22B vision encoder and a 32B language model, representing the largest VLA at the time of publication.
The architecture is straightforward: take a pre-trained VLM, reserve 256 tokens in its vocabulary for the action bins (reusing existing integer tokens or overwriting the least frequently used ones, depending on the backbone), and co-fine-tune on a mixture of web data and robot demonstration data. At each timestep, the robot captures an image, the model receives the image tokens plus the language instruction, and it autoregressively generates 7 action tokens (6 DoF end-effector deltas + 1 gripper state).
The autoregressive action generation process factorises the joint action probability as:

$$p(a_t \mid o_t, \ell) = \prod_{i=1}^{d} p\left(a_t^{(i)} \,\middle|\, a_t^{(<i)}, o_t, \ell\right)$$

where $o_t$ is the image observation, $\ell$ is the language instruction, $a_t^{(i)}$ is the $i$-th action dimension's bin token, and $d = 7$. Each dimension is conditioned on all previously generated dimensions, allowing the model to capture dependencies between action dimensions (e.g., the chosen $z$-position can depend on the $x$- and $y$-positions already generated).
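The decoding loop can be sketched as follows. Here `toy_next_token_dist` is a stand-in for the real transformer forward pass, which would also condition on image and language tokens; all names are illustrative:

```python
import random

K, d = 256, 7  # 256 bins per dimension, 7 action dimensions

def toy_next_token_dist(prefix):
    """Stand-in for one transformer forward pass: scores over the K bin tokens,
    conditioned (here, only nominally) on the tokens generated so far."""
    rng = random.Random(sum(prefix))  # deterministic toy conditioning
    return [rng.random() for _ in range(K)]

def decode_action():
    """Autoregressively decode one 7-DoF action, one bin token per forward pass."""
    tokens = []
    for _ in range(d):
        scores = toy_next_token_dist(tokens)  # p(a^(i) | a^(<i), ...)
        tokens.append(max(range(K), key=scores.__getitem__))  # greedy pick
    return tokens
```

Note that producing one action takes $d$ sequential forward passes — the source of the latency concern raised later in this article.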
The training objective is the standard cross-entropy loss, summed over all action dimensions:

$$\mathcal{L} = -\sum_{i=1}^{d} \log p_\theta\left(a_t^{(i)*} \,\middle|\, a_t^{(<i)*}, o_t, \ell\right)$$

where $a_t^{(i)*}$ denotes the ground-truth bin for dimension $i$, $o_t$ the image observation, and $\ell$ the language instruction; as in standard teacher forcing, each dimension is conditioned on the ground-truth prefix.
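As a sketch, the per-timestep loss can be computed from raw logits like this (pure Python with illustrative names; a real implementation would use a framework's cross-entropy op):

```python
import math

def action_cross_entropy(logits_per_dim, target_bins):
    """Sum of -log p(ground-truth bin) over the action dimensions.

    logits_per_dim: list of d lists, each with one unnormalised score per bin.
    target_bins:    list of d ground-truth bin indices.
    """
    loss = 0.0
    for logits, target in zip(logits_per_dim, target_bins):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[target]  # -log softmax(target)
    return loss
```

With uniform logits over $K$ bins, each dimension contributes $\log K$ to the loss, the value at the start of training.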
A remarkable finding was that RT-2 could perform chain-of-thought reasoning for robotic tasks. When prompted with multi-step instructions or questions requiring spatial/semantic reasoning, the model could generate intermediate text reasoning steps before outputting actions — leveraging capabilities inherited from its VLM pre-training.
Limitations of Discrete Tokenization
While action tokenization elegantly repurposes the LLM machinery, it introduces several fundamental limitations:
- Uni-modal assumption: Autoregressive generation commits to one action sequence. At each dimension, the model picks the most likely bin. But many tasks have multi-modal action distributions — there are multiple correct ways to grasp an object. Autoregressive models can struggle to represent this: they may average between modes (producing an invalid "compromise" action) or collapse to always choosing one mode.
- Precision ceiling: With 256 bins, the resolution is fixed. Some tasks require sub-millimetre precision (circuit board insertion, surgical robotics) where the quantization error becomes the limiting factor. Increasing $K$ is possible but adds vocabulary size and inference cost.
- Sequential decoding latency: To predict a 7-DoF action, the model must run 7 forward passes through the transformer (one per dimension). For real-time control at 10 Hz, each forward pass must complete in under ~14 ms — challenging for models with billions of parameters.
- No temporal coherence: Each timestep's action is predicted independently. There is no mechanism to enforce smooth trajectories across time — the model might predict a sharp change in velocity from one step to the next, leading to jerky motion.
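The latency arithmetic behind the ~14 ms figure, using the numbers from the text:

```python
control_hz = 10        # target control frequency from the text
tokens_per_action = 7  # one forward pass per action dimension

step_budget_ms = 1000 / control_hz                # 100 ms per control step
per_pass_ms = step_budget_ms / tokens_per_action  # ≈ 14.3 ms per forward pass
```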
These limitations motivate an alternative family of approaches — continuous action generation via diffusion [4] or flow matching [5] — which we will explore in articles 4 and 5.
Quiz
Test your understanding of action tokenization and RT-2.
What is the primary advantage of turning robot actions into tokens?
If K = 256 bins are used to tokenize an action dimension with range [-1, 1], what is the maximum quantization error?
How does RT-2 generate a 7-DoF action at each timestep?
Why is sequential decoding latency a problem for autoregressive VLAs in real-time robot control?
What is the key capability that RT-2 inherited from its VLM pre-training that purely robotic models lacked?