Why Tokenize Actions?
Large Language Models are token machines. They take in tokens, process them through transformer layers, and output a probability distribution over the next token. This autoregressive machinery is extraordinarily well-tested — billions of dollars of compute have gone into making it work reliably.
The core idea behind action tokenization is deceptively simple: if we can turn robot actions into tokens, we can reuse the entire LLM stack without modification. The vision encoder produces image tokens, the user provides language tokens, and the model generates action tokens — all flowing through the same transformer architecture.
This is attractive for several reasons. First, we inherit the LLM's pre-trained world knowledge and reasoning capabilities. Second, we avoid designing a custom action decoder. Third, the entire system can be trained end-to-end with the standard next-token prediction loss — cross-entropy over the vocabulary.
But actions are continuous (e.g., move the arm 0.0347 metres to the right), while tokens are discrete (integer indices into a vocabulary). We need a way to bridge this gap.
Uniform Binning
The simplest and most widely used tokenization strategy is uniform binning: divide each action dimension's range into $K$ equally-spaced bins and assign each bin a unique token index.
Given a continuous action value $a$ in dimension $i$ with known bounds $[a_{\min}, a_{\max}]$, the bin index is:

$$b = \operatorname{clip}\left(\left\lfloor K \cdot \frac{a - a_{\min}}{a_{\max} - a_{\min}} \right\rfloor,\; 0,\; K - 1\right)$$

(the clip handles the edge case $a = a_{\max}$, which would otherwise map to bin $K$).
Intuitively, this maps the continuous range $[a_{\min}, a_{\max}]$ onto the integers $\{0, 1, \ldots, K-1\}$. With $K = 256$ (the most common choice), each bin covers a range of $\frac{a_{\max} - a_{\min}}{256}$.
To convert back from a bin index to a continuous value (de-tokenization), we use the bin centre:

$$\hat{a} = a_{\min} + \frac{b + 0.5}{K}\,(a_{\max} - a_{\min})$$
The $+0.5$ centres the reconstructed value within its bin rather than placing it at the bin's left edge, minimising the maximum reconstruction error.
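A minimal sketch of both directions in plain Python (the function names and default bounds are illustrative, not taken from any RT-2 codebase):

```python
def tokenize(a, a_min=-1.0, a_max=1.0, K=256):
    """Map a continuous action value to a bin index in {0, ..., K-1}."""
    frac = (a - a_min) / (a_max - a_min)      # normalise to [0, 1]
    return min(max(int(frac * K), 0), K - 1)  # floor, then clip the a = a_max edge case

def detokenize(b, a_min=-1.0, a_max=1.0, K=256):
    """Map a bin index back to the continuous centre of its bin."""
    return a_min + (b + 0.5) * (a_max - a_min) / K
```

Round-tripping any value through `tokenize` and `detokenize` changes it by at most half a bin width.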
The maximum quantization error — the worst-case difference between the true action and its de-tokenized reconstruction — is bounded by half a bin width:

$$|a - \hat{a}| \le \frac{a_{\max} - a_{\min}}{2K}$$
For a typical action range of $[-1, 1]$ with $K = 256$, this gives a maximum error of $\frac{2}{512} \approx 0.0039$. For coarse tasks like pick-and-place, this precision is usually sufficient. For fine manipulation (e.g., inserting a USB plug), it may not be.
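The bound is easy to verify empirically; the sketch below sweeps the $[-1, 1]$ range and checks that no reconstruction error exceeds half a bin width (variable names are illustrative):

```python
K, a_min, a_max = 256, -1.0, 1.0
width = (a_max - a_min) / K  # bin width = 2/256
bound = width / 2            # half a bin width ≈ 0.0039

worst = 0.0
for i in range(100_001):
    a = a_min + (a_max - a_min) * i / 100_000  # sweep the action range
    b = min(int((a - a_min) / width), K - 1)   # bin index (clipped at a_max)
    a_hat = a_min + (b + 0.5) * width          # bin-centre reconstruction
    worst = max(worst, abs(a - a_hat))
```

The worst case occurs at bin edges, where the error equals exactly half a bin width.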
RT-2 — The First VLA
RT-2 (Robotics Transformer 2) [1] was the first model to demonstrate that a large VLM could be directly co-fine-tuned to output robot actions. Published by Google DeepMind in 2023, its key finding: web-scale visual-language pre-training transfers meaningfully to robotic control, enabling capabilities like semantic reasoning ("move the coke can to the Taylor Swift picture") that purely robotic models could not achieve.
RT-2 builds on two VLM backbones:
- PaLM-E (12B) [2] : A multimodal model that integrates ViT visual features directly into PaLM's embedding space.
- PaLI-X (55B) [3] : A vision-language model with a ViT-22B vision encoder and a 32B language model, representing the largest VLA at the time of publication.
The architecture is straightforward: take a pre-trained VLM, reserve 256 tokens in its vocabulary for the action bins (reusing existing integer tokens or overwriting the least frequently used ones, depending on the backbone), and co-fine-tune on a mixture of web data and robot demonstration data. At each timestep, the robot captures an image, the model receives the image tokens plus the language instruction, and it autoregressively generates 7 action tokens (6 DoF end-effector deltas + 1 gripper state).
The autoregressive action generation process factorises the joint action probability as:

$$p(a_t \mid o_t, \ell) = \prod_{i=1}^{d} p\left(a_t^{(i)} \,\middle|\, a_t^{(<i)}, o_t, \ell\right)$$

where $o_t$ is the image observation, $\ell$ is the language instruction, $a_t^{(i)}$ is the $i$-th action dimension's bin token, and $d = 7$. Each dimension is conditioned on all previously generated dimensions, allowing the model to capture dependencies between action dimensions (e.g., the chosen $z$-position can depend on the $x$- and $y$-positions already generated).
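The decoding loop can be sketched as follows. Here `toy_next_token_dist` is a stand-in for the real transformer forward pass, which would also condition on image and language tokens; all names are illustrative:

```python
import random

K, d = 256, 7  # 256 bins per dimension, 7 action dimensions

def toy_next_token_dist(prefix):
    """Stand-in for one transformer forward pass: scores over the K bin tokens,
    conditioned (here, only nominally) on the tokens generated so far."""
    rng = random.Random(sum(prefix))  # deterministic toy conditioning
    return [rng.random() for _ in range(K)]

def decode_action():
    """Autoregressively decode one 7-DoF action, one bin token per forward pass."""
    tokens = []
    for _ in range(d):
        scores = toy_next_token_dist(tokens)  # p(a^(i) | a^(<i), ...)
        tokens.append(max(range(K), key=scores.__getitem__))  # greedy pick
    return tokens
```

Note that producing one action takes $d$ sequential forward passes — the source of the latency concern raised later in this article.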
The training objective is the standard cross-entropy loss, summed over all action dimensions:

$$\mathcal{L} = -\sum_{i=1}^{d} \log p_\theta\left(a_t^{(i)*} \,\middle|\, a_t^{(<i)*}, o_t, \ell\right)$$

where $a_t^{(i)*}$ denotes the ground-truth bin for dimension $i$, $o_t$ the image observation, and $\ell$ the language instruction; as in standard teacher forcing, each dimension is conditioned on the ground-truth prefix.
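As a sketch, the per-timestep loss can be computed from raw logits like this (pure Python with illustrative names; a real implementation would use a framework's cross-entropy op):

```python
import math

def action_cross_entropy(logits_per_dim, target_bins):
    """Sum of -log p(ground-truth bin) over the action dimensions.

    logits_per_dim: list of d lists, each with one unnormalised score per bin.
    target_bins:    list of d ground-truth bin indices.
    """
    loss = 0.0
    for logits, target in zip(logits_per_dim, target_bins):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[target]  # -log softmax(target)
    return loss
```

With uniform logits over $K$ bins, each dimension contributes $\log K$ to the loss, the value at the start of training.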
A remarkable finding was that RT-2 could perform chain-of-thought reasoning for robotic tasks. When prompted with multi-step instructions or questions requiring spatial/semantic reasoning, the model could generate intermediate text reasoning steps before outputting actions — leveraging capabilities inherited from its VLM pre-training.
Limitations of Discrete Tokenization
While action tokenization elegantly repurposes the LLM machinery, it introduces several fundamental limitations:
- Uni-modal assumption: Autoregressive generation commits to one action sequence. At each dimension, the model picks the most likely bin. But many tasks have multi-modal action distributions — there are multiple correct ways to grasp an object. Autoregressive models can struggle to represent this: they may average between modes (producing an invalid "compromise" action) or collapse to always choosing one mode.
- Precision ceiling: With 256 bins, the resolution is fixed. Some tasks require sub-millimetre precision (circuit board insertion, surgical robotics) where the quantization error becomes the limiting factor. Increasing $K$ is possible but adds vocabulary size and inference cost.
- Sequential decoding latency: To predict a 7-DoF action, the model must run 7 forward passes through the transformer (one per dimension). For real-time control at 10 Hz, each forward pass must complete in under ~14 ms — challenging for models with billions of parameters.
- No temporal coherence: Each timestep's action is predicted independently. There is no mechanism to enforce smooth trajectories across time — the model might predict a sharp change in velocity from one step to the next, leading to jerky motion.
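The latency arithmetic behind the ~14 ms figure, using the numbers from the text:

```python
control_hz = 10        # target control frequency from the text
tokens_per_action = 7  # one forward pass per action dimension

step_budget_ms = 1000 / control_hz                # 100 ms per control step
per_pass_ms = step_budget_ms / tokens_per_action  # ≈ 14.3 ms per forward pass
```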
These limitations motivate an alternative family of approaches — continuous action generation via diffusion [4] or flow matching [5] — which we will explore in articles 4 and 5.
Quiz
Test your understanding of action tokenization and RT-2.
What is the primary advantage of turning robot actions into tokens?
If K = 256 bins are used to tokenize an action dimension with range [-1, 1], what is the maximum quantization error?
How does RT-2 generate a 7-DoF action at each timestep?
Why is sequential decoding latency a problem for autoregressive VLAs in real-time robot control?
What is the key capability that RT-2 inherited from its VLM pre-training that purely robotic models lacked?