Why Diffusion for Actions?

The discrete tokenization approach (RT-2, OpenVLA) is elegant but makes a strong assumption: for any given observation and instruction, there is essentially one best action. The model picks the highest-probability bin for each dimension and moves on.

In practice, robotic tasks are often multi-modal — there are multiple equally valid ways to accomplish the same goal. Consider picking up a mug: you could approach from the left, the right, or from above. You could grasp the handle or the body. Each is a perfectly valid action, but they are very different in action space.

If the training data contains demonstrations of all these strategies, an autoregressive model faces a problem: it may try to average between modes, producing an action that falls between two valid grasps — and is itself invalid (e.g., reaching for empty space between the handle and the body).

Diffusion models are designed to handle exactly this: they can represent complex, multi-modal distributions and sample from them. Instead of predicting a single action, they generate actions by iteratively denoising random noise into a coherent action, naturally capturing the full distribution of valid behaviours.

💡 A useful analogy: autoregressive models are like a GPS that picks one route. Diffusion models are like a map that shows all possible routes and lets you sample one — naturally handling the case where multiple routes are equally good.

DDPM Recap

Diffusion Policy [1] is built on Denoising Diffusion Probabilistic Models (DDPM) [2]. Before diving into the robotic application, let's quickly review how DDPM works.

The core idea has two phases:

Forward process (adding noise): Start with a clean data sample $x_0$ (in our case, an action sequence) and progressively add Gaussian noise over $T$ steps according to a variance schedule $\beta_1, \ldots, \beta_T$:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I)$$

A useful property lets us jump directly to any timestep without iterating:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$. At $t = T$, $x_T$ is approximately pure Gaussian noise.
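The closed-form jump can be written in a few lines of NumPy. The linear variance schedule and the chunk shape below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear variance schedule (not the paper's exact choice).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)          # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Jump straight to noise level t: x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(16, 7))                # e.g. a 16-step chunk of 7-DoF actions
eps = rng.normal(size=x0.shape)
x_noisy = q_sample(x0, T - 1, eps)           # heavily corrupted sample
```

Because $\bar{\alpha}_t$ is a running product of factors below one, it decreases monotonically — the signal fraction shrinks as $t$ grows.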

Reverse process (removing noise): Train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ that was added at each step. Given a noisy sample $x_t$, the model predicts the noise, and we take a small step toward the clean data:

$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

where $z \sim \mathcal{N}(0, I)$ and $\sigma_t$ is the noise scale for step $t$. Starting from pure noise $x_T \sim \mathcal{N}(0, I)$, we iterate this reverse step $T$ times to arrive at a clean sample $x_0$.
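The reverse update is just a loop over that equation. In this standalone sketch the trained network is replaced by a zero-returning stub; a real $\epsilon_\theta$ would be a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                       # illustrative, shorter than typical
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Stand-in for the trained denoiser; a real model predicts the added noise.
    return np.zeros_like(x_t)

x = rng.normal(size=(16, 7))                 # x_T ~ N(0, I)
for t in reversed(range(T)):
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x - coef * eps_theta(x, t)) / np.sqrt(1.0 - betas[t])
    z = rng.normal(size=x.shape) if t > 0 else 0.0   # no noise on the final step
    x = mean + np.sqrt(betas[t]) * z
```

Note the conventional detail that no noise is added on the very last step, so the output is the mean of the final reverse distribution.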

The training objective is simple — minimise the mean squared error between the predicted and actual noise:

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]$$

Intuitively: we take clean data, corrupt it with random noise, and train the model to undo the corruption. The model becomes a general-purpose denoiser that can transform random noise into data.

Diffusion Policy Architecture

Diffusion Policy applies DDPM to robot action prediction. Instead of generating images, the diffusion model generates action sequences — specifically, a chunk of $H$ future actions conditioned on the current observation.

The model takes as input:

  • Noisy action chunk: $A_t^k \in \mathbb{R}^{H \times d}$ — a sequence of $H$ actions, each $d$-dimensional, at diffusion step $k$
  • Diffusion timestep: $k \in \{1, \ldots, K\}$ indicating the noise level
  • Observation features: $O_t$ — visual features from a CNN or ViT encoder, possibly combined with proprioceptive state (joint angles)

And outputs a noise prediction $\hat{\epsilon} \in \mathbb{R}^{H \times d}$. The denoiser architecture can be either:

  • 1D temporal U-Net: Treats the action chunk as a 1D signal along the time axis. Convolutional layers capture local temporal patterns, while skip connections preserve fine-grained details. The observation features are injected via FiLM conditioning. This is the more compute-efficient option.
  • Transformer: Treats each action timestep as a token. Observation features and the diffusion timestep are prepended as additional tokens. Self-attention captures long-range temporal dependencies. This scales better but is more expensive.
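FiLM conditioning is simple enough to show directly: the observation embedding is mapped to per-channel scale and shift parameters that modulate the U-Net's convolutional activations. The shapes and parameter names below are illustrative, not the reference implementation:

```python
import numpy as np

def film(features, obs_emb, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM: modulate conv features channel-wise using the observation embedding.

    features: (C, L) activations over the action-time axis
    obs_emb:  (D,)   observation feature vector
    """
    gamma = W_gamma @ obs_emb + b_gamma      # (C,) per-channel scale
    beta = W_beta @ obs_emb + b_beta         # (C,) per-channel shift
    return gamma[:, None] * features + beta[:, None]

C, L, D = 64, 16, 128                        # channels, chunk length, embedding dim
rng = np.random.default_rng(0)
out = film(rng.normal(size=(C, L)), rng.normal(size=D),
           rng.normal(size=(C, D)), np.zeros(C),
           rng.normal(size=(C, D)), np.zeros(C))
```

In the actual architecture the projection weights are learned per U-Net block, so each resolution level gets its own view of the observation.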

The conditioned denoising process can be written as:

$$A_t^{k-1} = \text{Denoise}(A_t^k,\, k,\, O_t;\, \theta)$$

Starting from $A_t^K \sim \mathcal{N}(0, I)$ and iterating $K$ times produces the clean action chunk $A_t^0 \in \mathbb{R}^{H \times d}$.

The Action Chunk Idea

One of the most important innovations in Diffusion Policy is action chunking — predicting not just the next single action, but an entire sequence of $H$ future actions at once.

Why predict a chunk instead of a single action?

  • Temporal consistency: Single-step prediction can produce jerky trajectories — each step is predicted independently with no guarantee of smoothness. By predicting $H$ steps at once, the model learns to produce smooth, coherent trajectories.
  • Multi-modality over trajectories: With single-step prediction, multi-modality manifests as oscillation between modes (step left, step right, step left...). With chunks, the model commits to a complete strategy for $H$ steps, avoiding incoherent mode-switching.
  • Amortised computation: Instead of running $K$ denoising steps for every single action, we run $K$ steps once and get $H$ actions. If $H = 16$ and the control frequency is 10 Hz, one denoising pass covers 1.6 seconds of future actions.

In practice, not all $H$ predicted actions are executed. A common strategy is receding-horizon execution: execute only the first $h < H$ actions (e.g., $h = 8$), then re-predict a new chunk from the updated observation. This gives a balance between temporal consistency (from the chunk) and reactivity (from frequent re-planning).
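The execute-then-re-plan loop can be sketched as follows. `ToyEnv` and `predict_chunk` are stand-ins for a real environment and a real denoising pass, introduced only for illustration:

```python
class ToyEnv:
    """Minimal stand-in environment (illustrative, not a real robot API)."""
    def __init__(self):
        self.state = 0.0
    def reset(self):
        self.state = 0.0
        return self.state
    def step(self, action):
        self.state += action
        return self.state

H, h = 16, 8                            # chunk length and executed prefix

def predict_chunk(obs):
    # Stand-in for one full K-step denoising pass conditioned on obs.
    return [0.1] * H

def control_loop(env, n_steps=32):
    """Receding-horizon execution: predict H actions, run the first h, re-plan."""
    obs = env.reset()
    executed = 0
    while executed < n_steps:
        chunk = predict_chunk(obs)      # full H-step plan
        for a in chunk[:h]:             # execute only the first h actions
            obs = env.step(a)
            executed += 1
            if executed >= n_steps:
                break
    return executed

env = ToyEnv()
steps = control_loop(env)
```

The key structural point: the denoiser runs once per $h$ control steps, not once per step.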

💡 Action chunking is analogous to how humans plan: you don't decide each muscle twitch independently. When reaching for a cup, you plan the entire reaching motion as one smooth trajectory, then adjust as you go.

Training & Inference

Training Diffusion Policy is straightforward. Given a demonstration dataset of (observation, action chunk) pairs:

  1. Sample a batch of clean action chunks $A_t^0$ from the dataset
  2. Sample random diffusion timesteps $k \sim \text{Uniform}(1, K)$ and noise $\epsilon \sim \mathcal{N}(0, I)$
  3. Create noisy actions: $A_t^k = \sqrt{\bar{\alpha}_k}\, A_t^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$
  4. Predict noise: $\hat{\epsilon} = \epsilon_\theta(A_t^k, k, O_t)$
  5. Minimise: $\mathcal{L} = \|\epsilon - \hat{\epsilon}\|^2$
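These steps map almost line-for-line onto code. A minimal NumPy sketch, with the network passed in as a callable (stubbed out in the usage below) and an illustrative schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100                                                  # illustrative schedule
betas = np.linspace(1e-4, 0.02, K)
alpha_bar = np.cumprod(1.0 - betas)

def training_loss(A0, obs, eps_theta):
    """One DDPM training step for a batch of action chunks A0 with shape (B, H, d)."""
    B = A0.shape[0]
    k = rng.integers(0, K, size=B)                       # 2. random timesteps...
    eps = rng.normal(size=A0.shape)                      #    ...and noise
    ab = alpha_bar[k][:, None, None]
    A_k = np.sqrt(ab) * A0 + np.sqrt(1.0 - ab) * eps     # 3. corrupt the clean chunks
    eps_hat = eps_theta(A_k, k, obs)                     # 4. predict the noise
    return np.mean((eps - eps_hat) ** 2)                 # 5. MSE between true and predicted

# With a zero-predicting stub, the loss is just the mean squared noise (~1):
loss = training_loss(np.zeros((8, 16, 7)), None, lambda A, k, o: np.zeros_like(A))
```

In a real pipeline `eps_theta` is the conditioned denoiser and this scalar is backpropagated through it.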

At inference time, standard DDPM requires $K$ denoising steps (typically $K = 100$). For real-time robot control, this is often too slow. DDIM (Denoising Diffusion Implicit Models) [3] provides a deterministic, accelerated sampling procedure that can produce good results with as few as 10-20 steps:

$$A_t^{k-1} = \sqrt{\bar{\alpha}_{k-1}} \left( \frac{A_t^k - \sqrt{1-\bar{\alpha}_k}\, \epsilon_\theta}{\sqrt{\bar{\alpha}_k}} \right) + \sqrt{1-\bar{\alpha}_{k-1}}\, \epsilon_\theta$$

DDIM skips intermediate steps while maintaining sample quality. In practice, Diffusion Policy with DDIM achieves inference times of ~50-100 ms per action chunk — fast enough for 10 Hz control when combined with action chunking.
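The DDIM update translates directly into code. The schedule is illustrative, `eps_hat` would come from the trained denoiser, and the `k_prev = -1` convention for "fully denoised" is an implementation choice of this sketch:

```python
import numpy as np

K = 100                                   # illustrative schedule, as in the DDPM recap
betas = np.linspace(1e-4, 0.02, K)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(A_k, k, k_prev, eps_hat):
    """Deterministic DDIM update from diffusion step k down to k_prev < k.

    k_prev = -1 means fully denoised (alpha_bar treated as 1)."""
    ab_k = alpha_bar[k]
    ab_prev = alpha_bar[k_prev] if k_prev >= 0 else 1.0
    x0_pred = (A_k - np.sqrt(1.0 - ab_k) * eps_hat) / np.sqrt(ab_k)  # predicted clean chunk
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_hat

# A 10-step subsampled schedule instead of all 100 steps:
schedule = list(range(K - 1, -1, -K // 10)) + [-1]
```

A sanity check on the formula: if `eps_hat` is exactly the noise that corrupted a clean chunk, one jump to `k_prev = -1` recovers the original chunk.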

On benchmark tasks (Push-T, robotic manipulation), Diffusion Policy significantly outperformed prior methods including behaviour cloning with MLP/GMM heads, IBC (Implicit Behaviour Cloning) [4], and BeT (Behaviour Transformers) [5], particularly on tasks with multi-modal action distributions.

Quiz

Test your understanding of Diffusion Policy and DDPM.

What problem arises when an autoregressive model is trained on demonstrations showing multiple valid grasping strategies?

What does the DDPM training loss minimise?

What is the shape of Diffusion Policy's output at each denoising step?

What is receding-horizon execution in the context of Diffusion Policy?

Why is DDIM used instead of standard DDPM at inference time for Diffusion Policy?