Why Diffusion for Actions?

The discrete tokenization approach (RT-2, OpenVLA) is elegant but makes a strong assumption: for any given observation and instruction, there is essentially one best action. The model picks the highest-probability bin for each dimension and moves on.

In practice, robotic tasks are often multi-modal — there are multiple equally valid ways to accomplish the same goal. Consider picking up a mug: you could approach from the left, the right, or from above. You could grasp the handle or the body. Each is a perfectly valid action, but they are very different in action space.

If the training data contains demonstrations of all these strategies, an autoregressive model faces a problem: it may try to average between modes, producing an action that falls between two valid grasps — and is itself invalid (e.g., reaching for empty space between the handle and the body).

Diffusion models are designed to handle exactly this: they can represent complex, multi-modal distributions and sample from them. Instead of predicting a single action, they generate actions by iteratively denoising random noise into a coherent action, naturally capturing the full distribution of valid behaviours.

💡 A useful analogy: autoregressive models are like a GPS that picks one route. Diffusion models are like a map that shows all possible routes and lets you sample one — naturally handling the case where multiple routes are equally good.

DDPM Recap

Diffusion Policy [1] is built on Denoising Diffusion Probabilistic Models (DDPM) [2]. Before diving into the robotic application, let's quickly review how DDPM works.

The core idea has two phases:

Forward process (adding noise): Start with a clean data sample $x_0$ (in our case, an action sequence) and progressively add Gaussian noise over $T$ steps according to a variance schedule $\beta_1, \ldots, \beta_T$:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I)$$

A useful property lets us jump directly to any timestep without iterating:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$. At $t = T$, $x_T$ is approximately pure Gaussian noise.
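The closed-form jump can be written in a few lines of NumPy. The linear variance schedule and the chunk shape below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear variance schedule (not the paper's exact choice).
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)          # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    """Jump straight to noise level t: x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.normal(size=(16, 7))                # e.g. a 16-step chunk of 7-DoF actions
eps = rng.normal(size=x0.shape)
x_noisy = q_sample(x0, T - 1, eps)           # heavily corrupted sample
```

Because $\bar{\alpha}_t$ is a running product of factors below one, it decreases monotonically — the signal fraction shrinks as $t$ grows.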

Reverse process (removing noise): Train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ that was added at each step. Given a noisy sample $x_t$, the model predicts the noise, and we take a small step toward the clean data:

$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

where $z \sim \mathcal{N}(0, I)$ and $\sigma_t$ is the noise scale for step $t$. Starting from pure noise $x_T \sim \mathcal{N}(0, I)$, we iterate this reverse step $T$ times to arrive at a clean sample $x_0$.
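The reverse update is just a loop over that equation. In this standalone sketch the trained network is replaced by a zero-returning stub; a real $\epsilon_\theta$ would be a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                       # illustrative, shorter than typical
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Stand-in for the trained denoiser; a real model predicts the added noise.
    return np.zeros_like(x_t)

x = rng.normal(size=(16, 7))                 # x_T ~ N(0, I)
for t in reversed(range(T)):
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (x - coef * eps_theta(x, t)) / np.sqrt(1.0 - betas[t])
    z = rng.normal(size=x.shape) if t > 0 else 0.0   # no noise on the final step
    x = mean + np.sqrt(betas[t]) * z
```

Note the conventional detail that no noise is added on the very last step, so the output is the mean of the final reverse distribution.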

The training objective is simple — minimise the mean squared error between the predicted and actual noise:

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]$$

Intuitively: we take clean data, corrupt it with random noise, and train the model to undo the corruption. The model becomes a general-purpose denoiser that can transform random noise into data.

Diffusion Policy Architecture

Diffusion Policy applies DDPM to robot action prediction. Instead of generating images, the diffusion model generates action sequences — specifically, a chunk of $H$ future actions conditioned on the current observation.

The model takes as input:

  • Noisy action chunk: $A_t^k \in \mathbb{R}^{H \times d}$ — a sequence of $H$ actions, each $d$-dimensional, at diffusion step $k$
  • Diffusion timestep: $k \in \{1, \ldots, K\}$ indicating the noise level
  • Observation features: $O_t$ — visual features from a CNN or ViT encoder, possibly combined with proprioceptive state (joint angles)

And outputs a noise prediction $\hat{\epsilon} \in \mathbb{R}^{H \times d}$. The denoiser architecture can be either:

  • 1D temporal U-Net: Treats the action chunk as a 1D signal along the time axis. Convolutional layers capture local temporal patterns, while skip connections preserve fine-grained details. The observation features are injected via FiLM conditioning. This is the more compute-efficient option.
  • Transformer: Treats each action timestep as a token. Observation features and the diffusion timestep are prepended as additional tokens. Self-attention captures long-range temporal dependencies. This scales better but is more expensive.
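FiLM conditioning is simple enough to show directly: the observation embedding is mapped to per-channel scale and shift parameters that modulate the U-Net's convolutional activations. The shapes and parameter names below are illustrative, not the reference implementation:

```python
import numpy as np

def film(features, obs_emb, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM: modulate conv features channel-wise using the observation embedding.

    features: (C, L) activations over the action-time axis
    obs_emb:  (D,)   observation feature vector
    """
    gamma = W_gamma @ obs_emb + b_gamma      # (C,) per-channel scale
    beta = W_beta @ obs_emb + b_beta         # (C,) per-channel shift
    return gamma[:, None] * features + beta[:, None]

C, L, D = 64, 16, 128                        # channels, chunk length, embedding dim
rng = np.random.default_rng(0)
out = film(rng.normal(size=(C, L)), rng.normal(size=D),
           rng.normal(size=(C, D)), np.zeros(C),
           rng.normal(size=(C, D)), np.zeros(C))
```

In the actual architecture the projection weights are learned per U-Net block, so each resolution level gets its own view of the observation.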

The conditioned denoising process can be written as:

$$A_t^{k-1} = \text{Denoise}(A_t^k,\, k,\, O_t;\, \theta)$$

Starting from $A_t^K \sim \mathcal{N}(0, I)$ and iterating $K$ times produces the clean action chunk $A_t^0 \in \mathbb{R}^{H \times d}$.

The Action Chunk Idea

One of the most important innovations in Diffusion Policy is action chunking — predicting not just the next single action, but an entire sequence of $H$ future actions at once.

Why predict a chunk instead of a single action?

  • Temporal consistency: Single-step prediction can produce jerky trajectories — each step is predicted independently with no guarantee of smoothness. By predicting $H$ steps at once, the model learns to produce smooth, coherent trajectories.
  • Multi-modality over trajectories: With single-step prediction, multi-modality manifests as oscillation between modes (step left, step right, step left...). With chunks, the model commits to a complete strategy for $H$ steps, avoiding incoherent mode-switching.
  • Amortised computation: Instead of running $K$ denoising steps for every single action, we run $K$ steps once and get $H$ actions. If $H = 16$ and the control frequency is 10 Hz, one denoising pass covers 1.6 seconds of future actions.

In practice, not all $H$ predicted actions are executed. A common strategy is receding-horizon execution: execute only the first $h < H$ actions (e.g., $h = 8$), then re-predict a new chunk from the updated observation. This gives a balance between temporal consistency (from the chunk) and reactivity (from frequent re-planning).
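The execute-then-re-plan loop can be sketched as follows. `ToyEnv` and `predict_chunk` are stand-ins for a real environment and a real denoising pass, introduced only for illustration:

```python
class ToyEnv:
    """Minimal stand-in environment (illustrative, not a real robot API)."""
    def __init__(self):
        self.state = 0.0
    def reset(self):
        self.state = 0.0
        return self.state
    def step(self, action):
        self.state += action
        return self.state

H, h = 16, 8                            # chunk length and executed prefix

def predict_chunk(obs):
    # Stand-in for one full K-step denoising pass conditioned on obs.
    return [0.1] * H

def control_loop(env, n_steps=32):
    """Receding-horizon execution: predict H actions, run the first h, re-plan."""
    obs = env.reset()
    executed = 0
    while executed < n_steps:
        chunk = predict_chunk(obs)      # full H-step plan
        for a in chunk[:h]:             # execute only the first h actions
            obs = env.step(a)
            executed += 1
            if executed >= n_steps:
                break
    return executed

env = ToyEnv()
steps = control_loop(env)
```

The key structural point: the denoiser runs once per $h$ control steps, not once per step.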

💡 Action chunking is analogous to how humans plan: you don't decide each muscle twitch independently. When reaching for a cup, you plan the entire reaching motion as one smooth trajectory, then adjust as you go.

Training & Inference

Training Diffusion Policy is straightforward. Given a demonstration dataset of (observation, action chunk) pairs:

  1. Sample a batch of clean action chunks $A_t^0$ from the dataset
  2. Sample random diffusion timesteps $k \sim \text{Uniform}(1, K)$ and noise $\epsilon \sim \mathcal{N}(0, I)$
  3. Create noisy actions: $A_t^k = \sqrt{\bar{\alpha}_k}\, A_t^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$
  4. Predict noise: $\hat{\epsilon} = \epsilon_\theta(A_t^k, k, O_t)$
  5. Minimise: $\mathcal{L} = \|\epsilon - \hat{\epsilon}\|^2$
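These steps map almost line-for-line onto code. A minimal NumPy sketch, with the network passed in as a callable (stubbed out in the usage below) and an illustrative schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100                                                  # illustrative schedule
betas = np.linspace(1e-4, 0.02, K)
alpha_bar = np.cumprod(1.0 - betas)

def training_loss(A0, obs, eps_theta):
    """One DDPM training step for a batch of action chunks A0 with shape (B, H, d)."""
    B = A0.shape[0]
    k = rng.integers(0, K, size=B)                       # 2. random timesteps...
    eps = rng.normal(size=A0.shape)                      #    ...and noise
    ab = alpha_bar[k][:, None, None]
    A_k = np.sqrt(ab) * A0 + np.sqrt(1.0 - ab) * eps     # 3. corrupt the clean chunks
    eps_hat = eps_theta(A_k, k, obs)                     # 4. predict the noise
    return np.mean((eps - eps_hat) ** 2)                 # 5. MSE between true and predicted

# With a zero-predicting stub, the loss is just the mean squared noise (~1):
loss = training_loss(np.zeros((8, 16, 7)), None, lambda A, k, o: np.zeros_like(A))
```

In a real pipeline `eps_theta` is the conditioned denoiser and this scalar is backpropagated through it.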

At inference time, standard DDPM requires $K$ denoising steps (typically $K = 100$). For real-time robot control, this is often too slow. DDIM (Denoising Diffusion Implicit Models) [3] provides a deterministic, accelerated sampling procedure that can produce good results with as few as 10-20 steps:

$$A_t^{k-1} = \sqrt{\bar{\alpha}_{k-1}} \left( \frac{A_t^k - \sqrt{1-\bar{\alpha}_k}\, \epsilon_\theta}{\sqrt{\bar{\alpha}_k}} \right) + \sqrt{1-\bar{\alpha}_{k-1}}\, \epsilon_\theta$$

DDIM skips intermediate steps while maintaining sample quality. In practice, Diffusion Policy with DDIM achieves inference times of ~50-100 ms per action chunk — fast enough for 10 Hz control when combined with action chunking.
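The DDIM update translates directly into code. The schedule is illustrative, `eps_hat` would come from the trained denoiser, and the `k_prev = -1` convention for "fully denoised" is an implementation choice of this sketch:

```python
import numpy as np

K = 100                                   # illustrative schedule, as in the DDPM recap
betas = np.linspace(1e-4, 0.02, K)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(A_k, k, k_prev, eps_hat):
    """Deterministic DDIM update from diffusion step k down to k_prev < k.

    k_prev = -1 means fully denoised (alpha_bar treated as 1)."""
    ab_k = alpha_bar[k]
    ab_prev = alpha_bar[k_prev] if k_prev >= 0 else 1.0
    x0_pred = (A_k - np.sqrt(1.0 - ab_k) * eps_hat) / np.sqrt(ab_k)  # predicted clean chunk
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_hat

# A 10-step subsampled schedule instead of all 100 steps:
schedule = list(range(K - 1, -1, -K // 10)) + [-1]
```

A sanity check on the formula: if `eps_hat` is exactly the noise that corrupted a clean chunk, one jump to `k_prev = -1` recovers the original chunk.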

On benchmark tasks (Push-T, robotic manipulation), Diffusion Policy significantly outperformed prior methods including behaviour cloning with MLP/GMM heads, IBC (Implicit Behaviour Cloning) [4], and BeT (Behaviour Transformers) [5], particularly on tasks with multi-modal action distributions.

Quiz

Test your understanding of Diffusion Policy and DDPM.

What problem arises when an autoregressive model is trained on demonstrations showing multiple valid grasping strategies?

What does the DDPM training loss minimise?

What is the shape of Diffusion Policy's output at each denoising step?

What is receding-horizon execution in the context of Diffusion Policy?

Why is DDIM used instead of standard DDPM at inference time for Diffusion Policy?