Why Diffusion for Actions?
The discrete tokenization approach (RT-2, OpenVLA) is elegant but makes a strong assumption: for any given observation and instruction, there is essentially one best action. In practice, the model typically picks the highest-probability bin for each dimension and moves on.
Real robotic tasks, however, are often multi-modal — there are multiple equally valid ways to accomplish the same goal. Consider picking up a mug: you could approach from the left, the right, or from above. You could grasp the handle or the body. Each is a perfectly valid action, yet they are very different in action space.
If the training data contains demonstrations of all these strategies, an autoregressive model faces a problem: it may try to average between modes, producing an action that falls between two valid grasps — and is itself invalid (e.g., reaching for empty space between the handle and the body).
Diffusion models are designed to handle exactly this: they can represent complex, multi-modal distributions and sample from them. Instead of predicting a single action, they generate actions by iteratively denoising random noise into a coherent action, naturally capturing the full distribution of valid behaviours.
DDPM Recap
Diffusion Policy (Chi et al., 2024) is built on Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020). Before diving into the robotic application, let's quickly review how DDPM works.
The core idea has two phases:
Forward process (adding noise): Start with a clean data sample $x_0$ (in our case, an action sequence) and progressively add Gaussian noise over $T$ steps according to a variance schedule $\beta_1, \ldots, \beta_T$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\bigr)$$
A useful property lets us jump directly to any timestep without iterating:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
where $\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)$. At $t = T$, $x_T$ is approximately pure Gaussian noise.
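The closed-form jump is easy to sketch in NumPy. This is a toy illustration — the schedule endpoints, number of steps, and shapes below are arbitrary choices for demonstration, not values from the paper:

```python
import numpy as np

def make_alpha_bar(K=100, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule beta_1..beta_K and the cumulative
    # product alpha_bar_k = prod_{s<=k} (1 - beta_s).
    betas = np.linspace(beta_start, beta_end, K)
    return np.cumprod(1.0 - betas)

def q_sample(x0, k, alpha_bar, rng):
    # Jump straight to noise level k:
    #   x_k = sqrt(abar_k) * x0 + sqrt(1 - abar_k) * eps
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[k]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((16, 7))                # e.g. an H=16, d=7 action chunk
x_noisy, eps = q_sample(x0, 99, alpha_bar, rng)  # deepest noise level
```

Note that `alpha_bar` decreases monotonically toward zero, so larger `k` means the sample carries less signal and more noise.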
Reverse process (removing noise): Train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ that was added at each step. Given a noisy sample $x_t$, the model predicts the noise, and we take a small step toward the clean data:

$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$$
where $z \sim \mathcal{N}(0, I)$ and $\sigma_t$ is the noise scale for step $t$. Starting from pure noise $x_T \sim \mathcal{N}(0, I)$, we iterate this reverse step $T$ times to arrive at a clean sample $x_0$.
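The reverse loop can be sketched as follows, using the common choice $\sigma_t^2 = \beta_t$ and a dummy zero-noise "model" just to exercise the loop — a minimal sketch, not the authors' implementation:

```python
import numpy as np

def ddpm_reverse(eps_model, shape, betas, rng):
    # DDPM ancestral sampling: start from pure noise x_T ~ N(0, I)
    # and apply the learned denoising step T times.
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, t)          # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) \
               / np.sqrt(alphas[t])
        z = rng.standard_normal(shape) if t > 0 else 0.0  # no noise on last step
        x = mean + np.sqrt(betas[t]) * z   # sigma_t^2 = beta_t choice
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
# Placeholder model that always predicts zero noise.
sample = ddpm_reverse(lambda x, t: np.zeros_like(x), (16, 7), betas, rng)
```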
The training objective is simple — minimise the mean squared error between the predicted and actual noise:

$$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
Intuitively: we take clean data, corrupt it with random noise, and train the model to undo the corruption. The model becomes a general-purpose denoiser that can transform random noise into data.
Diffusion Policy Architecture
Diffusion Policy applies DDPM to robot action prediction. Instead of generating images, the diffusion model generates action sequences — specifically, a chunk of $H$ future actions conditioned on the current observation.
The model takes as input:
- Noisy action chunk: $A_t^k \in \mathbb{R}^{H \times d}$ — a sequence of $H$ actions, each $d$-dimensional, at diffusion step $k$
- Diffusion timestep: $k \in \{1, \ldots, K\}$ indicating the noise level
- Observation features: $O_t$ — visual features from a CNN or ViT encoder, possibly combined with proprioceptive state (joint angles)
And outputs a noise prediction $\hat{\epsilon} \in \mathbb{R}^{H \times d}$. The denoiser architecture can be either:
- 1D temporal U-Net: Treats the action chunk as a 1D signal along the time axis. Convolutional layers capture local temporal patterns, while skip connections preserve fine-grained details. The observation features are injected via FiLM conditioning. This is the more compute-efficient option.
- Transformer: Treats each action timestep as a token. Observation features and the diffusion timestep are prepended as additional tokens. Self-attention captures long-range temporal dependencies. This scales better but is more expensive.
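To make the FiLM conditioning mentioned above concrete: the observation embedding is mapped (by learned linear layers) to a per-channel scale and shift that modulate the denoiser's intermediate features. A minimal NumPy sketch with made-up dimensions — the real implementation applies this inside each U-Net block:

```python
import numpy as np

def film(features, cond, W_gamma, b_gamma, W_beta, b_beta):
    # FiLM: map the conditioning vector to a per-channel scale (gamma)
    # and shift (beta), then modulate the features with them.
    gamma = cond @ W_gamma + b_gamma   # shape (C,)
    beta = cond @ W_beta + b_beta      # shape (C,)
    return gamma * features + beta     # broadcasts over the time axis

rng = np.random.default_rng(0)
H, C, obs_dim = 16, 64, 128
feats = rng.standard_normal((H, C))    # temporal conv features
obs = rng.standard_normal(obs_dim)     # observation embedding
out = film(feats, obs,
           rng.standard_normal((obs_dim, C)), np.zeros(C),
           rng.standard_normal((obs_dim, C)), np.zeros(C))
```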
The conditioned denoising process can be written as:

$$A_t^{k-1} = \frac{1}{\sqrt{1-\beta_k}}\left(A_t^k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta(A_t^k, k, O_t)\right) + \sigma_k z$$
Starting from $A_t^K \sim \mathcal{N}(0, I)$ and iterating $K$ times produces the clean action chunk $A_t^0 \in \mathbb{R}^{H \times d}$.
The Action Chunk Idea
One of the most important innovations in Diffusion Policy is action chunking — predicting not just the next single action, but an entire sequence of $H$ future actions at once.
Why predict a chunk instead of a single action?
- Temporal consistency: Single-step prediction can produce jerky trajectories — each step is predicted independently with no guarantee of smoothness. By predicting $H$ steps at once, the model learns to produce smooth, coherent trajectories.
- Multi-modality over trajectories: With single-step prediction, multi-modality manifests as oscillation between modes (step left, step right, step left...). With chunks, the model commits to a complete strategy for $H$ steps, avoiding incoherent mode-switching.
- Amortised computation: Instead of running $K$ denoising steps for every single action, we run $K$ steps once and get $H$ actions. If $H = 16$ and the control frequency is 10 Hz, one denoising pass covers 1.6 seconds of future actions.
In practice, not all $H$ predicted actions are executed. A common strategy is receding-horizon execution: execute only the first $h < H$ actions (e.g., $h = 8$), then re-predict a new chunk from the updated observation. This gives a balance between temporal consistency (from the chunk) and reactivity (from frequent re-planning).
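The receding-horizon control loop is simple plumbing around the policy. A sketch with stubbed-in policy, sensor, and actuator functions (all hypothetical placeholders):

```python
import numpy as np

def receding_horizon_control(predict_chunk, get_obs, execute, n_steps, H=16, h=8):
    # Predict a chunk of H actions, execute only the first h,
    # then re-predict from the fresh observation.
    executed = []
    while len(executed) < n_steps:
        obs = get_obs()
        chunk = predict_chunk(obs)     # (H, d) action chunk
        for a in chunk[:h]:
            execute(a)
            executed.append(a)
            if len(executed) >= n_steps:
                break
    return np.array(executed)

# Toy plumbing: a fake policy returning zero actions, a no-op actuator.
acts = receding_horizon_control(
    predict_chunk=lambda obs: np.zeros((16, 7)),
    get_obs=lambda: None,
    execute=lambda a: None,
    n_steps=20,
)
```

With $h = 8$ and 20 control steps, the policy is queried three times; each query amortises one (expensive) denoising pass over eight executed actions.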
Training & Inference
Training Diffusion Policy is straightforward. Given a demonstration dataset of (observation, action chunk) pairs:
1. Sample a batch of clean action chunks $A_t^0$ from the dataset
2. Sample random diffusion timesteps $k \sim \text{Uniform}(1, K)$ and noise $\epsilon \sim \mathcal{N}(0, I)$
3. Create noisy actions: $A_t^k = \sqrt{\bar{\alpha}_k}\, A_t^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$
4. Predict noise: $\hat{\epsilon} = \epsilon_\theta(A_t^k, k, O_t)$
5. Minimise: $\mathcal{L} = \|\epsilon - \hat{\epsilon}\|^2$
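The five steps above fit in one function. A NumPy sketch with a dummy denoiser standing in for $\epsilon_\theta$ — shapes and schedule are illustrative, not the paper's hyperparameters:

```python
import numpy as np

def training_loss(eps_model, A0, obs, alpha_bar, rng):
    # One DDPM training step for action chunks:
    # sample k and eps, noise the clean chunk, predict, take the MSE.
    B = A0.shape[0]
    k = rng.integers(1, len(alpha_bar), size=B)          # step 2: random timesteps
    eps = rng.standard_normal(A0.shape)                  # step 2: random noise
    ab = alpha_bar[k][:, None, None]
    A_k = np.sqrt(ab) * A0 + np.sqrt(1.0 - ab) * eps     # step 3: noise the chunk
    eps_hat = eps_model(A_k, k, obs)                     # step 4: predict noise
    return np.mean((eps - eps_hat) ** 2)                 # step 5: MSE

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))
A0 = rng.standard_normal((8, 16, 7))  # step 1: batch of 8 clean (H=16, d=7) chunks
loss = training_loss(lambda A, k, o: np.zeros_like(A), A0, None, alpha_bar, rng)
```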
At inference time, standard DDPM requires $K$ denoising steps (typically $K = 100$). For real-time robot control, this is often too slow. DDIM (Denoising Diffusion Implicit Models) (Song et al., 2021) provides a deterministic, accelerated sampling procedure that can produce good results with as few as 10-20 steps:

$$A_t^{k-1} = \sqrt{\bar{\alpha}_{k-1}}\left(\frac{A_t^k - \sqrt{1-\bar{\alpha}_k}\,\epsilon_\theta}{\sqrt{\bar{\alpha}_k}}\right) + \sqrt{1-\bar{\alpha}_{k-1}}\,\epsilon_\theta$$
This formula has two sub-expressions, each with a clear role. The first term, $\sqrt{\bar{\alpha}_{k-1}} \bigl(\frac{A_t^k - \sqrt{1-\bar{\alpha}_k}\,\epsilon_\theta}{\sqrt{\bar{\alpha}_k}}\bigr)$, is the predicted clean signal re-scaled to noise level $k{-}1$. The inner fraction strips the noise from $A_t^k$ using the model's noise prediction $\epsilon_\theta$, recovering an estimate of the clean action chunk $A_t^0$. Multiplying by $\sqrt{\bar{\alpha}_{k-1}}$ then re-adds the correct (smaller) amount of signal scaling for step $k{-}1$. The second term, $\sqrt{1-\bar{\alpha}_{k-1}}\,\epsilon_\theta$, re-injects the predicted noise component at the target noise level. Together, the two terms reconstruct $A_t^{k-1}$ by blending the estimated clean signal and noise at the proportions appropriate for step $k{-}1$ — jumping directly from noise level $k$ to $k{-}1$ without needing the intermediate steps that DDPM requires. Because the noise prediction $\epsilon_\theta$ is reused deterministically (no fresh randomness $z$ is sampled), DDIM is a deterministic mapping from noise to data, which is why it can safely skip steps.
DDIM skips intermediate steps while maintaining sample quality. In practice, Diffusion Policy with DDIM achieves inference times of ~50-100 ms per action chunk — fast enough for 10 Hz control when combined with action chunking.
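A single DDIM update is short enough to write out directly. A NumPy sketch, here jumping from level 99 to level 49 in one step with a dummy zero-noise prediction in place of the real model:

```python
import numpy as np

def ddim_step(A_k, eps_hat, k, k_prev, alpha_bar):
    # Deterministic DDIM update from noise level k to k_prev (< k):
    #   A0_hat  = (A_k - sqrt(1-abar_k) * eps_hat) / sqrt(abar_k)
    #   A_prev  = sqrt(abar_prev) * A0_hat + sqrt(1-abar_prev) * eps_hat
    ab_k = alpha_bar[k]
    ab_prev = alpha_bar[k_prev] if k_prev >= 0 else 1.0
    A0_hat = (A_k - np.sqrt(1.0 - ab_k) * eps_hat) / np.sqrt(ab_k)
    return np.sqrt(ab_prev) * A0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))
A = rng.standard_normal((16, 7))   # current noisy action chunk
A_prev = ddim_step(A, np.zeros_like(A), 99, 49, alpha_bar)
```

Because no fresh noise $z$ is sampled, the same noise prediction can bridge many levels at once, which is exactly what makes 10-20 step sampling possible.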
On benchmark tasks (Push-T, robotic manipulation), Diffusion Policy significantly outperformed prior methods in the evaluations reported by the authors, including behaviour cloning with MLP/GMM heads, IBC (Implicit Behaviour Cloning) (Florence et al., 2022), and BeT (Behaviour Transformers) (Shafiullah et al., 2022), particularly on tasks with multi-modal action distributions.
Quiz
Test your understanding of Diffusion Policy and DDPM.
What problem arises when an autoregressive model is trained on demonstrations showing multiple valid grasping strategies?
What does the DDPM training loss minimise?
What is the shape of Diffusion Policy's output at each denoising step?
What is receding-horizon execution in the context of Diffusion Policy?
Why is DDIM used instead of standard DDPM at inference time for Diffusion Policy?