From Diffusion to Flow

Diffusion Policy showed that denoising-based generation is a powerful framework for robotic action prediction. But DDPM has a fundamental inefficiency: the paths it takes through data space are curved. The forward process spirals outward from data to noise, and the reverse process must carefully retrace these curved paths — requiring many small steps (50-100) to produce high-quality samples.

Flow matching asks a natural question: what if we could take straight-line paths instead? If we learn a velocity field that transports samples from noise to data along straight lines, we should need far fewer steps to traverse the same distance.

This is not just a theoretical nicety. In robotics, where actions must be generated in real-time (often at 10+ Hz), reducing the number of inference steps from 50-100 to 5-10 can make the difference between a feasible and infeasible system.

💡 Think of it this way: DDPM is like navigating with winding country roads — you'll get there, but it takes many turns. Flow matching is like taking a highway — a straight shot from A to B.

Conditional Flow Matching

Conditional Flow Matching (CFM) [1] [2] provides a simulation-free framework for learning continuous normalising flows. The key idea is elegant: define a simple, straight-line path between noise and data, then learn a velocity field that follows it.

We define a time-dependent interpolation between a noise sample $x_0 \sim \mathcal{N}(0, I)$ and a data sample $x_1$ from our dataset:

$$x_t = (1 - t)\, x_0 + t\, x_1, \quad t \in [0, 1]$$

This is simply linear interpolation: at $t = 0$ we have pure noise, at $t = 1$ we have clean data, and in between we have a weighted blend. The path from $x_0$ to $x_1$ is a straight line in data space.
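As a quick numerical sanity check, the interpolation can be written directly. A minimal NumPy sketch (`interp` is an illustrative helper, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)        # noise sample ~ N(0, I)
x1 = np.array([1.0, 2.0, 3.0])     # hypothetical data sample

def interp(x0, x1, t):
    # linear interpolation: x_t = (1 - t) * x0 + t * x1
    return (1.0 - t) * x0 + t * x1

print(np.allclose(interp(x0, x1, 0.0), x0))  # → True (pure noise at t = 0)
print(np.allclose(interp(x0, x1, 1.0), x1))  # → True (clean data at t = 1)
```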

The velocity along this path — how fast and in what direction we need to move — is constant and trivially computable:

$$v_{\text{target}} = \frac{dx_t}{dt} = x_1 - x_0$$

This is the direction from noise to data — a single vector, constant in time. Our neural network $v_\theta(x_t, t)$ is trained to predict this velocity at any point $(x_t, t)$ along the path.

The training loss is simply the mean squared error between the predicted and target velocities:

$$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t \sim U(0,1),\, x_0 \sim \mathcal{N}(0,I),\, x_1 \sim p_{\text{data}}} \left[ \| v_\theta(x_t, t) - (x_1 - x_0) \|^2 \right]$$
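A training step then reduces to a few lines. The sketch below (plain NumPy, with a hypothetical toy data batch `x1`) builds the interpolant and target velocity exactly as in the loss above, and checks that a predictor matching the target achieves zero loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_batch(x1, rng):
    """One CFM batch: sample t and noise, build x_t and the target velocity.
    x1: (B, D) batch of data samples."""
    B, D = x1.shape
    x0 = rng.standard_normal((B, D))        # noise samples ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(B, 1))  # t ~ U(0, 1), broadcast over D
    xt = (1.0 - t) * x0 + t * x1            # linear interpolation path
    v_target = x1 - x0                      # constant target velocity
    return xt, t, v_target

def cfm_loss(v_pred, v_target):
    # mean squared error between predicted and target velocities
    return np.mean(np.sum((v_pred - v_target) ** 2, axis=-1))

# toy check: a predictor that outputs the exact target has zero loss
x1 = rng.standard_normal((8, 2)) + 3.0      # hypothetical "data" batch
xt, t, v_tgt = cfm_training_batch(x1, rng)
print(cfm_loss(v_tgt, v_tgt))               # → 0.0
```

In practice `v_pred` would come from a neural network $v_\theta(x_t, t)$; only the batch construction and loss are shown here.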

Compare this to DDPM's loss $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$. They look almost identical — but the difference is critical. DDPM predicts noise (which tells you what to subtract), while flow matching predicts velocity (which tells you where to go). The velocity formulation leads to straighter paths and faster convergence.

At inference time, we solve an ODE starting from noise $x_0 \sim \mathcal{N}(0, I)$:

$$\frac{dx_t}{dt} = v_\theta(x_t, t)$$

Using a simple Euler solver with step size $\Delta t$:

$$x_{t + \Delta t} = x_t + \Delta t \cdot v_\theta(x_t, t)$$

With 5-10 Euler steps (i.e., $\Delta t = 0.1$ to $0.2$), flow matching produces samples of comparable quality to DDPM with 50-100 steps. This is because the learned velocity field is nearly constant along each path (since the true velocity is constant), so large steps introduce little error.
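To see why large steps are viable, consider a toy case (an assumption for illustration, not a trained network): if the data distribution is a point mass at $\mu$, the optimal velocity field has the closed form $v(x, t) = (\mu - x)/(1 - t)$, and fixed-step Euler integration recovers $\mu$ exactly even with few steps:

```python
import numpy as np

def euler_sample(v_field, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for n in range(n_steps):
        x = x + dt * v_field(x, n * dt)
    return x

# Toy velocity field: data is a point mass at mu, so the optimal CFM
# velocity given x_t is v(x, t) = (mu - x) / (1 - t).
mu = np.array([2.0, -1.0])
def v_optimal(x, t):
    return (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(2)                    # start from noise
x_end = euler_sample(v_optimal, x0, n_steps=5)
print(np.allclose(x_end, mu))                  # → True
```

With a learned $v_\theta$ the field is only approximately straight, so a handful of steps introduces some error — but far less than for the curved paths of DDPM.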

📌 Technical note: CFM is closely related to "rectified flows" and can be seen as a special case of continuous normalising flows. The key insight — using straight-line interpolation paths — simplifies both training and inference compared to general flow matching formulations.

π₀ (pi-zero)

π₀ [3] from Physical Intelligence is the first VLA to combine a large vision-language model backbone with a flow matching action head. It demonstrates strong performance across a remarkable range of tasks: folding laundry, assembling boxes, cleaning tables, and bussing dishes.

The architecture has three key components:

  • VLM backbone: A pre-trained PaliGemma [4] model (3B parameters) that processes image tokens and language instruction tokens, providing the visual-linguistic understanding — "what am I looking at and what does the user want?"
  • Action expert: A set of dedicated transformer parameters (separate from the VLM) that process action chunk tokens. The action expert handles the denoising — "given the current noise level and what I understand about the scene, what should the clean action be?"
  • Flow matching head: Replaces the discrete tokenization of RT-2/OpenVLA with continuous action generation via the CFM framework. The action expert predicts the velocity field $v_\theta(A_t, t, O)$ that transports noisy action chunks toward clean ones.

A critical design choice is the separation of the action expert from the VLM backbone. During training, the VLM backbone parameters are shared between two objectives:

  • Language objective: Standard next-token prediction on text, preserving the VLM's linguistic capabilities
  • Action objective: Flow matching loss on action chunks, with gradients flowing through the action expert and into the VLM backbone

The action expert has its own set of transformer layers that interleave with the VLM backbone's layers via cross-attention. The action chunk (as a sequence of noisy action tokens) attends to the VLM's visual and language representations to extract the conditioning information it needs.

The π₀ action generation process at inference:

  1. Encode the image and language instruction through the VLM backbone.
  2. Sample a noisy action chunk $A^0 \sim \mathcal{N}(0, I) \in \mathbb{R}^{H \times d}$.
  3. Run $N$ Euler steps (typically $N = 10$):
$$A^{n+1} = A^n + \frac{1}{N} \cdot v_\theta(A^n, \tfrac{n}{N}, O)$$
  4. The final $A^N$ is the predicted action chunk. Execute the first $h$ actions and repeat.
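The loop above can be sketched end to end. Here `v_theta` is only a stand-in for the action expert — the `obs` dictionary, its `goal` entry, and the velocity rule are hypothetical placeholders; a real system would run the VLM backbone and action-expert transformer at this point:

```python
import numpy as np

H, d, N = 16, 7, 10   # chunk length, action dim, Euler steps (as in the text)

def v_theta(A, t, obs):
    """Stand-in for the action expert. A real implementation would condition
    on the VLM's visual/language representations of `obs`."""
    return obs["goal"] - A                   # hypothetical velocity toward a goal

def pi0_sample_chunk(obs, rng):
    A = rng.standard_normal((H, d))          # step 2: noisy chunk ~ N(0, I)
    for n in range(N):                       # step 3: N Euler steps
        A = A + (1.0 / N) * v_theta(A, n / N, obs)
    return A                                 # step 4: predicted action chunk

rng = np.random.default_rng(0)
obs = {"goal": np.zeros((H, d))}             # hypothetical conditioning input
chunk = pi0_sample_chunk(obs, rng)
print(chunk.shape)                           # → (16, 7)
```

In deployment only the first $h$ rows of `chunk` would be executed before re-planning.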

Flow vs Diffusion in Practice

How do flow matching and diffusion compare in the context of robotic action generation?

  • Inference speed: Flow matching typically needs 5-10 ODE solver steps, compared to 20-100 for DDPM (even with DDIM). For a robot at 10 Hz with $H = 16$ action chunks: ~20-50 ms for flow vs ~50-100 ms for diffusion — roughly a 2-5× speedup, depending on step counts.
  • Training simplicity: Both losses are MSE-based and equally simple to implement. However, flow matching avoids the need to define a noise schedule ($\beta_t$), which requires careful tuning in DDPM.
  • Sample quality: At equivalent compute budgets (same number of network evaluations), flow matching tends to produce slightly higher quality samples due to straighter interpolation paths.
  • Multi-modality: Both approaches handle multi-modal distributions well. The difference is primarily in inference efficiency, not expressiveness.

The practical takeaway: flow matching achieves comparable or better quality than diffusion with significantly fewer inference steps. For robotics, where real-time inference is critical, this makes flow matching the preferred choice for new systems — and it is the approach adopted by π₀, the current state-of-the-art.

Quiz

Test your understanding of flow matching and π₀.

What is the fundamental inefficiency of DDPM that flow matching addresses?

In conditional flow matching, what is the target velocity field for a linear interpolation path from noise x₀ to data x₁?

What does the flow matching network predict, compared to DDPM?

Why does π₀ separate its action expert from the VLM backbone?

How many ODE solver steps does flow matching typically need compared to DDPM's denoising steps?