From Diffusion to Flow

Diffusion Policy showed that denoising-based generation is a powerful framework for robotic action prediction. But DDPM has a fundamental inefficiency: the paths it takes through data space are curved. The forward process spirals outward from data to noise, and the reverse process must carefully retrace these curved paths — requiring many small steps (50-100) to produce high-quality samples.

Flow matching asks a natural question: what if we could take straight-line paths instead? If we learn a velocity field that transports samples from noise to data along straight lines, we should need far fewer steps to traverse the same distance.

This is not just a theoretical nicety. In robotics, where actions must be generated in real-time (often at 10+ Hz), reducing the number of inference steps from 50-100 to 5-10 can make the difference between a feasible and infeasible system.

💡 Think of it this way: DDPM is like navigating with winding country roads — you'll get there, but it takes many turns. Flow matching is like taking a highway — a straight shot from A to B.

Conditional Flow Matching

Conditional Flow Matching (CFM) [1] [2] provides a simulation-free framework for learning continuous normalising flows. The key idea is elegant: define a simple, straight-line path between noise and data, then learn a velocity field that follows it.

We define a time-dependent interpolation between a noise sample $x_0 \sim \mathcal{N}(0, I)$ and a data sample $x_1$ from our dataset:

$$x_t = (1 - t)\, x_0 + t\, x_1, \quad t \in [0, 1]$$

This is simply linear interpolation: at $t = 0$ we have pure noise, at $t = 1$ we have clean data, and in between we have a weighted blend. The path from $x_0$ to $x_1$ is a straight line in data space.
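As a quick numerical sanity check, the interpolation can be written directly. A minimal NumPy sketch (`interp` is an illustrative helper, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)        # noise sample ~ N(0, I)
x1 = np.array([1.0, 2.0, 3.0])     # hypothetical data sample

def interp(x0, x1, t):
    # linear interpolation: x_t = (1 - t) * x0 + t * x1
    return (1.0 - t) * x0 + t * x1

print(np.allclose(interp(x0, x1, 0.0), x0))  # → True (pure noise at t = 0)
print(np.allclose(interp(x0, x1, 1.0), x1))  # → True (clean data at t = 1)
```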

The velocity along this path — how fast and in what direction we need to move — is constant and trivially computable:

$$v_{\text{target}} = \frac{dx_t}{dt} = x_1 - x_0$$

This is the direction from noise to data — a single vector, constant in time. Our neural network $v_\theta(x_t, t)$ is trained to predict this velocity at any point $(x_t, t)$ along the path.

The training loss is simply the mean squared error between the predicted and target velocities:

$$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t \sim U(0,1),\, x_0 \sim \mathcal{N}(0,I),\, x_1 \sim p_{\text{data}}} \left[ \| v_\theta(x_t, t) - (x_1 - x_0) \|^2 \right]$$
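A training step then reduces to a few lines. The sketch below (plain NumPy, with a hypothetical toy data batch `x1`) builds the interpolant and target velocity exactly as in the loss above, and checks that a predictor matching the target achieves zero loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_batch(x1, rng):
    """One CFM batch: sample t and noise, build x_t and the target velocity.
    x1: (B, D) batch of data samples."""
    B, D = x1.shape
    x0 = rng.standard_normal((B, D))        # noise samples ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(B, 1))  # t ~ U(0, 1), broadcast over D
    xt = (1.0 - t) * x0 + t * x1            # linear interpolation path
    v_target = x1 - x0                      # constant target velocity
    return xt, t, v_target

def cfm_loss(v_pred, v_target):
    # mean squared error between predicted and target velocities
    return np.mean(np.sum((v_pred - v_target) ** 2, axis=-1))

# toy check: a predictor that outputs the exact target has zero loss
x1 = rng.standard_normal((8, 2)) + 3.0      # hypothetical "data" batch
xt, t, v_tgt = cfm_training_batch(x1, rng)
print(cfm_loss(v_tgt, v_tgt))               # → 0.0
```

In practice `v_pred` would come from a neural network $v_\theta(x_t, t)$; only the batch construction and loss are shown here.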

Compare this to DDPM's loss $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$. They look almost identical — but the difference is critical. DDPM predicts noise (which tells you what to subtract), while flow matching predicts velocity (which tells you where to go). The velocity formulation leads to straighter paths and faster convergence.

At inference time, we solve an ODE starting from noise $x_0 \sim \mathcal{N}(0, I)$:

$$\frac{dx_t}{dt} = v_\theta(x_t, t)$$

Using a simple Euler solver with step size $\Delta t$:

$$x_{t + \Delta t} = x_t + \Delta t \cdot v_\theta(x_t, t)$$

With 5-10 Euler steps (i.e., $\Delta t = 0.1$ to $0.2$), flow matching produces samples of comparable quality to DDPM with 50-100 steps. This is because the learned velocity field is nearly constant along each path (since the true velocity is constant), so large steps introduce little error.
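To see why large steps are viable, consider a toy case (an assumption for illustration, not a trained network): if the data distribution is a point mass at $\mu$, the optimal velocity field has the closed form $v(x, t) = (\mu - x)/(1 - t)$, and fixed-step Euler integration recovers $\mu$ exactly even with few steps:

```python
import numpy as np

def euler_sample(v_field, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for n in range(n_steps):
        x = x + dt * v_field(x, n * dt)
    return x

# Toy velocity field: data is a point mass at mu, so the optimal CFM
# velocity given x_t is v(x, t) = (mu - x) / (1 - t).
mu = np.array([2.0, -1.0])
def v_optimal(x, t):
    return (mu - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(2)                    # start from noise
x_end = euler_sample(v_optimal, x0, n_steps=5)
print(np.allclose(x_end, mu))                  # → True
```

With a learned $v_\theta$ the field is only approximately straight, so a handful of steps introduces some error — but far less than for the curved paths of DDPM.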

📌 Technical note: CFM is closely related to "rectified flows" and can be seen as a special case of continuous normalising flows. The key insight — using straight-line interpolation paths — simplifies both training and inference compared to general flow matching formulations.

π₀ (pi-zero)

π₀ [3] from Physical Intelligence is the first VLA to combine a large vision-language model backbone with a flow matching action head. It demonstrates strong performance across a remarkable range of tasks: folding laundry, assembling boxes, cleaning tables, and bussing dishes.

The architecture has three key components:

  • VLM backbone: A pre-trained PaliGemma [4] model (3B parameters) that processes image tokens and language instruction tokens, providing the visual-linguistic understanding — "what am I looking at and what does the user want?"
  • Action expert: A set of dedicated transformer parameters (separate from the VLM) that process action chunk tokens. The action expert handles the denoising — "given the current noise level and what I understand about the scene, what should the clean action be?"
  • Flow matching head: Replaces the discrete tokenization of RT-2/OpenVLA with continuous action generation via the CFM framework. The action expert predicts the velocity field $v_\theta(A_t, t, O)$ that transports noisy action chunks toward clean ones.

A critical design choice is the separation of the action expert from the VLM backbone. During training, the VLM backbone parameters are shared between two objectives:

  • Language objective: Standard next-token prediction on text, preserving the VLM's linguistic capabilities
  • Action objective: Flow matching loss on action chunks, with gradients flowing through the action expert and into the VLM backbone

The action expert has its own set of transformer layers that interleave with the VLM backbone's layers via cross-attention. The action chunk (as a sequence of noisy action tokens) attends to the VLM's visual and language representations to extract the conditioning information it needs.

The π₀ action generation process at inference:

  1. Encode the image and language instruction through the VLM backbone.
  2. Sample a noisy action chunk $A^0 \sim \mathcal{N}(0, I) \in \mathbb{R}^{H \times d}$.
  3. Run $N$ Euler steps (typically $N = 10$):
$$A^{n+1} = A^n + \frac{1}{N} \cdot v_\theta(A^n, \tfrac{n}{N}, O)$$
  4. The final $A^N$ is the predicted action chunk. Execute the first $h$ actions and repeat.
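The loop above can be sketched end to end. Here `v_theta` is only a stand-in for the action expert — the `obs` dictionary, its `goal` entry, and the velocity rule are hypothetical placeholders; a real system would run the VLM backbone and action-expert transformer at this point:

```python
import numpy as np

H, d, N = 16, 7, 10   # chunk length, action dim, Euler steps (as in the text)

def v_theta(A, t, obs):
    """Stand-in for the action expert. A real implementation would condition
    on the VLM's visual/language representations of `obs`."""
    return obs["goal"] - A                   # hypothetical velocity toward a goal

def pi0_sample_chunk(obs, rng):
    A = rng.standard_normal((H, d))          # step 2: noisy chunk ~ N(0, I)
    for n in range(N):                       # step 3: N Euler steps
        A = A + (1.0 / N) * v_theta(A, n / N, obs)
    return A                                 # step 4: predicted action chunk

rng = np.random.default_rng(0)
obs = {"goal": np.zeros((H, d))}             # hypothetical conditioning input
chunk = pi0_sample_chunk(obs, rng)
print(chunk.shape)                           # → (16, 7)
```

In deployment only the first $h$ rows of `chunk` would be executed before re-planning.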

Flow vs Diffusion in Practice

How do flow matching and diffusion compare in the context of robotic action generation?

  • Inference speed: Flow matching typically needs 5-10 ODE solver steps, compared to 20-100 for DDPM (even with DDIM). For a robot at 10 Hz with $H = 16$ action chunks: ~20-50 ms for flow vs ~50-100 ms for diffusion — roughly a 2-5× speedup, depending on step counts.
  • Training simplicity: Both losses are MSE-based and equally simple to implement. However, flow matching avoids the need to define a noise schedule ($\beta_t$), which requires careful tuning in DDPM.
  • Sample quality: At equivalent compute budgets (same number of network evaluations), flow matching tends to produce slightly higher quality samples due to straighter interpolation paths.
  • Multi-modality: Both approaches handle multi-modal distributions well. The difference is primarily in inference efficiency, not expressiveness.

The practical takeaway: flow matching achieves comparable or better quality than diffusion with significantly fewer inference steps. For robotics, where real-time inference is critical, this makes flow matching the preferred choice for new systems — and it is the approach adopted by π₀, the current state-of-the-art.

Quiz

Test your understanding of flow matching and π₀.

What is the fundamental inefficiency of DDPM that flow matching addresses?

In conditional flow matching, what is the target velocity field for a linear interpolation path from noise x₀ to data x₁?

What does the flow matching network predict, compared to DDPM?

Why does π₀ separate its action expert from the VLM backbone?

How many ODE solver steps does flow matching typically need compared to DDPM's denoising steps?