What's Wrong with DDPM?

DDPM works. It generates stunning images, it has a rigorous mathematical foundation, and it drove the first wave of diffusion-based image generation from 2020 to 2023. But after years of practical use, three weaknesses became increasingly hard to ignore.

First, the forward process is fixed, not learned. DDPM defines a predetermined noise schedule $\{\beta_1, \ldots, \beta_T\}$ that controls how data is corrupted into noise. This schedule is designed by hand (typically linear or cosine), and the reverse process must undo this specific corruption. The model has no say in how the forward trajectories are shaped — it can only learn to reverse whatever path the fixed schedule imposes.

Second, sampling requires many steps. The forward process follows a curved path through data space — the noise schedule bends the trajectory from data toward Gaussian noise — and the reverse process must carefully retrace this winding trajectory. Each step makes only a small correction, and if you skip too many steps, the accumulated error derails the image. In practice, DDPM needs 50-1000 denoising steps to produce clean outputs. That is slow, especially for real-time applications.

Third, the math involves complex variational bounds. The DDPM training objective is derived from a variational lower bound on log-likelihood, involving terms like $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, signal-to-noise ratio weighting, and a posterior distribution $q(x_{t-1} \mid x_t, x_0)$ that requires careful algebraic manipulation. The simplified noise-prediction loss $\|\epsilon_\theta(x_t, t) - \epsilon\|^2$ hides a lot of machinery underneath.

These three issues share a common cause: the forward process dictates everything, and the model is forced to work within its constraints. What if we could sidestep the fixed diffusion process entirely? What if, instead of following curved paths from data to noise and back, we could learn straight paths? Straight paths would mean fewer steps (less distance to travel), simpler math (no noise schedule gymnastics), and a more flexible framework. That is exactly what flow matching provides.

Flow Matching: Learning Straight Paths

Flow matching (Lipman et al., 2023) reframes generative modelling as learning a flow: a time-dependent vector field that transports samples from a noise distribution to a data distribution. Instead of the two-phase diffusion setup (fixed forward corruption, learned reverse denoising), flow matching defines a direct path from noise to data and trains a neural network to follow it.

The interpolation path. The simplest choice is a straight line. Given a data sample $x_0$ and a noise sample $\epsilon \sim \mathcal{N}(0, I)$, define the intermediate point at time $t \in [0, 1]$ as:

$$x_t = (1 - t)\,\epsilon + t\,x_0$$

Let's verify this at the boundaries. At $t = 0$: $x_0^{\text{path}} = (1 - 0)\,\epsilon + 0 \cdot x_0 = \epsilon$ — pure noise. At $t = 1$: $x_1^{\text{path}} = (1 - 1)\,\epsilon + 1 \cdot x_0 = x_0$ — clean data. At $t = 0.5$: $x_{0.5} = 0.5\,\epsilon + 0.5\,x_0$ — a 50/50 blend of noise and data. The path traces a straight line in data space from the noise sample to the data sample.

Compare this with DDPM's forward process: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$. The coefficients $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1 - \bar{\alpha}_t}$ don't sum to 1 — they lie on a quarter-circle (since $(\sqrt{\bar{\alpha}_t})^2 + (\sqrt{1-\bar{\alpha}_t})^2 = 1$), which means the DDPM path curves through data space. Flow matching's linear coefficients $(1-t)$ and $t$ sum to exactly 1, producing a straight line.
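This geometric difference is easy to check numerically. A minimal sketch; the `alpha_bar` values below are illustrative placeholders, not taken from any real schedule:

```python
import math

# Flow matching: coefficients (1 - t) and t sum to 1 -> a straight line
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    assert abs((1 - t) + t - 1.0) < 1e-12

# DDPM-style: squares of the coefficients sum to 1 -> a quarter-circle.
# Their plain sum exceeds 1 away from the endpoints, so the path bows outward.
for alpha_bar in [0.99, 0.5, 0.01]:
    c_data = math.sqrt(alpha_bar)
    c_noise = math.sqrt(1 - alpha_bar)
    assert abs(c_data**2 + c_noise**2 - 1.0) < 1e-12
    print(f"alpha_bar={alpha_bar:.2f}: coefficient sum = {c_data + c_noise:.4f}")
```

The coefficient sum peaks at $\sqrt{2} \approx 1.414$ in the middle, which is exactly the outward bulge of the quarter-circle.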

💡 The convention here is $t = 0$ for noise and $t = 1$ for data, which is the opposite of DDPM's convention (where $t = 0$ is clean data and $t = T$ is noise). This reversal is intentional: in flow matching, time flows in the generation direction (from noise toward data), so larger $t$ means closer to the final clean image.

The velocity. If the path is $x_t = (1 - t)\,\epsilon + t\,x_0$, what is its velocity? Just take the time derivative:

$$v(x_t, t) = \frac{dx_t}{dt} = x_0 - \epsilon$$

The velocity is constant — it does not depend on $t$ at all. It points from the noise sample $\epsilon$ toward the data sample $x_0$, and its magnitude is the Euclidean distance between them. This constancy is exactly what makes the path straight: the direction and speed never change along the trajectory.
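A quick finite-difference check confirms the constancy (the 1D values for $x_0$ and $\epsilon$ here are made up for illustration):

```python
x0, eps = 2.0, -1.0   # illustrative 1D data and noise samples
dt = 1e-6

def x_path(t):
    # The linear interpolation path x_t = (1 - t) * eps + t * x0
    return (1 - t) * eps + t * x0

# Numerical derivative dx_t/dt at several times: always x0 - eps = 3.0
for t in [0.1, 0.5, 0.9]:
    v = (x_path(t + dt) - x_path(t)) / dt
    print(f"t={t}: velocity = {v:.4f}")
```

The printed velocity is 3.0 at every $t$, matching $x_0 - \epsilon = 2.0 - (-1.0)$.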

The training loss. We train a neural network $v_\theta(x_t, t)$ to predict this velocity. The loss is a simple mean-squared error:

$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\|v_\theta(x_t, t) - (x_0 - \epsilon)\|^2\right]$$

The training procedure is remarkably straightforward:

  • Sample $t \sim U(0, 1)$ — a uniform random timestep
  • Sample $x_0$ from the training data
  • Sample $\epsilon \sim \mathcal{N}(0, I)$ — a random noise vector from the normal distribution
  • Compute $x_t = (1 - t)\,\epsilon + t\,x_0$
  • Compute the target velocity: $x_0 - \epsilon$
  • Minimise $\|v_\theta(x_t, t) - (x_0 - \epsilon)\|^2$
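The steps above fit in a few lines of plain Python. This is a sketch with a toy 1D "dataset"; the function name `make_training_example` is ours, and a real implementation would feed the output to a network and an optimiser:

```python
import random

def make_training_example(dataset):
    """Produce one flow matching training tuple: (x_t, t, target velocity)."""
    t = random.random()              # t ~ U(0, 1)
    x0 = random.choice(dataset)      # a data sample
    eps = random.gauss(0.0, 1.0)     # eps ~ N(0, 1) (1D here)
    x_t = (1 - t) * eps + t * x0     # linear interpolation
    target = x0 - eps                # the constant velocity along this path
    return x_t, t, target

random.seed(0)
toy_data = [1.5, -0.3, 2.2, 0.7]     # stand-in for real training samples
x_t, t, target = make_training_example(toy_data)
print(f"x_t={x_t:+.3f}  t={t:.3f}  target={target:+.3f}")
# A real training step would now minimise (v_theta(x_t, t) - target)**2
```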

Notice what is absent: no noise schedule $\{\beta_t\}$, no cumulative product $\bar{\alpha}_t$, no signal-to-noise ratio weighting. The training target is simply $x_0 - \epsilon$, the difference between data and noise. This is arguably even simpler than DDPM's noise-prediction loss $\|\epsilon_\theta(x_t, t) - \epsilon\|^2$, because the interpolation formula itself is simpler (linear coefficients instead of square-root products).

Sampling. To generate a new image, start from a pure-noise initial state $x_{t=0} = \epsilon \sim \mathcal{N}(0, I)$ and solve the ordinary differential equation (ODE):

$$\frac{dx}{dt} = v_\theta(x_t, t)$$

from $t = 0$ to $t = 1$. The simplest solver is Euler's method: divide $[0, 1]$ into $N$ steps of size $\Delta t = 1/N$, and iterate:

$$x_{t + \Delta t} = x_t + \Delta t \cdot v_\theta(x_t, t)$$

Here is the key payoff of straight paths: if the trajectories of the learned vector field were perfectly straight, a single Euler step would suffice (a straight line needs only its start point and direction). In practice, the learned field is not perfectly straight — there are many data points pulling in different directions — but it is much straighter than DDPM's curved reverse trajectories. This means fewer Euler steps are needed for the same quality. Where DDPM might need 50-1000 steps, flow matching typically works well with 20-50 steps.
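The payoff is easy to demonstrate with an oracle velocity field for a single (noise, data) pair. A deliberately idealised sketch: the learned field averages over many pairs and is never this straight, and `euler_sample` is our own helper, not a library function:

```python
def euler_sample(v_field, x_start, n_steps):
    """Integrate dx/dt = v_field(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x_start, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_field(x, i * dt)
    return x

# Oracle field for one pair: the constant velocity x0 - eps
eps, x0 = -1.2, 0.8
oracle = lambda x, t: x0 - eps

for n in [1, 4, 50]:
    print(f"{n:>2} step(s): x(1) = {euler_sample(oracle, eps, n):.4f}")
# Every step count lands on x0 = 0.8: a constant field needs only one step
```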

import math

# Compare path coefficients: DDPM vs flow matching
print("t     | DDPM coeff(x0)  DDPM coeff(eps) | FM coeff(x0)  FM coeff(eps)")
print("-" * 78)

T_ddpm = 1000
beta_start, beta_end = 0.0001, 0.02

for t_frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
    # Flow matching: linear
    fm_x0 = t_frac
    fm_eps = 1 - t_frac

    # DDPM: compute alpha_bar at the equivalent timestep.
    # DDPM time runs data (t=0) -> noise (t=T), the reverse of the flow
    # matching convention, so flip t_frac to align the two columns.
    t_step = int((1 - t_frac) * (T_ddpm - 1))
    alpha_bar = 1.0
    for s in range(t_step + 1):
        beta_s = beta_start + (beta_end - beta_start) * s / (T_ddpm - 1)
        alpha_bar *= (1 - beta_s)
    ddpm_x0 = math.sqrt(alpha_bar)
    ddpm_eps = math.sqrt(1 - alpha_bar)

    print(f"{t_frac:.2f}  |   {ddpm_x0:.4f}          {ddpm_eps:.4f}       |   {fm_x0:.4f}         {fm_eps:.4f}")

print()
print("DDPM coefficients follow a quarter-circle (squares sum to 1)")
print("Flow matching coefficients follow a straight line (values sum to 1)")

Rectified Flows: Making Paths Even Straighter

The linear interpolation $x_t = (1 - t)\,\epsilon + t\,x_0$ defines straight paths between individual noise-data pairs. But the learned vector field $v_\theta(x_t, t)$ must handle all possible pairs simultaneously. At any point $x_t$ in space, the network sees contributions from many different data samples pulling in different directions. The result: even though each training pair defines a straight path, the averaged vector field that the network learns can produce curved trajectories when you actually integrate the ODE. The paths cross and interfere with each other.

Rectified Flow (Liu et al., 2023) solves this with an elegant iterative procedure called rectification. The idea is:

  • Step 1: Train a flow matching model $v_\theta$ on random (noise, data) pairs using the standard loss.
  • Step 2: Use the trained model to generate new data samples: start from noise $\epsilon_i$ and integrate the ODE to get $\hat{x}_i$. Now you have paired trajectories $(\epsilon_i, \hat{x}_i)$ that the model actually traverses.
  • Step 3: Train a new model on these $(\epsilon_i, \hat{x}_i)$ pairs. Since these pairs were generated by following the model's own vector field, the straight-line interpolation between them is much closer to the actual ODE trajectory.
  • Repeat: Each round of rectification makes the paths straighter. After 2-3 rounds, the trajectories are nearly linear.

Why does this work? The first model maps random noise samples to data samples, but the pairing is arbitrary — noise vector $\epsilon_i$ has no special relationship with data point $x_j$, so their straight-line interpolation may cross other trajectories. After one round of generation, $\epsilon_i$ and $\hat{x}_i$ are related — they are the endpoints of the same ODE trajectory. Training on these paired endpoints means the straight-line interpolation now approximates the actual path the model takes, reducing the interference between crossing trajectories.

The practical consequence is dramatic: after rectification, the paths are straight enough that 1-4 Euler steps can produce high-quality samples. Compare that to 20-50 steps for standard flow matching or 50-1000 for DDPM. This is not just a speedup — it enables real-time generation and makes diffusion-style models viable for latency-sensitive applications. This is also what makes flow matching attractive for robotics applications, where actions must be generated at 10+ Hz.

Rectified flows are the foundation of the latest generation of image generators. Stable Diffusion 3 and Flux both use rectified flow matching instead of DDPM, precisely because straighter paths mean fewer sampling steps at inference time.

# Demonstrate how rectification straightens paths
# We simulate a 1D example: noise -> data mapping

# Suppose we have 4 noise-data pairs
pairs_round0 = [
    (-2.0,  3.0),   # noise=-2, data=3
    (-1.0,  1.0),   # noise=-1, data=1
    ( 0.5, -0.5),   # noise=0.5, data=-0.5
    ( 1.5,  2.5),   # noise=1.5, data=2.5
]

def path_curvature(pairs):
    """Measure how much paths cross by counting intersections at t=0.5"""
    midpoints = [(1 - 0.5) * n + 0.5 * d for n, d in pairs]
    crossings = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            # Paths cross if ordering of endpoints flips
            noise_order = pairs[i][0] < pairs[j][0]
            mid_order = midpoints[i] < midpoints[j]
            if noise_order != mid_order:
                crossings += 1
    return crossings

crossings_r0 = path_curvature(pairs_round0)

# After "rectification": re-pair by sorting (simulating ODE endpoint matching)
noises = sorted([n for n, d in pairs_round0])
datas = sorted([d for n, d in pairs_round0])
pairs_round1 = list(zip(noises, datas))  # Monotone pairing = no crossings

crossings_r1 = path_curvature(pairs_round1)

print("Rectification reduces path crossings")
print("=" * 55)
print(f"\nRound 0 (random pairing):")
for n, d in pairs_round0:
    print(f"  noise={n:+.1f} -> data={d:+.1f}  (velocity={d-n:+.1f})")
print(f"  Path crossings at t=0.5: {crossings_r0}")

print(f"\nRound 1 (rectified pairing):")
for n, d in pairs_round1:
    print(f"  noise={n:+.1f} -> data={d:+.1f}  (velocity={d-n:+.1f})")
print(f"  Path crossings at t=0.5: {crossings_r1}")

print(f"\nCrossings reduced from {crossings_r0} to {crossings_r1}")
print("Fewer crossings = straighter learned vector field = fewer steps needed")

Flow Matching vs DDPM: A Side-by-Side

Both DDPM and flow matching learn to map noise to data, but they differ in almost every detail of how that mapping is defined, trained, and sampled. Let's lay the two approaches side by side.

What the model predicts. DDPM trains a network $\epsilon_\theta(x_t, t)$ to predict the noise that was added at timestep $t$. Flow matching trains a network $v_\theta(x_t, t)$ to predict the velocity — the direction and magnitude of the step from noise toward data. These are related: if you know the noise $\epsilon$ and the data $x_0$, the velocity is $v = x_0 - \epsilon$. But the framing matters, because it determines how the network's output is used during sampling.

How sampling works. DDPM generates images via iterative denoising: start from noise $x_T \sim \mathcal{N}(0, I)$, and at each step predict the noise, subtract a scaled version of it, and optionally add a small amount of fresh noise (the stochastic part of DDPM). Flow matching generates images via ODE integration: start from a noise sample at $t = 0$ and integrate the velocity field forward using Euler steps (or a higher-order solver). The flow matching sampler is deterministic — no noise is added during sampling.

Path geometry. DDPM's paths are curved because the forward process coefficients $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1 - \bar{\alpha}_t}$ trace a quarter-circle. The reverse process must follow these curves, making each step a small angular correction. Flow matching's paths are straight (or nearly so, after rectification). Straighter paths mean larger steps are safe, because the direction changes less between steps.

Practical consequences:

  • Step count: Flow matching typically needs 20-50 steps for high quality. With rectification, 1-4 steps. DDPM needs 50-1000 steps, or 20-50 with advanced schedulers like DDIM.
  • Training simplicity: Flow matching has no noise schedule $\{\beta_t\}$, no $\bar{\alpha}_t$ products, no SNR weighting. The interpolation is linear and the target is $x_0 - \epsilon$.
  • Flexibility: Flow matching works with any source distribution, not just Gaussian noise. You could in principle transport from one image distribution to another, though Gaussian noise remains the standard starting point.
  • Network architecture: Both can use the same denoiser backbone (U-Net or DiT). The architecture is agnostic — only the training objective and sampling algorithm change.

import json, js

rows = [
    ["Prediction target", "Noise (epsilon)", "Velocity (x0 - epsilon)"],
    ["Interpolation", "sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps", "(1-t)*eps + t*x0"],
    ["Path shape", "Curved (quarter-circle)", "Straight (linear)"],
    ["Sampling method", "Iterative denoising (reverse SDE)", "ODE integration (Euler)"],
    ["Typical steps", "50-1000 (20-50 with DDIM)", "20-50 (1-4 with rectification)"],
    ["Noise schedule", "Required (beta_t, alpha_bar_t)", "Not needed"],
    ["Training loss", "||eps_theta - eps||^2 (+ SNR weighting)", "||v_theta - (x0 - eps)||^2"],
    ["Stochasticity", "Optional noise at each step", "Deterministic ODE"],
    ["Source distribution", "Gaussian only", "Any distribution"],
]

js.window.py_table_data = json.dumps({
    "headers": ["Property", "DDPM", "Flow Matching"],
    "rows": rows
})

print("DDPM vs Flow Matching: key differences across 9 dimensions")

import math, json, js

# Visualise the path geometry difference
# DDPM: coefficients trace a quarter-circle
# Flow matching: coefficients trace a straight line

T = 1000
beta_start, beta_end = 0.0001, 0.02

rows = []
for t_frac in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    # Flow matching
    fm_sig = t_frac
    fm_noise = 1.0 - t_frac

    # DDPM (flip t_frac: DDPM time runs data -> noise, opposite of FM)
    t_step = int((1 - t_frac) * (T - 1))
    alpha_bar = 1.0
    for s in range(t_step + 1):
        beta_s = beta_start + (beta_end - beta_start) * s / (T - 1)
        alpha_bar *= (1 - beta_s)
    ddpm_sig = math.sqrt(alpha_bar)
    ddpm_noise = math.sqrt(1 - alpha_bar)

    rows.append([f"{t_frac:.1f}", f"{ddpm_sig:.4f}", f"{ddpm_noise:.4f}", f"{fm_sig:.4f}", f"{fm_noise:.4f}"])

js.window.py_table_data = json.dumps({
    "headers": ["t", "DDPM signal", "DDPM noise", "FM signal", "FM noise"],
    "rows": rows
})

print("DDPM: signal^2 + noise^2 = 1 (quarter-circle in coefficient space)")
print("FM:   signal   + noise   = 1 (straight line in coefficient space)")
print("\nThe straight line means uniform progress from noise to data.")
print("The quarter-circle means most change is compressed into the middle timesteps.")

Why SD3 and Flux Chose Flow Matching

Stable Diffusion 3 (Esser et al., 2024) marked a decisive break from DDPM. It adopted rectified flow matching as its training framework, paired with the MMDiT transformer architecture we covered in article 5. Flux (Black Forest Labs, 2024) followed the same approach. The shift from DDPM to flow matching in 2023-2024 mirrors the parallel shift from U-Net to transformer denoisers: in both cases, the field moved toward simpler, more scalable foundations.

But SD3 did not adopt vanilla flow matching unchanged. It introduced a key training refinement: logit-normal timestep sampling. In standard flow matching, we sample $t \sim U(0,1)$ — every timestep gets equal training weight. But not all timesteps are equally important. At $t \approx 0$ the input is nearly pure noise and the model's job is trivial (predict roughly $x_0 - \epsilon$, which is close to just predicting the data mean). At $t \approx 1$ the input is nearly clean data and again the model's job is easy (small corrections). The hard decisions happen in the middle range ($t \approx 0.3$ to $0.7$), where the model must resolve ambiguous structure — is that blob a face or a building?

Logit-normal sampling concentrates training on these critical middle timesteps. Instead of $t \sim U(0, 1)$, SD3 samples:

$$t = \sigma\!\left(z\right), \quad z \sim \mathcal{N}(0, 1)$$

where $\sigma$ is the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$. Let's verify what this does at the boundaries. When $z \to -\infty$: $\sigma(z) \to 0$, so $t \to 0$ (extreme noise). When $z \to +\infty$: $\sigma(z) \to 1$, so $t \to 1$ (clean data). When $z = 0$: $\sigma(0) = 0.5$, so $t = 0.5$ (the midpoint). Since $z \sim \mathcal{N}(0, 1)$, most samples of $z$ are near 0, which means most samples of $t$ are near 0.5 — exactly the middle timesteps where training signal matters most. The tails ($t$ near 0 or 1) are sampled rarely, which is appropriate since the model's task there is easier.

import math, json, js

# Show how logit-normal sampling concentrates on middle timesteps
# Compare: uniform vs logit-normal distribution of t

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Sample from logit-normal: z ~ N(0,1), t = sigmoid(z)
# We'll compute the density at various t values
# For logit-normal: p(t) = (1 / (t*(1-t))) * phi(logit(t))
# where phi is the standard normal pdf and logit(t) = log(t/(1-t))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def logit_normal_pdf(t):
    if t <= 0.001 or t >= 0.999:
        return 0.0
    logit_t = math.log(t / (1 - t))
    return normal_pdf(logit_t) / (t * (1 - t))

rows = []
for t in [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]:
    u_density = 1.0  # uniform density is 1 everywhere
    ln_density = logit_normal_pdf(t)
    ratio = ln_density / u_density
    rows.append([f"{t:.2f}", f"{u_density:.3f}", f"{ln_density:.3f}", f"{ratio:.2f}x"])

js.window.py_table_data = json.dumps({
    "headers": ["t", "Uniform", "Logit-Normal", "Ratio"],
    "rows": rows
})

print("Logit-normal samples t=0.5 about 1.6x more often than uniform,")
print("while t=0.05 and t=0.95 are sampled ~3x less often.")
print("This focuses training on the 'interesting' middle timesteps.")

The combined recipe behind SD3 and Flux is: (1) rectified flow matching for the training framework (straight paths, velocity prediction, ODE sampling), (2) a transformer-based denoiser (DiT/MMDiT) instead of a U-Net, and (3) logit-normal timestep sampling to focus training where it matters. Each component is simpler than what it replaced, and together they produce state-of-the-art image quality with fewer sampling steps.

💡 The shift to flow matching also unifies image generation with other domains. The same framework is used for robotics action prediction (flow matching policies generate robot trajectories in 5-10 steps instead of 50-100). See the article on flow matching for robotics for more details.

Quiz

Test your understanding of flow matching and rectified flows.

In flow matching, the interpolation path is $x_t = (1 - t)\epsilon + t\,x_0$. What is the velocity $dx_t/dt$?

What problem does rectification solve in flow matching?

Why does SD3 use logit-normal timestep sampling instead of uniform sampling?

How does flow matching's interpolation path differ from DDPM's forward process geometrically?