How Do You Generate an Image from Noise?
Suppose you have thousands of photographs of faces and you want a model that can generate new, realistic faces that were never photographed. More formally, you have samples from some unknown data distribution $p(x)$, and you want to learn a model that can draw new samples from that same distribution. This is the generative modelling problem, and for years the dominant approach was Generative Adversarial Networks (GANs): train a generator to produce fake images and a discriminator to tell fakes from real images, letting the two networks play a minimax game until the generator's outputs are indistinguishable from the real data. GANs produced stunning images, but they were notoriously difficult to train. Mode collapse (the generator learns to produce only a few types of outputs, ignoring the diversity of the real data) and training instability (the generator and discriminator oscillate instead of converging) plagued practitioners and required extensive hyperparameter tuning.
Diffusion models take a completely different approach. Instead of learning to generate an image directly from random noise in one shot, they learn to denoise. The core idea is surprisingly simple: take a clean image, gradually corrupt it by adding noise over many steps until it becomes pure static, then train a neural network to reverse each step. If the network learns to undo one small step of noise addition, we can chain those reversals together: start from pure noise and iteratively denoise until a clean image emerges. The model never has to solve the impossible problem of mapping random noise to a coherent image in one step. Instead, it solves a much easier problem (predict the noise that was added) over and over, once per step in the chain.
This two-phase structure defines every diffusion model. The forward process (also called the diffusion process) gradually destroys data by adding Gaussian noise over $T$ steps. It requires no learning — it is a fixed, known procedure. The reverse process learns to undo the destruction step by step, recovering structure from noise. The seminal paper that made this practical was (Ho et al., 2020), which showed that a simple noise-prediction objective is all you need. The rest of this article unpacks exactly how that works.
The Forward Process: Gradually Destroying Data
Start with a clean image $x_0$ drawn from your training set. The forward process defines a chain of increasingly noisy versions $x_1, x_2, \ldots, x_T$, where each step adds a small amount of Gaussian noise controlled by a noise schedule $\{\beta_1, \beta_2, \ldots, \beta_T\}$. The transition from step $t-1$ to step $t$ is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
This says: to get $x_t$, take the previous image $x_{t-1}$, scale it down by $\sqrt{1 - \beta_t}$, and add Gaussian noise with variance $\beta_t$. The parameter $\beta_t$ is the noise level at step $t$, typically starting very small ($\beta_1 = 0.0001$) and increasing to $\beta_T = 0.02$ over $T = 1000$ steps. The scaling factor $\sqrt{1 - \beta_t}$ shrinks the signal slightly at each step, while $\beta_t$ injects noise. Together, these two terms ensure the overall variance stays bounded and doesn't explode to infinity across the chain.
Let's check the boundary cases to build intuition. When $\beta_t = 0$, the scaling factor becomes $\sqrt{1 - 0} = 1$ and the noise variance is $0$, so $x_t = x_{t-1}$ exactly — nothing happens. When $\beta_t = 1$, the scaling factor becomes $\sqrt{1 - 1} = 0$ and the noise variance is $1$, so the original signal is completely erased and replaced by pure Gaussian noise in a single step. In practice, each $\beta_t$ is very small, so each individual step barely changes the image. But after $T = 1000$ such steps, the cumulative effect is total destruction: $x_T \approx \mathcal{N}(0, I)$, pure Gaussian noise with no trace of the original image.
A crucial insight from the DDPM paper is that we don't need to iterate through all $t$ steps to get $x_t$. Define $\alpha_t = 1 - \beta_t$ and the cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then we can jump directly from the clean image $x_0$ to any noisy version $x_t$:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)$$
In practice, this means we can sample $x_t$ directly using the reparameterisation trick:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
This is a weighted sum of the original image and random noise. The coefficient $\sqrt{\bar{\alpha}_t}$ controls how much of the original signal survives, while $\sqrt{1 - \bar{\alpha}_t}$ controls how much noise is added. Since $\bar{\alpha}_t$ is a cumulative product of values slightly less than 1, it decays towards zero as $t$ increases. At $t = 0$, $\bar{\alpha}_0 = 1$, so $x_0 = 1 \cdot x_0 + 0 \cdot \epsilon$ — the clean image, untouched. At $t = T$, $\bar{\alpha}_T \approx 0$, so $x_T \approx 0 \cdot x_0 + 1 \cdot \epsilon = \epsilon$ — pure noise. The signal fades smoothly from full strength to nothing.
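To sanity-check the closed-form shortcut, the short sketch below (using the same linear schedule that appears later in this article) compounds the per-step forward updates and confirms that the surviving signal scale equals $\sqrt{\bar{\alpha}_t}$ and the accumulated noise variance equals $1 - \bar{\alpha}_t$.
import math
# Check that compounding the per-step forward updates matches the closed-form jump.
# Uses the same linear schedule as the rest of this article.
T = 1000
betas = [0.0001 + (0.02 - 0.0001) * t / (T - 1) for t in range(T)]
signal_scale = 1.0   # running product of sqrt(1 - beta_s): the surviving signal coefficient
noise_var = 0.0      # running noise variance after t steps
for t, beta in enumerate(betas, start=1):
    signal_scale *= math.sqrt(1.0 - beta)
    # each step shrinks the existing noise variance by (1 - beta) and adds beta
    noise_var = (1.0 - beta) * noise_var + beta
    if t in (1, 250, 500, 1000):
        alpha_bar_t = signal_scale ** 2
        print(f"t={t:4d}  signal_scale={signal_scale:.4f}  "
              f"noise_var={noise_var:.4f}  1 - alpha_bar_t={1 - alpha_bar_t:.4f}")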
The plot below shows how $\bar{\alpha}_t$ decays over 1000 timesteps using a linear noise schedule ($\beta_t$ increasing linearly from $0.0001$ to $0.02$). Notice that the signal strength drops slowly at first (the image is barely affected for the first few hundred steps), then accelerates in the middle, and by step 700-800 the original signal is almost entirely gone.
import math, json, js
T = 1000
beta_start = 0.0001
beta_end = 0.02
# Linear noise schedule
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
# Compute cumulative alpha_bar
alpha_bar = []
cumulative = 1.0
for beta in betas:
cumulative *= (1.0 - beta)
alpha_bar.append(cumulative)
# Also compute signal and noise coefficients
signal_coeff = [math.sqrt(ab) for ab in alpha_bar]
noise_coeff = [math.sqrt(1.0 - ab) for ab in alpha_bar]
timesteps = list(range(T))
plot_data = [
{
"title": "Forward process: signal decay over 1000 steps (linear schedule)",
"x_label": "Timestep t",
"y_label": "Coefficient value",
"x_data": timesteps,
"lines": [
{"label": "alpha_bar_t (signal preserved)", "data": alpha_bar, "color": "#3b82f6"},
{"label": "sqrt(alpha_bar_t) (signal coeff)", "data": signal_coeff, "color": "#10b981"},
{"label": "sqrt(1 - alpha_bar_t) (noise coeff)", "data": noise_coeff, "color": "#ef4444"}
]
}
]
js.window.py_plot_data = json.dumps(plot_data)
print(f"At t=0: alpha_bar = {alpha_bar[0]:.6f} (image is ~100% signal)")
print(f"At t=250: alpha_bar = {alpha_bar[249]:.6f} (still mostly signal)")
print(f"At t=500: alpha_bar = {alpha_bar[499]:.6f} (signal fading fast)")
print(f"At t=750: alpha_bar = {alpha_bar[749]:.6f} (almost all noise)")
print(f"At t=999: alpha_bar = {alpha_bar[999]:.6f} (essentially pure noise)")
The Reverse Process: Learning to Denoise
The forward process gives us a recipe for destroying images. Now we need to reverse it: given a noisy image $x_t$, recover the slightly less noisy image $x_{t-1}$. If we could do this perfectly for every step, we could start from pure noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoise all the way back to a clean image $x_0$. The remarkable fact (proven by Feller and others in the theory of stochastic processes) is that if each forward step adds only a small amount of noise, then the reverse step is also approximately Gaussian. So we parameterise the reverse process as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$
Here $\mu_\theta(x_t, t)$ is the predicted mean — the model's best guess at the centre of the distribution over $x_{t-1}$ given the noisy input $x_t$ and the timestep $t$. The variance $\sigma_t^2$ is typically not learned but fixed to either $\beta_t$ or $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$ (both work well in practice). The neural network only needs to learn the mean, which is where all the interesting work happens.
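As a quick illustration of how close the two fixed variance choices are, the sketch below (reusing the linear schedule from earlier) computes $\beta_t$ and $\tilde{\beta}_t$ at a handful of timesteps. They differ noticeably only near the start of the chain and converge as $t$ grows, which is one reason either choice works well.
# Compare the two common fixed choices for the reverse-process variance:
# sigma_t^2 = beta_t versus sigma_t^2 = beta_tilde_t (the posterior variance).
T = 1000
betas = [0.0001 + (0.02 - 0.0001) * t / (T - 1) for t in range(T)]
alpha_bar = []
cum = 1.0
for b in betas:
    cum *= (1.0 - b)
    alpha_bar.append(cum)
for i in [1, 10, 100, 500, 999]:  # 0-indexed positions in the schedule
    beta_t = betas[i]
    beta_tilde = (1 - alpha_bar[i - 1]) / (1 - alpha_bar[i]) * beta_t
    print(f"step {i + 1:4d}: beta_t = {beta_t:.6f}, beta_tilde_t = {beta_tilde:.6f}")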
The key insight from (Ho et al., 2020) is that instead of predicting the mean $\mu_\theta$ directly, it is much more effective to predict the noise $\epsilon_\theta(x_t, t)$ that was added to create $x_t$. Why? Because once you know the noise, you can recover the mean analytically. Recall that $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. If the model predicts $\epsilon$, we can solve for $x_0$ and then compute the posterior mean. The resulting formula for the predicted mean is:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$
Let's unpack every piece. The outer factor $\frac{1}{\sqrt{\alpha_t}}$ rescales the result to undo the signal shrinkage from the forward step. Inside the parentheses, $x_t$ is the noisy input we currently have, and $\frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)$ is the model's estimate of how much noise is embedded in $x_t$, appropriately scaled. The ratio $\frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}$ appears because $\beta_t$ is the noise added at this step and $\sqrt{1 - \bar{\alpha}_t}$ is the total noise accumulated up to step $t$. Subtracting the estimated noise from $x_t$ and rescaling gives us the predicted mean of $x_{t-1}$.
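The intermediate step of solving for $x_0$ is worth seeing on its own. Rearranging $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ gives $\hat{x}_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta) / \sqrt{\bar{\alpha}_t}$. The scalar sketch below uses made-up values (not a trained model) to show that a perfect noise prediction recovers $x_0$ exactly, while a small prediction error is amplified when $\bar{\alpha}_t$ is small.
import math
# Recovering x_0 from a noise prediction (illustrative scalar example with
# made-up values; no trained network involved).
x_0 = 3.0
alpha_bar_t = 0.25          # a mid-chain timestep: 25% of the variance is signal
eps_true = 0.7              # the noise that was actually added
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1 - alpha_bar_t) * eps_true
for label, eps_pred in [("perfect prediction", eps_true), ("10% error", 1.1 * eps_true)]:
    x_0_hat = (x_t - math.sqrt(1 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    print(f"{label:18s}: x_0_hat = {x_0_hat:.4f}  (true x_0 = {x_0})")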
With the noise-prediction reparameterisation, the training loss becomes remarkably simple:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2\right]$$
That's it. The loss is the mean squared error between the actual noise $\epsilon$ that was sampled and the model's prediction $\epsilon_\theta(x_t, t)$. The expectation is over three random variables: a uniformly sampled timestep $t \sim \text{Uniform}(1, T)$, a clean training image $x_0$ from the dataset, and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$. This loss is derived from the variational lower bound (the ELBO) on the log-likelihood, but Ho et al. showed that this simplified version actually works better in practice than the theoretically tighter bound.
The training algorithm is straightforward. In each iteration: (1) sample a clean image $x_0$ from your dataset, (2) sample a random timestep $t$, (3) sample noise $\epsilon \sim \mathcal{N}(0, I)$, (4) create the noisy image $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, (5) feed $x_t$ and $t$ to the neural network to get $\epsilon_\theta(x_t, t)$, and (6) compute the MSE loss between $\epsilon$ and $\epsilon_\theta$. Backpropagate and repeat. The network (typically a U-Net architecture) takes a noisy image and a timestep embedding as input and outputs a noise prediction with the same dimensions as the image.
import torch
import torch.nn.functional as F
# model: U-Net that predicts noise given (x_t, t)
# alpha_bar: precomputed cumulative products, a 1-D tensor of length T
for epoch in range(num_epochs):
    for x_0 in dataloader:                           # (1) clean images
        t = torch.randint(0, T, (x_0.shape[0],))     # (2) random timestep per image (0-indexed)
        eps = torch.randn_like(x_0)                  # (3) sample noise
        # (4) create noisy images using the closed-form shortcut
        ab = alpha_bar[t].view(-1, 1, 1, 1)          # broadcast over image dimensions
        x_t = ab.sqrt() * x_0 + (1 - ab).sqrt() * eps
        # (5) predict the noise
        eps_pred = model(x_t, t)
        # (6) simple MSE loss between true and predicted noise
        loss = F.mse_loss(eps_pred, eps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
To make this tangible, the code below demonstrates the forward and reverse logic on 1D data. We take a simple "signal" (a single scalar value), add noise to it according to the forward process, and then show what happens in a reverse step when the noise is predicted perfectly versus when it is predicted with some error.
import math, json, js
# --- 1D diffusion demo: forward + reverse on scalar data ---
# Our "image" is a single value
x_0 = 3.0
# Noise schedule (linear, 20 steps for clarity)
T = 20
beta_start, beta_end = 0.01, 0.3
betas = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
# Precompute alpha_bar
alpha_bar = []
cum = 1.0
for b in betas:
cum *= (1.0 - b)
alpha_bar.append(cum)
# Forward process: noise x_0 at each timestep
# In real diffusion, epsilon ~ N(0, I) is sampled fresh; here we fix a single
# value so the demo is deterministic and easy to follow.
eps_true = 0.7  # pretend this was sampled from N(0, 1)
forward_signal = [x_0]
forward_noise_level = [0.0]
for t in range(T):
ab = alpha_bar[t]
x_t = math.sqrt(ab) * x_0 + math.sqrt(1 - ab) * eps_true
forward_signal.append(x_t)
forward_noise_level.append(1 - ab)
# Reverse process: plug the (predicted) noise into the formula for the mean of x_{t-1}
# mu_theta = (1/sqrt(alpha_t)) * (x_t - beta_t/sqrt(1-alpha_bar_t) * eps_pred)
print("=== Forward Process (destroying signal) ===")
print(f"x_0 = {x_0:.4f} (clean)")
for t in [4, 9, 14, 19]:
ab = alpha_bar[t]
x_t = math.sqrt(ab) * x_0 + math.sqrt(1 - ab) * eps_true
print(f"x_{t+1:2d} = {x_t:.4f} (alpha_bar = {ab:.4f}, signal: {math.sqrt(ab)*100:.1f}%, noise: {math.sqrt(1-ab)*100:.1f}%)")
print()
print("=== Reverse Process (one step, t=10 -> t=9) ===")
t = 9 # 0-indexed
ab = alpha_bar[t]
alpha_t = 1 - betas[t]
x_t = math.sqrt(ab) * x_0 + math.sqrt(1 - ab) * eps_true
# Perfect noise prediction
eps_pred_perfect = eps_true
mu_perfect = (1/math.sqrt(alpha_t)) * (x_t - betas[t]/math.sqrt(1 - ab) * eps_pred_perfect)
# Imperfect noise prediction (20% error)
eps_pred_bad = eps_true * 1.2
mu_bad = (1/math.sqrt(alpha_t)) * (x_t - betas[t]/math.sqrt(1 - ab) * eps_pred_bad)
# What x_{t-1} should be
ab_prev = alpha_bar[t-1]
x_t_minus_1_true = math.sqrt(ab_prev) * x_0 + math.sqrt(1 - ab_prev) * eps_true
print(f"x_10 = {x_t:.4f}")
print(f"True x_9 = {x_t_minus_1_true:.4f}")
print(f"Predicted mean (perfect eps): mu = {mu_perfect:.4f}")
print(f"Predicted mean (20% eps error): mu = {mu_bad:.4f}")
print()
print("Key insight: the model only needs to predict the noise accurately.")
Sampling: From Noise to Image
Once the model is trained, generating a new image is a matter of running the reverse process from start to finish. We begin by sampling pure noise $x_T \sim \mathcal{N}(0, I)$ and then iterate backwards from $t = T$ to $t = 1$. At each step, the model predicts the noise $\epsilon_\theta(x_t, t)$, we compute the mean $\mu_\theta(x_t, t)$ using the formula above, and then sample $x_{t-1}$ by adding a small amount of fresh Gaussian noise (for stochasticity). The sampling step is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$
The first term is the predicted mean (subtracting the predicted noise and rescaling), and the second term $\sigma_t z$ adds fresh randomness. This stochasticity is important: it means different random seeds produce different images, giving us diversity in generation. At the very last step ($t = 1$), we typically set $z = 0$ (no added noise) so the final output is deterministic given the trajectory so far.
The obvious problem is speed. The original DDPM uses $T = 1000$ steps, and each step requires a full forward pass through the neural network (a large U-Net with hundreds of millions of parameters). Generating a single 256x256 image takes about 20 seconds on a modern GPU. Compare this to a GAN, which generates an image in a single forward pass (milliseconds). This speed gap motivated a wave of research into faster samplers.
The first breakthrough was DDIM (Denoising Diffusion Implicit Models) (Song et al., 2020), which reinterprets the reverse process as a deterministic ODE (no added noise at each step), allowing it to skip steps. Instead of denoising through all 1000 timesteps, DDIM selects a subsequence (say, steps 1000, 950, 900, ..., 50, 0) and jumps between them. This reduces 1000 neural network evaluations to 50 or even 20, with only a modest drop in quality. The key insight is that the deterministic trajectory is smoother and more predictable than the stochastic one, so larger jumps are possible without losing coherence.
Modern samplers have pushed this even further. DPM-Solver treats the reverse process as an ODE and uses higher-order numerical methods (analogous to Runge-Kutta) to take fewer, more accurate steps. Euler and Heun samplers from the score-based diffusion framework achieve similar speedups. Today, state-of-the-art image generators like Stable Diffusion typically use 20-50 sampling steps, making generation fast enough for interactive use (1-3 seconds per image). The quality-speed tradeoff remains an active research area, but the gap has narrowed dramatically since the original 1000-step DDPM.
import math
import torch
# model: trained noise-prediction network
# alpha, alpha_bar, beta, sigma: precomputed schedule values, indexed by timestep t = 1..T
# Start from pure noise
x = torch.randn(1, 3, 256, 256)                  # random noise "image"
for t in reversed(range(1, T + 1)):
    # Predict the noise in x_t
    eps_pred = model(x, t)
    # Compute the predicted mean of x_{t-1}
    mu = (1 / math.sqrt(alpha[t])) * (x - beta[t] / math.sqrt(1 - alpha_bar[t]) * eps_pred)
    if t > 1:
        # Add fresh stochastic noise (skipped at the final step)
        z = torch.randn_like(x)
        x = mu + sigma[t] * z
    else:
        x = mu                                   # final step: no added noise
# x is now a generated image
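For comparison, the DDIM-style sampler described above can be written in the same pseudocode style. This is a minimal sketch of the deterministic ($\eta = 0$) update under the same assumptions as the loop above (`model` predicts noise, `alpha_bar` holds the cumulative products): the noise prediction is first used to estimate $x_0$, and the sample is then re-noised directly to the next timestep in a short subsequence.
import math
import torch
# Deterministic DDIM sampling sketch (eta = 0): denoise along a short
# subsequence of timesteps instead of all T of them.
# Assumes model(x, t) predicts noise and alpha_bar is indexed by t = 1..T,
# exactly as in the DDPM sampling loop above.
timesteps = list(range(T, 0, -50))               # e.g. steps 1000, 950, ..., 50
x = torch.randn(1, 3, 256, 256)                  # start from pure noise
for i, t in enumerate(timesteps):
    eps_pred = model(x, t)
    # Estimate the clean image implied by the current sample and noise prediction
    x0_hat = (x - math.sqrt(1 - alpha_bar[t]) * eps_pred) / math.sqrt(alpha_bar[t])
    if i + 1 < len(timesteps):
        t_prev = timesteps[i + 1]
        # Jump directly to the earlier timestep, reusing the predicted noise
        x = math.sqrt(alpha_bar[t_prev]) * x0_hat + math.sqrt(1 - alpha_bar[t_prev]) * eps_pred
    else:
        x = x0_hat                               # last step: return the clean estimate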
The Noise Schedule
The noise schedule $\{\beta_t\}_{t=1}^{T}$ determines how quickly information is destroyed during the forward process, and it has a significant effect on generation quality. A schedule that adds noise too aggressively will spend most of its steps operating on near-pure noise, giving the model very little signal to learn from. A schedule that adds noise too slowly will waste many steps where almost nothing changes, requiring a longer chain (and more sampling steps at generation time) without benefit.
The linear schedule from the original DDPM paper increases $\beta_t$ linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. This is simple and effective, but it has a problem: the $\bar{\alpha}_t$ curve drops off too sharply in the middle of the chain. As we saw in the plot above, by step 600-700 the signal is nearly gone, meaning the last 300-400 steps of the forward process (and the first 300-400 steps of the reverse process) are spent in a region where $x_t$ is almost indistinguishable from pure noise. The model can't learn much from these steps because there's essentially no signal left to predict.
The cosine schedule, introduced by (Nichol & Dhariwal, 2021), addresses this by designing the schedule so that $\bar{\alpha}_t$ follows a cosine curve, spreading the information destruction more evenly across timesteps. Specifically, they define $\bar{\alpha}_t = \frac{f(t)}{f(0)}$ where $f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$ and $s = 0.008$ is a small offset that prevents $\beta_t$ from being too small near $t = 0$. The resulting curve decays more gradually — the signal persists longer into the chain, and the transition from "mostly signal" to "mostly noise" is smoother. This means the model gets useful training signal across a wider range of timesteps.
The plot below compares the two schedules. Notice how the cosine schedule preserves the signal much longer (the blue line stays higher), giving the model more steps where meaningful denoising can be learned, while still reaching near-zero by the end of the chain.
import math, json, js
T = 1000
# --- Linear schedule ---
beta_start, beta_end = 0.0001, 0.02
betas_linear = [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]
alpha_bar_linear = []
cum = 1.0
for b in betas_linear:
cum *= (1.0 - b)
alpha_bar_linear.append(cum)
# --- Cosine schedule (Nichol & Dhariwal, 2021) ---
s = 0.008
def f_cos(t):
return math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
alpha_bar_cosine = []
for t in range(T):
alpha_bar_cosine.append(f_cos(t + 1) / f_cos(0))
# Clip to avoid numerical issues
alpha_bar_cosine = [max(ab, 1e-5) for ab in alpha_bar_cosine]
timesteps = list(range(T))
plot_data = [
{
"title": "Linear vs Cosine noise schedule: alpha_bar_t over 1000 steps",
"x_label": "Timestep t",
"y_label": "alpha_bar_t (signal preserved)",
"x_data": timesteps,
"lines": [
{"label": "Linear schedule", "data": alpha_bar_linear, "color": "#ef4444"},
{"label": "Cosine schedule", "data": alpha_bar_cosine, "color": "#3b82f6"}
]
}
]
js.window.py_plot_data = json.dumps(plot_data)
# Find the timestep where alpha_bar drops below 0.1
for name, ab_list in [("Linear", alpha_bar_linear), ("Cosine", alpha_bar_cosine)]:
for t, ab in enumerate(ab_list):
if ab < 0.1:
print(f"{name}: alpha_bar drops below 0.1 at step {t}")
break
# Compare signal at t=500
print(f"\nAt t=500:")
print(f" Linear: alpha_bar = {alpha_bar_linear[499]:.4f} ({math.sqrt(alpha_bar_linear[499])*100:.1f}% signal)")
print(f" Cosine: alpha_bar = {alpha_bar_cosine[499]:.4f} ({math.sqrt(alpha_bar_cosine[499])*100:.1f}% signal)")
The choice of schedule also connects to the signal-to-noise ratio (SNR) at each timestep, defined as $\text{SNR}(t) = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$. The linear schedule creates a region in the middle where the SNR drops very rapidly, meaning the model must learn a sharp transition. The cosine schedule spreads this transition more evenly in log-SNR space, which empirically leads to better sample quality, especially for images at lower resolutions where fine details matter throughout the process.
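To see this concretely, the sketch below recomputes both schedules and prints the log-SNR, $\log\!\big(\bar{\alpha}_t / (1 - \bar{\alpha}_t)\big)$, at a few timesteps, showing where each schedule spends its transition from signal to noise.
import math
# Log signal-to-noise ratio, log(alpha_bar_t / (1 - alpha_bar_t)),
# for the linear and cosine schedules defined above.
T = 1000
betas_linear = [0.0001 + (0.02 - 0.0001) * t / (T - 1) for t in range(T)]
alpha_bar_linear = []
cum = 1.0
for b in betas_linear:
    cum *= (1.0 - b)
    alpha_bar_linear.append(cum)
s = 0.008
def f_cos(t):
    return math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
alpha_bar_cosine = [max(f_cos(t + 1) / f_cos(0), 1e-5) for t in range(T)]
def log_snr(ab):
    return math.log(ab / (1.0 - ab))
print("timestep   log-SNR (linear)   log-SNR (cosine)")
for i in [99, 299, 499, 699, 899]:
    print(f"{i + 1:8d}   {log_snr(alpha_bar_linear[i]):16.2f}   {log_snr(alpha_bar_cosine[i]):16.2f}")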
Quiz
Test your understanding of the diffusion model framework.
In the forward process, what does the cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ represent?
What does the DDPM neural network learn to predict?
Why is the closed-form formula $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ essential for training?
What is the main advantage of the cosine noise schedule over the linear schedule?