Classifier-Free Guidance

Why Do Unconditional Diffusion Models Generate Blurry Images?

An unconditional diffusion model learns to generate images from the full training distribution. If we trained on millions of photographs, the model learns to produce any plausible photograph. That sounds impressive, but it comes with a problem: the distribution is enormously broad. When the model starts denoising from random noise, it has to pick a direction — dog? landscape? portrait? — and if nothing constrains that choice, the model can hedge its bets by drifting toward a blurry average of many possible images rather than committing to a sharp, specific one.

Adding a text prompt ("a golden retriever playing in snow") narrows the distribution, but the model might still not commit strongly enough to the conditioned direction. The conditional prediction points toward dog-in-snow images, while the unconditional prior points toward the broad average of all images. If the model only follows the conditional signal at its default strength, the result can be washed out or generic.

What we need is a way to amplify the model's confidence in the text-conditioned direction: steer harder toward what the text asks for, and away from the generic average. That is exactly what guidance does. It turns a knob that controls how aggressively the denoising process follows the text prompt — the same kind of quality-vs-diversity tradeoff that temperature controls in language models.

Classifier Guidance (The Predecessor)

The first approach to guided diffusion came from (Dhariwal & Nichol, 2021) . Their paper, "Diffusion Models Beat GANs on Image Synthesis", showed that diffusion models could surpass GANs in image quality — but only with guidance. The idea: train a separate classifier $p(y|x_t)$ that can look at a noisy image $x_t$ at any timestep and predict what class $y$ it belongs to. Then use the gradient of this classifier to nudge each denoising step toward the desired class.

Formally, we want to sample from the conditional distribution $p(x_t | y)$ — images that belong to class $y$. By Bayes' rule, the score (gradient of the log-density) decomposes as:

\nabla_{x_t} \log p(x_t | y) = \nabla_{x_t} \log p(x_t) + s \cdot \nabla_{x_t} \log p(y | x_t)

Each term has a clear role:

$\nabla_{x_t} \log p(x_t)$: the unconditional score — the direction the diffusion model would denoise toward without any class information. This is just the standard denoising step.
$\nabla_{x_t} \log p(y | x_t)$: the classifier gradient — the direction that increases the classifier's confidence that the image belongs to class $y$. This is computed by backpropagating through the classifier with respect to the noisy image $x_t$.
$s$: the guidance scale — a scalar that controls how strongly the classifier pushes the denoising process. Higher $s$ means more aggressive steering toward class $y$.

When $s = 0$, we recover pure unconditional generation (the classifier is ignored). When $s = 1$, we get standard Bayesian conditioning. When $s > 1$, we over-amplify the classifier signal — producing sharper images of class $y$ at the cost of diversity. This was a breakthrough: it showed that guidance is the key ingredient for high-quality conditional generation.

But classifier guidance has a painful practical limitation: you need to train a separate classifier that works on noisy images at every noise level. Standard ImageNet classifiers only see clean images, so you need a specialised model trained on the noisy intermediate states of the diffusion process. That's extra training, extra compute, extra engineering — and it only works for class labels, not free-form text prompts.

Classifier-Free Guidance (CFG)

What if we could get the same guidance effect without any external classifier? (Ho & Salimans, 2022) proposed an elegant solution in "Classifier-Free Diffusion Guidance": use the diffusion model itself to provide both the conditional and unconditional signals.

The trick happens during training . With some probability (typically 10–20% of the time), the text condition $c$ is replaced with a null token $\varnothing$ (an empty or zero embedding). This means the same model learns to predict noise in two modes: conditioned on text ($\epsilon_\theta(x_t, c)$) and unconditioned ($\epsilon_\theta(x_t, \varnothing)$). No separate model needed — just one network that has seen both regimes.

During inference , we run the model twice per denoising step — once with the text prompt, once without — and combine the outputs:

\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, \varnothing) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))

Let's unpack each component:

$\epsilon_\theta(x_t, c)$: the noise prediction with the text condition $c$. This is the model's best guess at what noise to remove, given the text prompt.
$\epsilon_\theta(x_t, \varnothing)$: the noise prediction without any condition (the unconditional baseline). This is where the model would denoise toward if it had no text guidance at all.
$(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))$: the text direction — the difference between the conditional and unconditional predictions. This vector captures exactly what the text contributes beyond the generic unconditional prior. It is the "pure text signal".
$w$: the guidance scale (sometimes called CFG scale or guidance weight). We multiply the text direction by $w$ to amplify or attenuate it.

💡 The formula can be rearranged as $\tilde{\epsilon}_\theta = (1 - w) \cdot \epsilon_\theta(x_t, \varnothing) + w \cdot \epsilon_\theta(x_t, c)$. In this form, it's a weighted interpolation (or extrapolation when $w > 1$) between the unconditional and conditional predictions. When $w > 1$, we're extrapolating beyond the conditional prediction in the direction away from the unconditional one.

Now for the boundary analysis — what happens at different values of $w$?

$w = 0$: $\tilde{\epsilon} = \epsilon_\theta(x_t, \varnothing)$. Pure unconditional generation. The text prompt is completely ignored, and the model generates from the full training distribution — any image is equally likely.
$w = 1$: $\tilde{\epsilon} = \epsilon_\theta(x_t, c)$. Standard conditional generation with no amplification. The model uses the text exactly as-is, no more, no less. This is what you'd get without any guidance.
$w = 7.5$: the typical default for Stable Diffusion. The text direction is amplified 7.5$\times$, producing images that strongly match the prompt with good visual quality. This is the sweet spot most practitioners use.
$w > 15$: oversaturated territory. Colours become unnaturally vivid, details get exaggerated, and artifacts appear. The model "tries too hard" to match the text, pushing pixel values into extreme ranges. Quality degrades sharply.
$w < 0$: negative guidance . The text direction is reversed — the model actively steers away from the text description. This is the mechanism behind negative prompts, which we will cover in the next section.

The code below demonstrates CFG with concrete numbers. We simulate two noise predictions (conditional and unconditional) and show how different guidance scales transform the final output.

import math, json, js

# Simulated noise predictions for a single pixel/dimension
eps_uncond = 0.3   # unconditional prediction
eps_cond   = 0.7   # conditional prediction (with text)
text_dir = eps_cond - eps_uncond  # = 0.4

# Pre-compute CFG output for a range of w values (the x-axis)
w_axis = [round(i * 0.5, 1) for i in range(0, 21)]  # 0.0 to 10.0 step 0.5
guided = [eps_uncond + w * text_dir for w in w_axis]

# Slider w values (21 steps from 0 to 10)
slider_ws = [round(i * 0.5, 1) for i in range(0, 21)]
default_idx = 15  # w = 7.5

traces = []

# One trace per slider step: vertical marker showing current (w, guided) point
for i, sw in enumerate(slider_ws):
    g = eps_uncond + sw * text_dir
    traces.append({
        "x": [sw],
        "y": [round(g, 3)],
        "type": "scatter",
        "mode": "markers+text",
        "marker": {"color": "#6366f1", "size": 14, "symbol": "circle",
                   "line": {"color": "#4f46e5", "width": 2}},
        "text": ["w=" + str(sw) + "  output=" + str(round(g, 2))],
        "textposition": "top center",
        "textfont": {"size": 12, "color": "#6366f1"},
        "name": "Current (w=" + str(sw) + ")",
        "showlegend": False,
        "visible": i == default_idx
    })

n_slider_traces = len(traces)

# CFG output line (always visible)
traces.append({
    "x": w_axis,
    "y": [round(g, 3) for g in guided],
    "type": "scatter",
    "mode": "lines",
    "name": "CFG output",
    "line": {"color": "#6366f1", "width": 3},
    "visible": True
})

# Unconditional baseline (always visible)
traces.append({
    "x": w_axis,
    "y": [eps_uncond] * len(w_axis),
    "type": "scatter",
    "mode": "lines",
    "name": "Unconditional (\u03b5 uncond)",
    "line": {"color": "#94a3b8", "width": 2, "dash": "dot"},
    "visible": True
})

# Conditional baseline (always visible)
traces.append({
    "x": w_axis,
    "y": [eps_cond] * len(w_axis),
    "type": "scatter",
    "mode": "lines",
    "name": "Conditional (\u03b5 cond)",
    "line": {"color": "#f59e0b", "width": 2, "dash": "dot"},
    "visible": True
})

n_always = 3  # CFG line + uncond line + cond line

# Under-guided shading (w < 1)
traces.append({
    "x": [0, 1, 1, 0],
    "y": [-0.5, -0.5, 5, 5],
    "fill": "toself",
    "fillcolor": "rgba(251,191,36,0.07)",
    "line": {"width": 0},
    "mode": "lines",
    "name": "Under-guided (w<1)",
    "showlegend": True,
    "visible": True
})

# Amplified guidance shading (w > 1)
traces.append({
    "x": [1, 10, 10, 1],
    "y": [-0.5, -0.5, 5, 5],
    "fill": "toself",
    "fillcolor": "rgba(99,102,241,0.07)",
    "line": {"width": 0},
    "mode": "lines",
    "name": "Amplified (w>1)",
    "showlegend": True,
    "visible": True
})

n_shading = 2

# Build slider steps — toggle which marker trace is visible
steps = []
for i, sw in enumerate(slider_ws):
    vis = [False] * n_slider_traces + [True] * n_always + [True] * n_shading
    vis[i] = True
    steps.append({
        "method": "update",
        "args": [{"visible": vis}],
        "label": str(sw)
    })

layout = {
    "title": {"text": "CFG Output vs Guidance Scale w"},
    "xaxis": {"title": {"text": "Guidance Scale (w)"}, "range": [-0.3, 10.5],
              "dtick": 1},
    "yaxis": {"title": {"text": "Guided Noise Prediction"}, "range": [-0.5, 4.8]},
    "sliders": [{
        "active": default_idx,
        "pad": {"t": 50},
        "currentvalue": {"prefix": "w = ", "visible": True},
        "steps": steps
    }],
    "annotations": [
        {"x": 0, "y": eps_uncond, "text": "w=0: pure unconditional",
         "showarrow": True, "arrowhead": 2, "ax": 60, "ay": -40,
         "font": {"size": 11, "color": "#94a3b8"}},
        {"x": 1, "y": eps_cond, "text": "w=1: standard conditional",
         "showarrow": True, "arrowhead": 2, "ax": 70, "ay": -35,
         "font": {"size": 11, "color": "#f59e0b"}},
        {"x": 7.5, "y": eps_uncond + 7.5 * text_dir,
         "text": "w=7.5: typical SD default",
         "showarrow": True, "arrowhead": 2, "ax": -80, "ay": -30,
         "font": {"size": 11, "color": "#6366f1"}}
    ],
    "showlegend": True,
    "legend": {"x": 0.01, "y": 0.99}
}

js.window.py_plotly_data = json.dumps({"data": traces, "layout": layout})

print(f"Unconditional prediction: {eps_uncond}")
print(f"Conditional prediction:   {eps_cond}")
print(f"Text direction:           {text_dir}")
print()
for sw in [0, 0.5, 1.0, 3.0, 5.0, 7.5, 10.0]:
    g = eps_uncond + sw * text_dir
    label = ""
    if sw == 0:   label = " (pure unconditional)"
    if sw == 1:   label = " (standard conditional)"
    if sw == 7.5: label = " (typical SD default)"
    print(f"  w={sw:5.1f} => guided={g:.2f}{label}")

Notice how at $w = 7.5$ the guided prediction (3.3) has moved far beyond the conditional prediction (0.7) — we're extrapolating, not interpolating. This extrapolation is what produces sharp, text-faithful images, but push $w$ too high and the values explode, causing saturation artifacts.

This is also why Stable Diffusion is slow : every single denoising step requires two forward passes through the U-Net (one conditional, one unconditional). With 20–50 denoising steps, that's 40–100 neural network evaluations per image. Techniques like distillation (training a student model to approximate the CFG-guided output in a single pass) are an active area of research to reduce this cost.

💡 The analogy to temperature in language models is useful. In a language model, low temperature sharpens the probability distribution over tokens (more confident, less diverse); high temperature flattens it (more random, more diverse). In diffusion, high guidance scale $w$ sharpens the distribution over images toward the text prompt (more text-faithful, less diverse); low $w$ produces more varied but less prompt-adherent images. See the temperature article for the formal connection: the softmax temperature controls the entropy of a categorical distribution the same way $w$ controls how tightly the diffusion trajectory tracks the text signal.

Negative Prompts and How They Work

If you've used Stable Diffusion, you've probably seen a "negative prompt" field where people type things like "blurry, low quality, deformed, ugly". How does this actually work? It turns out negative prompts are a simple modification to the CFG formula — instead of using the empty null token $\varnothing$ as the unconditional baseline, we replace it with a negative condition $c_{\text{neg}}$ describing what to avoid:

\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, c_{\text{neg}}) + w \cdot (\epsilon_\theta(x_t, c_{\text{pos}}) - \epsilon_\theta(x_t, c_{\text{neg}}))

This is structurally identical to standard CFG. The only change is swapping $\varnothing$ for $c_{\text{neg}}$. But the effect is powerful: the text direction $(\epsilon_\theta(x_t, c_{\text{pos}}) - \epsilon_\theta(x_t, c_{\text{neg}}))$ now points away from the negative description and toward the positive one. Since we amplify this direction by $w$, the model is simultaneously steered toward what you want and away from what you don't.

Let's trace the boundary cases for the negative prompt formula:

$w = 0$: $\tilde{\epsilon} = \epsilon_\theta(x_t, c_{\text{neg}})$. The model generates images matching the negative prompt. This is the opposite of what you want — you'd get blurry, low-quality images.
$w = 1$: $\tilde{\epsilon} = \epsilon_\theta(x_t, c_{\text{pos}})$. Standard conditional generation with the positive prompt. The negative prompt has no effect.
$w = 7.5$: the positive prompt is amplified and the negative prompt's features are actively suppressed. This is the typical usage.

Common negative prompts include "blurry, low quality, deformed, extra fingers, watermark". These work because the diffusion model has seen such images during training and can predict what noise patterns lead toward them — so it knows which direction to steer away from . The negative prompt doesn't remove features from an image; it changes the trajectory of the entire denoising process from the very first step.

import json, js

# Simulated predictions for one dimension
eps_pos = 0.8   # conditional on positive prompt
eps_neg = 0.2   # conditional on negative prompt
eps_null = 0.4  # unconditional (null prompt)

w = 7.5

# Standard CFG (null baseline)
guided_null = eps_null + w * (eps_pos - eps_null)

# Negative prompt CFG
guided_neg = eps_neg + w * (eps_pos - eps_neg)

# Standard CFG pushes away from the generic average
# Negative prompt CFG pushes away from the negative description
rows = [
    ["Standard CFG (null baseline)", f"{eps_null}", f"{eps_pos}", f"{eps_pos - eps_null:.1f}", f"{guided_null:.1f}"],
    ["Negative prompt CFG", f"{eps_neg}", f"{eps_pos}", f"{eps_pos - eps_neg:.1f}", f"{guided_neg:.1f}"],
]

js.window.py_table_data = json.dumps({
    "headers": ["Mode", "Baseline", "Positive", "Text Direction", "Guided (w=7.5)"],
    "rows": rows
})

print("With a negative prompt, the text direction is LARGER")
print(f"  Null baseline direction:     {eps_pos - eps_null:.1f}")
print(f"  Negative prompt direction:   {eps_pos - eps_neg:.1f}")
print()
print("So the guided output is pushed further from the negative features")

💡 The negative prompt replaces the unconditional baseline, so the text direction becomes larger when the negative prompt describes features far from the positive prompt. This is why specific negative prompts ("blurry, deformed hands") work better than vague ones ("bad") — they create a larger, more targeted direction vector.

Text Encoders: CLIP vs T5

CFG controls how strongly the model follows the text prompt. But the quality of that text signal depends entirely on the text encoder — the model that converts a string of words into the embedding vector $c$ that the diffusion model conditions on. Two architectures dominate this role, and their strengths are complementary.

CLIP (Radford et al., 2021) was trained via contrastive learning on 400 million image-text pairs. The training objective pushed matching image-text pairs together in embedding space and non-matching pairs apart. This gives CLIP a strong sense of which images "go with" which descriptions. Stable Diffusion 1.x and 2.x use CLIP as their text encoder.

But CLIP has limitations. Its contrastive loss optimises for global matching (does this caption describe this image?) rather than fine-grained compositional understanding. It struggles with spatial relationships ("a red cube to the left of a blue sphere"), negation ("a room with no furniture"), and counting ("exactly three cats"). It also has a hard 77-token limit, truncating longer prompts.

T5 (Raffel et al., 2020) is a pure text-to-text transformer trained on a massive text corpus (C4). It was never trained on images, but it understands language deeply: grammar, compositionality, long-range dependencies, and complex instructions. Crucially, it has no fixed token limit for practical purposes and encodes much richer linguistic structure than CLIP.

(Saharia et al., 2022) demonstrated a striking finding in their Imagen paper: scaling the text encoder matters more than scaling the diffusion model itself. Swapping CLIP for T5-XXL (a 4.6B parameter text encoder) produced dramatically better text-image alignment, especially for complex prompts, even when the diffusion U-Net was kept the same size. The bottleneck in text-to-image generation was not the image generator — it was the text understanding.

The trend in newer architectures reflects this finding:

Stable Diffusion 1.x/2.x: CLIP only (OpenCLIP ViT-L or ViT-H). Good at matching aesthetic descriptions but weak on compositional prompts.
Stable Diffusion 3 / SD 3.5: uses three text encoders simultaneously — two CLIP models (OpenCLIP ViT-G and CLIP ViT-L) plus T5-XXL. The CLIP encoders provide image-aligned embeddings while T5 handles complex language understanding.
Flux: uses CLIP ViT-L plus T5-XXL, combining both strengths. T5 handles the heavy lifting for prompt comprehension.
Imagen / Imagen 2: T5-XXL only. Demonstrated that a powerful text encoder alone is sufficient.

For the CFG mechanism, the choice of text encoder determines the quality of the conditional embedding $c$ in $\epsilon_\theta(x_t, c)$. A better text encoder means the conditional prediction more accurately reflects the prompt, which means the text direction $(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))$ is more precise. In practice, this translates to needing lower guidance scales with better text encoders — the signal is cleaner, so less amplification is needed. Flux models, for instance, often use $w = 3.5$ instead of Stable Diffusion's $w = 7.5$.

💡 The text encoder's embedding is typically injected into the diffusion model via cross-attention layers: the noisy image features attend to the text encoder's output tokens (not just a single pooled vector) so that different spatial regions of the image can attend to different parts of the prompt. This is why token-level text representations (from T5) are more useful than a single sentence embedding.

Quiz

Test your understanding of classifier-free guidance, negative prompts, and text encoders.

In the CFG formula $\tilde{\epsilon}_\theta = \epsilon_\theta(x_t, \varnothing) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing))$, what happens when $w = 1$?

Pure unconditional generation — the text prompt is completely ignored

Standard conditional generation — equivalent to using the model without any guidance

The model generates images that avoid the text description

Maximum quality — this is the optimal setting for most applications

Why does classifier-free guidance require two forward passes per denoising step?

One pass generates the image and the other evaluates its quality

One pass computes the conditional prediction and the other computes the unconditional prediction, then they are combined

One pass runs the text encoder and the other runs the image decoder

The first pass initialises the noise and the second pass denoises it

How do negative prompts work in Stable Diffusion?

They filter out generated images that match the negative description in a post-processing step

They add a penalty term to the training loss function

They replace the null unconditional baseline in the CFG formula, so the text direction points away from the negative description

They mask out attention heads that correspond to the negative concepts

According to the Imagen paper (Saharia et al., 2022), what has more impact on text-to-image quality?

Increasing the number of denoising steps

Using a larger guidance scale

Scaling up the text encoder (e.g., from CLIP to T5-XXL)

Scaling up the diffusion U-Net