Why Not Diffuse in Pixel Space?
A 512×512 RGB image is a tensor of shape $512 \times 512 \times 3 = 786{,}432$ values. At every single denoising step, the diffusion model must process all of them: feed them in, predict the noise, subtract it, and repeat. How bad does this get? At 1000 denoising steps (a typical DDPM schedule), that is 786 million value-level operations just for one image. Scale the resolution to 1024×1024 and we are at $1024 \times 1024 \times 3 = 3{,}145{,}728$ values per step — over 3 billion operations across the full chain.
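A quick sanity check on those numbers, using nothing but the arithmetic above (1000 steps is just the typical DDPM schedule; real implementations perform far more than one operation per value, so treat these as lower bounds):

# Values per step and total value-visits across a full denoising chain
steps = 1000                                # typical DDPM schedule
values_512 = 512 * 512 * 3
values_1024 = 1024 * 1024 * 3
print(f"512x512:   {values_512:,} values/step, {values_512 * steps:,} across {steps} steps")
print(f"1024x1024: {values_1024:,} values/step, {values_1024 * steps:,} across {steps} steps")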
But here is the key insight: most pixel-level detail is perceptually redundant. Neighboring pixels in a photograph are highly correlated — a patch of blue sky does not carry 64 independent pieces of information just because it spans 64 pixels. The high-frequency texture details (exact noise grain, subtle color gradients) that consume most of those 786K values contribute very little to what we actually perceive as the "content" of the image.
This is the same principle behind JPEG compression: an image can be compressed to a fraction of its raw size with negligible perceptual loss, precisely because pixel space is massively overcomplete. So what if we first compressed the image to a much smaller representation, ran the entire diffusion process there, and then decompressed the result back to pixels? That is exactly what latent diffusion does.
The Variational Autoencoder (VAE)
The compression engine behind latent diffusion (Rombach et al., 2022) is a Variational Autoencoder (VAE). An autoencoder is a neural network with two halves: an encoder that compresses an input into a compact internal representation (the "bottleneck"), and a decoder that reconstructs the original input from that compact form. The "variational" part means the encoder does not output a single fixed vector but rather the parameters of a probability distribution (a mean and variance), from which we sample the bottleneck representation. This makes the latent space smooth and continuous — nearby points decode to similar images — which is crucial for diffusion to work well.
The VAE has two components. The encoder $\mathcal{E}$ maps an image to a latent representation:

$$z = \mathcal{E}(x)$$
For a 512×512 input image, the encoder typically produces a $64 \times 64 \times 4$ latent — spatial dimensions reduced by 8× in each direction, with 4 latent channels instead of 3 color channels. The decoder $\mathcal{D}$ inverts this:

$$\hat{x} = \mathcal{D}(z) \approx x$$
The compression ratio is dramatic. The original image has $512 \times 512 \times 3 = 786{,}432$ values. The latent has $64 \times 64 \times 4 = 16{,}384$ values. That is a 48× reduction. The diffusion model now operates on 16K values per step instead of 786K.
# Compression ratio: pixel space vs latent space
pixel_h, pixel_w, pixel_c = 512, 512, 3
latent_h, latent_w, latent_c = 64, 64, 4
pixel_values = pixel_h * pixel_w * pixel_c
latent_values = latent_h * latent_w * latent_c
ratio = pixel_values / latent_values
print(f"Pixel space: {pixel_h}x{pixel_w}x{pixel_c} = {pixel_values:,} values")
print(f"Latent space: {latent_h}x{latent_w}x{latent_c} = {latent_values:,} values")
print(f"Compression ratio: {ratio:.0f}x fewer values")
print()
# At 1024x1024
pixel_1024 = 1024 * 1024 * 3
latent_1024 = 128 * 128 * 4
ratio_1024 = pixel_1024 / latent_1024
print(f"At 1024x1024:")
print(f" Pixel space: {pixel_1024:,} values")
print(f" Latent space: {latent_1024:,} values")
print(f" Compression ratio: {ratio_1024:.0f}x")
The VAE is trained with a combined loss that balances two objectives:

$$\mathcal{L}_{\text{VAE}} = \|x - \mathcal{D}(\mathcal{E}(x))\|^2 + \lambda \, D_{\text{KL}}\big(q(z|x) \,\|\, p(z)\big)$$
The first term is the reconstruction loss: the mean squared error between the original image $x$ and the decoded reconstruction $\mathcal{D}(\mathcal{E}(x))$. This is what forces the autoencoder to actually preserve information. If this term is zero, the decoder perfectly reconstructs every pixel. If it is large, the bottleneck is too aggressive and loses critical detail.
The second term is the KL divergence regularizer (see information theory for a deeper treatment of KL divergence). Here $q(z|x)$ is the distribution the encoder outputs (a Gaussian with learned mean and variance for each input $x$), and $p(z)$ is a standard Gaussian prior $\mathcal{N}(0, I)$. This term penalises the encoder for producing latent distributions that deviate from a standard Gaussian.
Why does this matter? Without KL regularization, the encoder can "cheat" by spreading latent codes far apart in arbitrary regions of the space, leaving vast dead zones where decoding produces garbage. The KL term forces the latent space to be compact and Gaussian-shaped, so that any point sampled from $\mathcal{N}(0, I)$ decodes to something reasonable. Let's check the boundary behavior: when $q(z|x)$ exactly equals $p(z)$ (encoder outputs a standard Gaussian regardless of input), $D_{\text{KL}} = 0$ — no penalty, but also no information about $x$ is encoded, so reconstruction is terrible. When $q(z|x)$ is a very narrow spike far from zero (encoder memorises each input as a unique point), $D_{\text{KL}}$ grows large — heavy penalty. The optimum lies between: encode enough to reconstruct, but stay close to Gaussian.
In practice, the weight $\lambda$ is set very small (around $10^{-6}$). Perceptual reconstruction quality matters far more than perfect Gaussianity — the goal is a useful image compressor, not a generative model in the VAE sense. Stable Diffusion's VAE also adds a perceptual loss (comparing VGG features of real vs. reconstructed images) and an adversarial loss (a discriminator that penalises blurry reconstructions), but the formula above captures the core idea.
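To make the boundary behaviour above concrete, here is a small numeric check using the closed-form KL divergence between a one-dimensional Gaussian $\mathcal{N}(\mu, \sigma^2)$ and the standard Gaussian prior. The specific $\mu$ and $\sigma$ values are purely illustrative, not outputs of a real encoder:

# Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a single latent dimension
import math

def kl_to_standard_normal(mu, sigma):
    return 0.5 * (sigma**2 + mu**2 - 1.0 - 2.0 * math.log(sigma))

# Encoder ignores the input and outputs the prior itself: zero penalty, nothing encoded
print(f"q = N(0, 1):       KL = {kl_to_standard_normal(0.0, 1.0):.4f}")
# Encoder memorises the input as a narrow spike far from zero: heavy penalty
print(f"q = N(5, 0.01^2):  KL = {kl_to_standard_normal(5.0, 0.01):.2f}")
# A useful middle ground: informative about x, but still close to the prior
print(f"q = N(0.5, 0.8^2): KL = {kl_to_standard_normal(0.5, 0.8):.4f}")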
Diffusion in Latent Space
With a trained VAE in hand, the entire diffusion process moves from pixel space to latent space. Instead of working with $x_0 \in \mathbb{R}^{512 \times 512 \times 3}$, we encode the training image and work with its latent:

$$z_0 = \mathcal{E}(x_0)$$
The forward process adds Gaussian noise to $z_0$ using the exact same mathematics as standard diffusion (covered in article 1), just applied to the latent rather than the pixel tensor:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
The noise schedule $\bar{\alpha}_t$ works identically: at $t=0$, $\bar{\alpha}_0 \approx 1$ so $z_0$ is nearly clean; at $t=T$, $\bar{\alpha}_T \approx 0$ so $z_T$ is nearly pure noise. The model $\epsilon_\theta(z_t, t)$ is trained to predict the noise $\epsilon$ that was added, using the same MSE objective:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}\left[ \|\epsilon - \epsilon_\theta(z_t, t)\|^2 \right]$$
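A schematic training step makes the objective concrete. The sketch below is in the same spirit as the inference pseudocode later in this article: `vae`, `unet`, and `alpha_bar` are placeholders for a trained (frozen) autoencoder, the denoising network, and the precomputed cumulative noise schedule.

# Latent diffusion training step (schematic sketch, PyTorch-style)
import torch
import torch.nn.functional as F

def training_step(x0, vae, unet, alpha_bar, T=1000):
    # 1. Encode the image into latent space (the VAE is frozen during this stage)
    z0 = vae.encode(x0)                                   # e.g. (B, 4, 64, 64)
    # 2. Sample a random timestep and Gaussian noise
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    # 3. Forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
    abar = alpha_bar[t].view(-1, 1, 1, 1)
    z_t = abar.sqrt() * z0 + (1 - abar).sqrt() * eps
    # 4. Predict the added noise and take the MSE against the true noise
    eps_pred = unet(z_t, t)
    return F.mse_loss(eps_pred, eps)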
At inference time, we start from pure noise $z_T \sim \mathcal{N}(0, I)$ in the latent space, iteratively denoise to recover $z_0$, then decode back to pixels:

$$\hat{x} = \mathcal{D}(z_0)$$
The benefits of working in latent space go beyond raw speed:
- 48× fewer values: every forward pass through the denoising network is dramatically cheaper. Training converges faster and sampling requires less compute per step.
- Smoother, more semantic space: the VAE encoder strips away perceptually redundant pixel correlations. The latent space is more "meaning-dense" — small movements correspond to meaningful visual changes rather than imperceptible pixel shifts.
- Division of labor: the VAE handles low-level perceptual details (textures, fine patterns, exact pixel values) while the diffusion model focuses on high-level semantic structure (composition, objects, style). Each component does what it is best at.
The U-Net Denoiser
The neural network $\epsilon_\theta$ that predicts noise in latent space is, in the original Stable Diffusion, a U-Net. The U-Net was originally designed for medical image segmentation (Ronneberger et al., 2015), but its structure turns out to be ideal for denoising. It is an encoder-decoder architecture with skip connections.
The encoder path progressively downsamples the spatial dimensions through a series of residual convolution blocks followed by downsampling operations. A $64 \times 64$ latent might pass through resolutions $64 \to 32 \to 16 \to 8$. As spatial resolution shrinks, the channel count grows, so the network captures increasingly global, abstract features at each level.
The decoder path reverses this, progressively upsampling back to the original latent resolution: $8 \to 16 \to 32 \to 64$. But upsampling alone would lose fine spatial detail. That is where skip connections come in: at each scale, the encoder's feature maps are concatenated with the decoder's feature maps. The encoder says "here is what the fine detail looks like at this resolution" and the decoder uses that alongside its upsampled global context to produce a refined output.
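To see the skip-connection mechanics in isolation, here is a deliberately tiny PyTorch sketch: two downsampling stages, two upsampling stages, and concatenation of encoder features back into the decoder. Channel counts and layer choices are illustrative and far simpler than Stable Diffusion's actual U-Net:

# Minimal U-Net-style encoder-decoder with skip connections (illustrative only)
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=4, base=64):
        super().__init__()
        # Encoder path: resolution shrinks, channel count grows
        self.enc1 = nn.Conv2d(in_ch, base, 3, padding=1)                   # 64x64
        self.enc2 = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)      # -> 32x32
        self.enc3 = nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1)  # -> 16x16
        # Decoder path: upsample and fuse encoder features via skip connections
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1)  # -> 32x32
        self.dec2 = nn.Conv2d(base * 4, base * 2, 3, padding=1)            # concat doubles channels
        self.up1 = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)      # -> 64x64
        self.dec1 = nn.Conv2d(base * 2, base, 3, padding=1)
        self.out = nn.Conv2d(base, in_ch, 3, padding=1)                    # predict noise, same shape as input

    def forward(self, z):
        e1 = torch.relu(self.enc1(z))                                   # fine detail at 64x64
        e2 = torch.relu(self.enc2(e1))
        e3 = torch.relu(self.enc3(e2))                                  # global context at 16x16
        d2 = torch.relu(self.dec2(torch.cat([torch.relu(self.up2(e3)), e2], dim=1)))  # skip at 32x32
        d1 = torch.relu(self.dec1(torch.cat([torch.relu(self.up1(d2)), e1], dim=1)))  # skip at 64x64
        return self.out(d1)

noise_pred = TinyUNet()(torch.randn(1, 4, 64, 64))  # -> (1, 4, 64, 64)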
Two additional mechanisms are critical. First, timestep conditioning : the model must know which timestep $t$ it is currently denoising, because the noise level (and therefore the denoising strategy) differs dramatically between $t=1000$ (nearly pure noise, focus on global structure) and $t=10$ (nearly clean, focus on fine detail). The timestep is encoded using a sinusoidal embedding (similar to positional encoding in transformers), projected through an MLP, and added to each residual block.
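The sinusoidal embedding itself is only a few lines. A sketch following the common transformer-style convention (the embedding dimension of 320 and frequency base of 10,000 are illustrative, not guaranteed to match Stable Diffusion's exact values):

# Sinusoidal timestep embedding (illustrative)
import math
import torch

def timestep_embedding(t, dim=320, max_period=10000):
    # Half the dimensions get sine, half get cosine, at geometrically spaced frequencies
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]                     # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, dim)

emb = timestep_embedding(torch.tensor([10, 500, 1000]))  # three very different noise levels
# In the U-Net, this embedding is projected by an MLP and added inside each residual block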
Second, self-attention layers are inserted at certain resolutions (typically $32 \times 32$, $16 \times 16$, and $8 \times 8$). Convolutions are local operations — each output pixel only sees a small neighborhood. Self-attention lets every spatial position attend to every other, giving the model global context. This is essential for coherent structure: ensuring that a face has two eyes, that a building's windows are evenly spaced, that the overall composition is consistent.
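Schematically, self-attention at a given resolution flattens the spatial grid into a sequence so that every position can attend to every other. A minimal PyTorch sketch with illustrative layer sizes:

# Spatial self-attention over a feature map (illustrative sizes)
import torch
import torch.nn as nn

B, C, H, W = 1, 320, 16, 16                    # feature map at the 16x16 resolution
feats = torch.randn(B, C, H, W)

# Flatten the spatial grid into a sequence of H*W tokens of dimension C
tokens = feats.flatten(2).transpose(1, 2)      # (B, H*W, C) = (1, 256, 320)

attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)          # every position attends to all 256 positions

out = out.transpose(1, 2).reshape(B, C, H, W)  # back to a spatial feature map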
Text Conditioning via Cross-Attention
So far, the denoiser generates images unconditionally — it can denoise, but it has no idea what we want it to produce. To generate images from text prompts, we need a way to inject language information into the U-Net. The mechanism is cross-attention (see encoder-decoder attention for the general concept).
First, the text prompt is processed by a separate, frozen text encoder (like CLIP or T5) that converts the string into a sequence of embedding vectors $\tau \in \mathbb{R}^{L \times d_{\text{text}}}$, where $L$ is the number of tokens and $d_{\text{text}}$ is the embedding dimension. These text embeddings are then injected into the U-Net at multiple resolutions via cross-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
where $Q = W_Q \cdot \phi(z_t)$ comes from the image features (the U-Net's intermediate activations at a given layer), and $K = W_K \cdot \tau$ and $V = W_V \cdot \tau$ come from the text embeddings. The $\sqrt{d_k}$ denominator prevents the dot products from growing too large as the dimension $d_k$ increases: without it, for high $d_k$ the dot products would be so large that softmax saturates to near-one-hot vectors, killing gradient flow. With the scaling, the entries of $QK^T / \sqrt{d_k}$ keep a variance of approximately 1 regardless of $d_k$.
What this means intuitively: each spatial position in the image features queries the text description, asking "which words are relevant to what I should generate here?" A spatial position corresponding to the sky region will attend strongly to the word "sunset"; a position near a face will attend to "portrait" or "woman". The model learns which text tokens matter for which spatial locations.
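This is the same machinery as self-attention, except the queries come from the image side and the keys/values come from the text side. A minimal sketch with illustrative dimensions (77 tokens and 768-dim text embeddings as in SD 1.5; the attention dimension of 320 is arbitrary here):

# Cross-attention: image features query the text embeddings (illustrative sizes)
import torch
import torch.nn as nn

B, C, H, W = 1, 320, 32, 32                    # U-Net activations at a 32x32 resolution
L_tokens, d_text = 77, 768                     # token count and embedding size (SD 1.5)

image_feats = torch.randn(B, C, H, W).flatten(2).transpose(1, 2)  # (B, 1024, 320)
text_emb = torch.randn(B, L_tokens, d_text)                       # frozen text encoder output

d_k = 320
W_Q = nn.Linear(C, d_k, bias=False)            # Q from image features
W_K = nn.Linear(d_text, d_k, bias=False)       # K from text embeddings
W_V = nn.Linear(d_text, d_k, bias=False)       # V from text embeddings

Q, K, V = W_Q(image_feats), W_K(text_emb), W_V(text_emb)
scores = Q @ K.transpose(1, 2) / d_k**0.5      # (B, 1024, 77): each position scores each token
weights = scores.softmax(dim=-1)               # which words matter for each spatial position
out = weights @ V                              # (B, 1024, 320), folded back into the U-Net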
Cross-attention layers appear at multiple resolutions in the U-Net (typically alongside the self-attention layers at $32 \times 32$, $16 \times 16$, and $8 \times 8$). Lower resolutions capture coarse text-to-layout alignment ("a cat on the left, a dog on the right"), while higher resolutions capture fine-grained detail ("blue eyes", "striped fur").
The choice of text encoder matters enormously for prompt understanding. Different versions of Stable Diffusion use different text encoders:
- SD 1.x: CLIP ViT-L/14 — 77 token maximum context, 768-dimensional embeddings. The original, but limited in vocabulary and context length.
- SD 2.x: OpenCLIP ViT-H/14 — 1024-dimensional embeddings, trained on a larger dataset. Better text understanding, but the community found it harder to prompt due to differences in training data.
- SDXL: dual text encoders — CLIP ViT-L and OpenCLIP ViT-bigG, with their outputs concatenated to produce 2048-dimensional embeddings (Podell et al., 2023). Two encoders capture complementary aspects of the text: one trained on curated image-text pairs (CLIP), the other on a broader dataset (OpenCLIP).
import json, js
versions = [
("SD 1.5", "768", "77", "1"),
("SD 2.1", "1024", "77", "1"),
("SDXL", "2048", "77", "2"),
]
js.window.py_table_data = json.dumps({
"headers": ["Version", "Embed Dim", "Max Tokens", "Encoders"],
"rows": [list(v) for v in versions]
})
print("SDXL's 2048-dim conditioning = 2.7x richer than SD 1.5's 768-dim.")
Stable Diffusion: The Full Architecture
Putting everything together, the Stable Diffusion pipeline consists of three independently trained components working in sequence:
- Text encoder (frozen): converts the text prompt into a sequence of embedding vectors. Trained separately (e.g., CLIP was trained on 400M image-text pairs via contrastive learning).
- U-Net denoiser (the core model): takes a noisy latent $z_t$ and timestep $t$, receives text embeddings via cross-attention, and predicts the noise $\epsilon$. This is the only component that is trained during the latent diffusion training process.
- VAE decoder (frozen): converts the final denoised latent $z_0$ back to a pixel image. Trained separately as described above.
The inference pipeline flows like this:
# Stable Diffusion inference (pseudocode)
# Assumes the three pretrained components (text_encoder, unet, vae) and a
# scheduler are already loaded; exact call signatures vary by library.
import torch

# 1. Encode text prompt with the frozen text encoder
text_embeddings = text_encoder(prompt)          # shape: (77, 768) for SD 1.5

# 2. Start from pure noise in latent space
z_t = torch.randn(1, 4, 64, 64)                 # random latent for 512x512 output

# 3. Iterative denoising (e.g., 50 steps with a DDIM scheduler)
for t in scheduler.timesteps:                   # high noise -> low noise: T, ..., 1
    noise_pred = unet(z_t, t, text_embeddings)  # predict the noise at this step
    z_t = scheduler.step(noise_pred, t, z_t)    # remove a portion of the predicted noise

# 4. Decode the final latent to a pixel image
image = vae.decode(z_t)                         # shape: (1, 3, 512, 512)
Notice how the three components have completely different parameter counts and training procedures, yet work together at inference:
import json, js
models = [
("SD 1.5", 860, 123, 84, "512x512", "~4 GB"),
("SD 2.1", 865, 354, 84, "768x768", "~5 GB"),
("SDXL", 2600, 817, 84, "1024x1024", "~7 GB"),
]
rows = []
for name, unet, text, vae, res, vram in models:
total = unet + text + vae
rows.append([name, f"{unet:,}", f"{text:,}", f"{vae}", f"{total:,}", res, vram])
js.window.py_table_data = json.dumps({
"headers": ["Model", "U-Net (M)", "Text (M)", "VAE (M)", "Total (M)", "Native Res", "VRAM FP16"],
"rows": rows
})
print("SD 1.5 runs on 8 GB consumer GPUs — this democratized image generation.")
Why did Stable Diffusion democratize image generation? Before it, text-to-image models like DALL-E 2 were closed-source and API-only. Stable Diffusion was released with open-source weights, and because latent diffusion reduced the compute requirements so dramatically, the model could run on a consumer GPU with just 8 GB of VRAM. This enabled an explosion of community development: custom fine-tuned models, LoRA adaptations, ControlNet for structural conditioning, and entirely new interfaces and workflows — none of which would have been possible with a closed, pixel-space diffusion model demanding datacenter-scale compute.
Quiz
Test your understanding of latent diffusion and the VAE.
Why does latent diffusion operate in a compressed latent space rather than directly on pixels?
In the VAE loss $\mathcal{L}_{\text{VAE}} = \|x - \mathcal{D}(\mathcal{E}(x))\|^2 + \lambda \, D_{\text{KL}}(q(z|x) \| p(z))$, why is the KL weight $\lambda$ set very small (around $10^{-6}$)?
In cross-attention for text conditioning, what serves as the queries (Q) and what serves as the keys/values (K, V)?
Which three components of Stable Diffusion are trained independently?