What Changed Between SDXL and SD3?

SDXL was the peak of the U-Net era. It used a convolutional U-Net as its denoiser, a DDPM noise schedule to control the forward and reverse diffusion process, two CLIP text encoders (CLIP-L and OpenCLIP-G) for text understanding, and cross-attention layers to inject text conditioning into the image pathway. Text tokens served as keys and values, image tokens served as queries, and these cross-attention layers were inserted periodically throughout the U-Net. Between cross-attention layers, image features were processed on their own by convolutions.

Stable Diffusion 3 (Esser et al., 2024) changed almost everything. The U-Net was replaced by a Multimodal Diffusion Transformer (MMDiT). The DDPM noise schedule was replaced by rectified flow matching (straight-line paths from noise to data instead of the curved DDPM trajectories). The dual text encoders were expanded to triple text encoders: CLIP-L, CLIP-G, and T5-XXL. And cross-attention was replaced by joint multimodal attention, where text and image tokens enter the same self-attention computation at every layer.

The only constant: the VAE. The convolutional autoencoder that compresses pixel-space images into latents and decodes them back remained the same general design (and even that was improved — SD3 shipped a higher-quality VAE with 16 latent channels instead of 4). Everything else was rebuilt from scratch.

import json, js

# What changed from SDXL to SD3
components = [
    ("Denoiser",        "U-Net (conv)",         "MMDiT (transformer)"),
    ("Noise schedule",  "DDPM (curved)",        "Rectified flow (straight)"),
    ("Text encoders",   "CLIP-L + OpenCLIP-G",  "CLIP-L + CLIP-G + T5-XXL"),
    ("Text conditioning","Cross-attention",      "Joint multimodal attention"),
    ("VAE channels",    "4",                     "16"),
    ("Param count",     "~3.5B (U-Net)",        "2B / 8B (MMDiT)"),
]

rows = []
for comp, sdxl, sd3 in components:
    rows.append([comp, sdxl, sd3])

js.window.py_table_data = json.dumps({
    "headers": ["Component", "SDXL", "SD3"],
    "rows": rows
})

print("Almost nothing survived the transition unchanged.")

MMDiT: Multimodal Diffusion Transformer

The previous article covered the Diffusion Transformer (DiT), which replaced the U-Net with a transformer but was only class-conditional (it generated ImageNet classes, not arbitrary text descriptions). MMDiT (Esser et al., 2024) solves text conditioning with a single, elegant idea: put the text tokens and image tokens into the same attention computation. Not through cross-attention, where text appears only as keys/values. Not through a separate pathway. Into the same sequence, through the same softmax.

Here is exactly what happens in each MMDiT block. The network receives two streams of tokens: text tokens $x_{\text{txt}} \in \mathbb{R}^{N_{\text{txt}} \times d_{\text{txt}}}$ and image tokens $x_{\text{img}} \in \mathbb{R}^{N_{\text{img}} \times d_{\text{img}}}$, where $N_{\text{txt}}$ and $N_{\text{img}}$ are the respective sequence lengths and $d_{\text{txt}}$, $d_{\text{img}}$ are their feature dimensions (which may differ because text comes from a language encoder and images come from a VAE).

Step 1: Separate QKV projections. Each modality has its own learned projection matrices that map into a shared attention dimension $d_k$:

$$Q_{\text{txt}} = x_{\text{txt}} W_Q^{\text{txt}}, \quad K_{\text{txt}} = x_{\text{txt}} W_K^{\text{txt}}, \quad V_{\text{txt}} = x_{\text{txt}} W_V^{\text{txt}}$$
$$Q_{\text{img}} = x_{\text{img}} W_Q^{\text{img}}, \quad K_{\text{img}} = x_{\text{img}} W_K^{\text{img}}, \quad V_{\text{img}} = x_{\text{img}} W_V^{\text{img}}$$

Why separate projections? Text features (from T5-XXL, dimension 4096) and image features (from the VAE, dimension dependent on the model) live in different representation spaces. Separate $W_Q$, $W_K$, $W_V$ matrices let each modality map into a shared attention space of dimension $d_k$ in its own way. After projection, $Q_{\text{txt}}$, $K_{\text{txt}}$, $V_{\text{txt}}$, $Q_{\text{img}}$, $K_{\text{img}}$, and $V_{\text{img}}$ all have the same dimensionality.

Step 2: Concatenate. The queries, keys, and values from both modalities are concatenated along the sequence dimension:

$$Q = [Q_{\text{txt}}; \; Q_{\text{img}}], \quad K = [K_{\text{txt}}; \; K_{\text{img}}], \quad V = [V_{\text{txt}}; \; V_{\text{img}}]$$

If the text has $N_{\text{txt}} = 77$ tokens and the image has $N_{\text{img}} = 1024$ tokens (a 64x64 latent with 2x2 patches), the concatenated sequence has $N = 77 + 1024 = 1101$ tokens. The attention cost is $\mathcal{O}(N^2 \cdot d_k)$, so this is more expensive than attending only to image tokens ($1024^2$) — but the joint computation lets every token attend to every other token regardless of modality.
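
A quick sanity check of that arithmetic:

# Sanity-check the joint-sequence arithmetic above
n_txt = 77                   # text tokens
n_img = (64 // 2) ** 2       # 64x64 latent, 2x2 patches -> 1024 image tokens
n = n_txt + n_img

print(f"Joint sequence length:        {n}")
print(f"Attention matrix entries:     {n * n:,}")
print(f"Cost vs image-only attention: {n**2 / n_img**2:.2f}x")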

Step 3: Standard self-attention. Run the usual scaled dot-product attention on the concatenated sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The attention matrix has shape $(N_{\text{txt}} + N_{\text{img}}) \times (N_{\text{txt}} + N_{\text{img}})$. Every image token attends to every text token, and every text token attends to every image token. There is no masking — both modalities see each other fully. The $\sqrt{d_k}$ normalisation matters here: dot products of $d_k$-dimensional vectors grow in magnitude with $d_k$, so without it a large projection dimension would produce extreme logits, saturating the softmax and starving the gradients. Dividing by $\sqrt{d_k}$ keeps the logit scale roughly constant regardless of the projection dimension.
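
The effect of the scaling is easy to verify numerically. This small experiment (illustrative, not SD3 code) draws random vectors and measures the logit spread with and without the $\sqrt{d_k}$ factor:

import numpy as np

# Dot products of random d_k-dimensional vectors grow like sqrt(d_k);
# the 1/sqrt(d_k) factor keeps the logit scale constant across dimensions.
rng = np.random.default_rng(0)
for d_k in (16, 256, 4096):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    logits = (q * k).sum(axis=-1)
    print(f"d_k={d_k:5d}  raw logit std={logits.std():7.2f}  "
          f"scaled std={(logits / np.sqrt(d_k)).std():.2f}")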

Step 4: Split and process separately. The attention output is split back into text and image components along the sequence dimension. Each component then passes through its own separate MLP (feed-forward network), since the two modalities may need different post-attention transformations.
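
Putting the four steps together: here is a minimal NumPy sketch of joint attention. All dimensions are illustrative, and the real blocks add multi-head attention, AdaLN modulation, residual connections, and the per-modality MLPs, but the information flow is the same:

import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative dimensions (real SD3 sizes differ)
n_txt, d_txt = 77, 4096      # T5-style text tokens
n_img, d_img = 1024, 1536    # patchified latent tokens
d_k = 1536                   # shared attention dimension

# Step 1: separate QKV projection matrices per modality
W_txt = {m: rng.standard_normal((d_txt, d_k)) * 0.02 for m in "QKV"}
W_img = {m: rng.standard_normal((d_img, d_k)) * 0.02 for m in "QKV"}

x_txt = rng.standard_normal((n_txt, d_txt))
x_img = rng.standard_normal((n_img, d_img))

# Step 2: concatenate along the sequence dimension
Q = np.concatenate([x_txt @ W_txt["Q"], x_img @ W_img["Q"]])
K = np.concatenate([x_txt @ W_txt["K"], x_img @ W_img["K"]])
V = np.concatenate([x_txt @ W_txt["V"], x_img @ W_img["V"]])

# Step 3: one softmax over the joint (N_txt + N_img) sequence, no mask
attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # shape (1101, d_k)

# Step 4: split back; each modality then gets its own MLP (omitted)
out_txt, out_img = attn[:n_txt], attn[n_txt:]
print(attn.shape, out_txt.shape, out_img.shape)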

This design has a crucial consequence that is easy to miss: the text tokens get updated by attending to image tokens. In the cross-attention approach used by SD 1.x / 2.x / XL, the text encoder runs once, produces a fixed embedding, and that embedding is used unchanged throughout the entire denoising process. The text representation is static. In MMDiT, the text representation evolves during denoising. At each layer, text tokens absorb information from image tokens and image tokens absorb information from text tokens. By the final layer, the text representation has been shaped by the image content and vice versa. This bidirectional integration at every layer is what distinguishes MMDiT from simpler conditioning approaches.

💡 In cross-attention (SDXL), text tokens form the keys and values but never appear as queries — they provide information but never receive it. In MMDiT, text tokens are both queries AND keys/values. They attend to image tokens and get modified by what they see. This means the model can, for example, learn that the word "red" in the prompt should strengthen its association with a particular region of the image as denoising progresses.

The SD3 family ships in several sizes: SD3-Medium (2B parameters), SD3.5-Large (8B parameters), and SD3.5-Large-Turbo (a distilled variant of SD3.5-Large optimised for fewer denoising steps). The scaling from 2B to 8B follows the same log-linear improvement pattern that DiT demonstrated: more parameters in the transformer systematically improve image quality.

Triple Text Encoders

SD 1.5 used one text encoder (CLIP-L). SDXL used two (CLIP-L + OpenCLIP-G). SD3 uses three: CLIP ViT-L/14, CLIP ViT-bigG, and T5-XXL. Why keep adding encoders? Because each contributes something different.

CLIP ViT-L/14 produces 768-dimensional embeddings. It was trained with contrastive image-text matching on hundreds of millions of image-caption pairs. Its strength is aligning visual concepts with short descriptive phrases — it understands "golden retriever", "sunset over ocean", and "oil painting" well because these are the kinds of captions it was trained on.

CLIP ViT-bigG produces 1280-dimensional embeddings. It is a larger CLIP model trained on broader data. It captures a wider range of visual concepts and handles more nuanced descriptions than CLIP-L, but both CLIP models share a fundamental limitation: they produce a pooled embedding (a single vector summarising the entire prompt) and have a 77-token context window. Long, complex prompts get truncated and compressed into a single vector.

T5-XXL produces 4096-dimensional embeddings and operates entirely differently. It is a pure language model, not trained on images at all (SD3 uses the encoder half of the T5-XXL encoder-decoder transformer, roughly 4.7B parameters). What it brings is deep language understanding: compositional reasoning ("a red cube on top of a blue sphere"), spatial relationships ("to the left of"), negation ("without glasses"), counting ("three cats"), and long-prompt comprehension (its context window is 512 tokens, not 77). Crucially, T5 produces sequential embeddings — one vector per token, preserving the structure of the prompt.

How are these three encodings used? They serve two different conditioning pathways:

  • Pooled CLIP embeddings -> AdaLN conditioning. The pooled vectors from both CLIP models are concatenated into a single vector $c_{\text{pool}} \in \mathbb{R}^{2048}$ (768 + 1280). This vector is fed into AdaLN-Zero, the same mechanism DiT uses for timestep conditioning. It modulates the scale and shift of layer normalisation in every MMDiT block. This provides a global conditioning signal — the overall style, mood, and visual concept of the prompt.
  • T5 token embeddings -> sequence tokens in MMDiT. The per-token embeddings from T5-XXL are projected to the MMDiT's hidden dimension and used as the text tokens in the joint attention computation. These are the tokens that get concatenated with image tokens, allowing the model to attend to specific words and phrases at specific positions in the prompt. This provides fine-grained conditioning — which object goes where, what colour it is, and how it relates to other objects.

This dual pathway gives SD3 both visual alignment (from CLIP, which was trained to match images and text) and deep language understanding (from T5, which was trained on massive text corpora). CLIP tells the model what a "sunset" looks like. T5 tells the model that "a cat sitting on a mat beside two dogs" means exactly one cat, one mat, and two dogs, with specific spatial arrangements.
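
The two pathways are easy to sketch. Below is a simplified illustration with random matrices standing in for the learned projections (the real AdaLN-Zero also folds in the timestep embedding and adds a gating term):

import numpy as np

rng = np.random.default_rng(0)
d_model = 1536                               # illustrative MMDiT hidden size

# Pathway 1: one pooled 2048-dim vector for the whole prompt
clip_l_pooled = rng.standard_normal(768)
clip_g_pooled = rng.standard_normal(1280)
c_pool = np.concatenate([clip_l_pooled, clip_g_pooled])   # (2048,)

# AdaLN-style modulation: project the pooled vector to scale/shift
W_mod = rng.standard_normal((2048, 2 * d_model)) * 0.02
scale, shift = np.split(c_pool @ W_mod, 2)

x = rng.standard_normal((1024, d_model))     # image tokens
h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
h = h * (1 + scale) + shift                  # global conditioning, every block

# Pathway 2: per-token T5 embeddings become joint-attention tokens
t5_tokens = rng.standard_normal((77, 4096))
W_in = rng.standard_normal((4096, d_model)) * 0.02
text_tokens = t5_tokens @ W_in               # enters the joint attention
print(h.shape, text_tokens.shape)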

import json, js

# SD3 text encoder comparison
encoders = [
    ("CLIP ViT-L/14",  768,  77,  "Pooled (1 vector)",   "Visual concepts, short captions"),
    ("CLIP ViT-bigG",  1280, 77,  "Pooled (1 vector)",   "Broader visual understanding"),
    ("T5-XXL",         4096, 512, "Sequential (per-token)", "Compositional language, long prompts"),
]

rows = []
for name, dim, ctx, out_type, strength in encoders:
    rows.append([name, str(dim), str(ctx), out_type, strength])

js.window.py_table_data = json.dumps({
    "headers": ["Encoder", "Dim", "Ctx", "Output Type", "Strength"],
    "rows": rows
})

print("Pooled CLIP vectors (768 + 1280 = 2048-dim) -> AdaLN global conditioning")
print("T5 sequential tokens (4096-dim each)          -> MMDiT joint attention tokens")
print("\nTotal text encoder parameters: ~5.1B (vs ~700M for SDXL's dual CLIP)")
💡 T5-XXL alone has 4.7B parameters, more than the entire SD3-Medium denoiser (2B). The text encoders are frozen during SD3 training — their weights are never updated. They serve purely as feature extractors, converting text prompts into representations the MMDiT can work with. At inference time, you can drop T5-XXL entirely (at the cost of reduced prompt understanding) to save VRAM.

Flux: Single-Stream and Double-Stream Blocks

Flux (Black Forest Labs, 2024) was built by the same researchers who created Stable Diffusion — they left Stability AI and founded Black Forest Labs. Flux takes the MMDiT concept from SD3 and pushes it further with a hybrid architecture that transitions from modality-specific processing to full modality unification as tokens move through the network.

The Flux transformer has two types of blocks:

Double-stream blocks (early layers). These are structurally identical to MMDiT blocks. Text and image tokens have separate QKV projections, separate MLPs, and separate layer norms. They share only the attention computation (concatenated QKV, single softmax). The two modalities maintain their own processing pathways while exchanging information through attention. These blocks preserve modality-specific features while enabling cross-modal communication.

Single-stream blocks (later layers). Here, text and image tokens are fully concatenated into one unified sequence. There is no longer a distinction between text projections and image projections — both modalities share the same $W_Q$, $W_K$, $W_V$ matrices and the same MLP. The tokens are treated as a single, modality-agnostic sequence.

The transition from double-stream to single-stream encodes a hypothesis about how modalities should be fused: early layers need modality-specific processing because text and image representations start in very different spaces (text from T5, images from a VAE). Separate projections in early layers let each modality adapt its raw features into a form suitable for joint attention. By the later layers, the representations have been sufficiently aligned through repeated cross-modal attention that separate projections become redundant — a single shared set of weights can handle both modalities.
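
The weight-sharing difference is easy to see in code. A sketch of the query projection only (keys, values, and the MLPs follow the same pattern):

import numpy as np

rng = np.random.default_rng(0)
d = 3072                                   # Flux hidden size
n_txt, n_img = 512, 1024                   # illustrative token counts

x_txt = rng.standard_normal((n_txt, d))
x_img = rng.standard_normal((n_img, d))

# Double-stream: each modality has its own projection weights
W_txt = rng.standard_normal((d, d)) * 0.02
W_img = rng.standard_normal((d, d)) * 0.02
q_double = np.concatenate([x_txt @ W_txt, x_img @ W_img])

# Single-stream: one shared weight matrix for the unified sequence
W_shared = rng.standard_normal((d, d)) * 0.02
q_single = np.concatenate([x_txt, x_img]) @ W_shared

print(q_double.shape, q_single.shape)      # same shapes, half the weights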

Beyond this hybrid block design, Flux introduces several architectural innovations:

  • Rotary Position Embeddings (RoPE) for images. RoPE, originally designed for 1D text sequences, is extended to 2D for image tokens. Each image token gets a positional encoding based on its $(row, col)$ position in the latent grid. This lets the model generalise to different image resolutions more gracefully than fixed sinusoidal positional embeddings.
  • Parallel attention. Instead of computing attention and then the FFN sequentially (the standard transformer layout), Flux computes them in parallel and sums the results. Given input $x$, a standard block computes $x + \text{FFN}(x + \text{Attn}(x))$, while a parallel block computes $x + \text{Attn}(x) + \text{FFN}(x)$. This was introduced by PaLM (Chowdhery et al., 2022) and offers a practical throughput advantage: the attention and FFN computations can be overlapped on the GPU, reducing wall-clock time per layer.
  • Guidance distillation. Classifier-free guidance (CFG) requires two forward passes per denoising step: one conditioned and one unconditioned. This doubles inference compute. Flux trains a distilled student model that approximates the CFG-guided output in a single forward pass. The student learns to predict what the teacher (the full CFG pipeline) would produce, eliminating the 2x compute overhead (a toy sketch of what distillation removes follows this list). Flux.1-schnell uses this distillation to generate images in as few as 4 denoising steps.
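
To see concretely what guidance distillation removes, here is a toy sketch of one CFG step (the denoise function below is a stand-in, not a real model):

import numpy as np

rng = np.random.default_rng(0)

# Stand-in denoiser: the real one is the 12B Flux transformer
def denoise(x_t, cond):
    return 0.9 * x_t + 0.1 * cond

x_t = rng.standard_normal(16)
cond = rng.standard_normal(16)    # text conditioning
null = np.zeros(16)               # empty-prompt conditioning

# Classifier-free guidance: TWO forward passes per denoising step
def cfg_step(x_t, scale=5.0):
    v_cond = denoise(x_t, cond)     # conditioned pass
    v_uncond = denoise(x_t, null)   # unconditioned pass
    return v_uncond + scale * (v_cond - v_uncond)

print(cfg_step(x_t)[:4])
# A guidance-distilled student takes the scale as an extra input and is
# trained to reproduce cfg_step's output in ONE forward pass.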

The parallel attention formulation is worth understanding precisely. In the standard serial layout:

$$h = x + \text{Attn}(\text{Norm}_1(x))$$
$$\text{out} = h + \text{FFN}(\text{Norm}_2(h))$$

The FFN must wait for the attention output $h$ before it can begin. In the parallel layout:

$$\text{out} = x + \text{Attn}(\text{Norm}_1(x)) + \text{FFN}(\text{Norm}_2(x))$$

Both branches read the same input $x$, so they can execute simultaneously. The price is a slight quality degradation (the FFN no longer benefits from seeing the attention-updated representation), but at scale this is negligible and the throughput gain is significant.
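
As a toy sketch with random weights, the two layouts differ only in where the FFN reads its input:

import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1

def norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn(x):    # single-head self-attention
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

def ffn(x):     # two-layer ReLU MLP
    return np.maximum(x @ W1, 0) @ W2

x = rng.standard_normal((n, d))

# Serial (standard): the FFN consumes the attention output
h = x + attn(norm(x))
out_serial = h + ffn(norm(h))

# Parallel (PaLM/Flux): both branches read the same input x
out_parallel = x + attn(norm(x)) + ffn(norm(x))

print(np.abs(out_serial - out_parallel).mean())  # the layouts differ slightly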

Flux ships in three variants: Flux.1-dev (12B parameters, high quality, open-weight), Flux.1-schnell (distilled, generates in 4 steps, Apache-2.0 licensed), and Flux.1-pro (commercial API-only variant). At 12B parameters, Flux.1-dev is among the largest open text-to-image models, and the quality improvement over SD3-Medium (2B) is substantial.

# Flux architecture: block types
print("Flux.1-dev Architecture (12B parameters)")
print("=" * 60)
print()

# Approximate architecture based on public information
double_stream = 19
single_stream = 38
hidden_dim = 3072
heads = 24

print(f"Double-stream blocks (MMDiT-style):  {double_stream}")
print(f"Single-stream blocks (unified):      {single_stream}")
print(f"Total transformer blocks:            {double_stream + single_stream}")
print(f"Hidden dimension:                    {hidden_dim}")
print(f"Attention heads:                     {heads}")
print()
print("Token flow through the network:")
print("-" * 60)
print(f"Layers  1-{double_stream}:  text and image have SEPARATE projections")
print(f"         (double-stream: separate W_Q, W_K, W_V per modality)")
print(f"         Both modalities share the attention computation")
print()
print(f"Layers {double_stream+1}-{double_stream + single_stream}: text and image share ALL projections")
print(f"         (single-stream: one unified W_Q, W_K, W_V)")
print(f"         Modalities are treated as one sequence")
print()
print("Hypothesis: early layers align different representation spaces,")
print("later layers operate on an already-unified representation.")

What Actually Improved?

Architectural changes are only interesting if they produce visible improvements. What can SD3 and Flux do that SDXL could not?

Text rendering. This is the most dramatic improvement. Ask SDXL to generate an image containing the word "HELLO" and you will typically get garbled, semi-legible characters. SD3 and Flux can generate readable text in images — signs, book covers, T-shirts with text, and storefronts with correct lettering. Why? CLIP encoders, trained on image-text pairs, treat words as visual concepts ("a word that looks like HELLO"). T5-XXL, trained on massive text corpora, preserves the identity and order of the prompt's tokens, and its fine-grained sequential embeddings communicate that structure to the denoiser through MMDiT's joint attention.

Compositional prompts. "A red cube on a blue sphere" sounds simple, but SDXL frequently confused which attributes belonged to which object — producing a blue cube on a red sphere, or a purple object that seemed to blend both. This is the attribute binding problem: correctly associating adjectives with their nouns. T5's language understanding combined with joint attention (where colour tokens can directly attend to specific object tokens at every layer) dramatically improves attribute binding.

Prompt adherence. Write a 50-word detailed prompt for SDXL and many details will be ignored. SDXL's CLIP encoders have a 77-token context window — everything beyond that is truncated. T5-XXL supports 512 tokens. Combined with the richer sequential representation (per-token embeddings instead of a pooled vector), SD3 and Flux follow long, detailed prompts more faithfully.

Fewer artifacts. Hands with the wrong number of fingers, faces with distorted features, and objects with impossible geometry — these persistent diffusion artifacts are reduced (though not eliminated) in SD3 and Flux. The global attention at every layer (every image patch attends to every other image patch) helps maintain structural coherence across the entire image, rather than relying on the U-Net's limited receptive field at high resolutions.

What T5 specifically contributes can be summarised as understanding of: spatial relationships ("above", "behind", "to the left of"), negation ("without a hat" — CLIP struggles with negation because its contrastive training does not distinguish "with" from "without" very well), counting ("three cats" more reliably produces three cats, not two or four), and attribute binding ("a tall man in a red shirt standing next to a short woman in a blue dress").

import json, js

# Capability comparison across SD generations
capabilities = [
    ("Text rendering",      "None",     "Poor",     "Good",     "Excellent"),
    ("Attribute binding",   "Poor",     "Fair",     "Good",     "Very Good"),
    ("Prompt adherence",    "Fair",     "Fair",     "Good",     "Excellent"),
    ("Spatial reasoning",   "Poor",     "Poor",     "Fair",     "Good"),
    ("Negation handling",   "None",     "Poor",     "Fair",     "Fair"),
    ("Counting accuracy",   "Poor",     "Poor",     "Fair",     "Good"),
    ("Hand/face quality",   "Poor",     "Fair",     "Good",     "Very Good"),
    ("Max text tokens",     "77",       "77",       "77+512",   "77+512"),
    ("Native resolution",   "512",      "1024",     "1024",     "1024+"),
    ("Denoiser params",     "~860M",    "~3.5B",    "2-8B",     "12B"),
]

rows = []
for cap, sd15, sdxl, sd3, flux in capabilities:
    rows.append([cap, sd15, sdxl, sd3, flux])

js.window.py_table_data = json.dumps({
    "headers": ["Capability", "SD 1.5", "SDXL", "SD3", "Flux"],
    "rows": rows
})

print("Key enablers of improvement:")
print("  T5-XXL encoder   -> text rendering, compositional understanding")
print("  Joint attention   -> attribute binding, spatial coherence")
print("  Flow matching     -> fewer steps needed, cleaner denoising")
print("  Transformer scale -> global coherence (no conv receptive field limits)")
💡 You can run SD3 without T5-XXL to save VRAM (T5-XXL alone requires ~8GB in float16). The model still works — CLIP provides enough signal for simple prompts. But complex prompts, text rendering, and precise attribute binding degrade noticeably. T5 is what turns SD3 from "slightly better SDXL" into a qualitative leap in prompt understanding.

Quiz

Test your understanding of the SD3 and Flux architectures.

What is the key difference between cross-attention (SDXL) and joint attention (MMDiT) for text conditioning?

Why does SD3 use separate QKV projections for text and image tokens in MMDiT blocks?

What is the primary role of the T5-XXL encoder in SD3, compared to the two CLIP encoders?

Why does Flux use double-stream blocks in early layers and single-stream blocks in later layers?