Why Did SD 1.5 Struggle at High Resolution?

Stable Diffusion 1.5 was trained at 512x512 resolution. It worked remarkably well within that frame, but users wanted 1024x1024 images. What happens when you simply double the canvas? The model produces repeated objects, tiled patterns, and anatomical distortions. Faces appear twice. Buildings duplicate. The composition falls apart.

The reason is architectural, not a matter of diffusion quality. The U-Net backbone was trained on 512x512 latents (64x64 in the latent space after the VAE's 8x downsampling). At inference time, asking it to denoise a 128x128 latent (corresponding to 1024x1024 pixels) means every convolutional layer and attention layer sees spatial dimensions it has never encountered during training. The positional structure the network learned — where objects tend to appear, how compositions are arranged — is calibrated for 64x64 feature maps, not 128x128. The model doesn't "know" that the larger canvas is one coherent image; it treats patches of it like separate 64x64 windows, producing the telltale duplication artifacts.

Two architectural strategies emerged to solve this. The first: train a much bigger model natively at higher resolution. That's SDXL. The second: generate a small image first, then progressively upscale it with specialised super-resolution models. That's the cascaded approach, pioneered by Imagen. Both strategies ended up converging — SDXL itself uses a two-stage cascade with its refiner model.

SDXL: Scaling Up the U-Net

SDXL (Podell et al., 2023) took the brute-force approach: make the U-Net much larger and train it natively at 1024x1024. The result is a 3.5 billion parameter U-Net, roughly 4x larger than SD 1.5's ~860M parameter backbone. But the changes go well beyond parameter count.

The first major change is the dual text encoder. SD 1.5 used a single CLIP-L text encoder producing 768-dimensional embeddings. SDXL concatenates the outputs of two text encoders: CLIP-L (768-dim) and OpenCLIP-G (1280-dim). The concatenated conditioning vector is:

$$c_{\text{text}} = [\text{CLIP-L}(\text{prompt}); \; \text{OpenCLIP-G}(\text{prompt})] \in \mathbb{R}^{77 \times 2048}$$

Here 77 is the token sequence length and 2048 = 768 + 1280 is the combined embedding dimension. Why two encoders? Each was trained on different data distributions with different objectives. CLIP-L is strong on short, descriptive captions. OpenCLIP-G was trained on LAION data with longer, more detailed descriptions. Concatenating them gives the U-Net a richer text signal — the model can attend to whichever encoder's representation is more informative for a given aspect of the prompt. At the boundaries: if you set the OpenCLIP-G embeddings to zero (equivalent to using only CLIP-L), image quality degrades noticeably, confirming that the larger encoder contributes meaningful signal beyond what CLIP-L captures alone.
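To make the shapes concrete, here is a minimal PyTorch sketch of the concatenation, with random tensors standing in for the real encoder outputs (the variable names are illustrative, not SDXL's actual code):

```python
import torch

# Stand-ins for the per-token hidden states of the two text encoders.
clip_l_hidden = torch.randn(1, 77, 768)        # CLIP-L:     (batch, tokens, 768)
open_clip_g_hidden = torch.randn(1, 77, 1280)  # OpenCLIP-G: (batch, tokens, 1280)

# Concatenate along the embedding dimension to form the conditioning
# sequence the U-Net cross-attends to.
c_text = torch.cat([clip_l_hidden, open_clip_g_hidden], dim=-1)
print(c_text.shape)  # torch.Size([1, 77, 2048])
```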

The second major innovation is micro-conditioning. Real training datasets contain images at wildly different resolutions, and common practice is to resize and crop them to a fixed training resolution. This introduces two problems: (1) the model learns the artifacts of downsampled images (blur, loss of detail), and (2) the model learns the bias of cropped images (objects cut off at edges, centred compositions). SDXL solves both by encoding the original image metadata as additional conditioning inputs.

For original size conditioning, the model receives the height and width of the source image before any resizing. These are encoded via sinusoidal embeddings (the same kind used for timesteps in diffusion models) and added to the timestep embedding:

$$e_{\text{size}} = \text{MLP}\bigl(\text{fourier}(h_{\text{orig}}) \| \text{fourier}(w_{\text{orig}})\bigr)$$

💡 Why does this help? Consider two training images both cropped to 1024x1024: one was originally 4000x3000 (a high-res photo) and the other was 256x256 (a low-quality thumbnail upscaled). Without size conditioning, the model sees identical 1024x1024 inputs and must learn to produce outputs that average both quality levels. With size conditioning, it can learn that $h_{\text{orig}} = 256$ means blurry source material and $h_{\text{orig}} = 4000$ means crisp detail. At inference time, we set $h_{\text{orig}} = w_{\text{orig}} = 1024$ (or higher) to request the highest quality output.

For crop conditioning, the model receives the top-left crop coordinates $(c_{\text{top}}, c_{\text{left}})$ used during training:

$$e_{\text{crop}} = \text{MLP}\bigl(\text{fourier}(c_{\text{top}}) \| \text{fourier}(c_{\text{left}})\bigr)$$

At the boundary where $c_{\text{top}} = c_{\text{left}} = 0$ (no crop offset), the model generates images with the subject properly framed and centred. When the crop coordinates are large, the model has learned that the training image was a peripheral crop — objects may be partially out of frame. At inference, setting both to zero tells the model: "this is a full, uncropped composition", eliminating the centre-bias problem that plagued earlier models.

The total micro-conditioning embedding is the sum of the size and crop embeddings, added to the diffusion timestep embedding $e_t$:

$$e = e_t + e_{\text{size}} + e_{\text{crop}}$$

This combined embedding is injected into every residual block of the U-Net through adaptive group normalisation (the same mechanism used for timestep conditioning). The overhead is negligible — a few extra MLP layers — but the quality improvement is substantial.
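The following sketch shows one plausible implementation of this micro-conditioning path, following the formulas above; the Fourier dimension, embedding width, and helper names are assumptions for illustration, not SDXL's exact code:

```python
import math
import torch
import torch.nn as nn

def fourier(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal encoding of a scalar, the same scheme used for timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)          # (B, dim)

def make_mlp(in_dim: int, embed_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, embed_dim), nn.SiLU(),
                         nn.Linear(embed_dim, embed_dim))

embed_dim = 1280                        # illustrative embedding width
size_mlp = make_mlp(2 * 256, embed_dim)
crop_mlp = make_mlp(2 * 256, embed_dim)

def micro_conditioning(e_t, h_orig, w_orig, c_top, c_left):
    """e = e_t + e_size + e_crop, as in the equations above."""
    e_size = size_mlp(torch.cat([fourier(h_orig), fourier(w_orig)], dim=-1))
    e_crop = crop_mlp(torch.cat([fourier(c_top), fourier(c_left)], dim=-1))
    return e_t + e_size + e_crop

# Inference-time settings: "high-quality, uncropped 1024x1024 source".
e_t = torch.zeros(1, embed_dim)         # stand-in for the timestep embedding
t = torch.tensor
e = micro_conditioning(e_t, t([1024]), t([1024]), t([0]), t([0]))
print(e.shape)  # torch.Size([1, 1280])
```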

The SDXL Refiner: A Minimal Cascade

Even with a 3.5B parameter base model trained at 1024x1024, SDXL's authors found that image quality improved further with a two-stage process. The SDXL refiner is a separate diffusion model trained specifically on the low-noise portion of the denoising schedule — the final steps where fine details like skin texture, fabric weave, and sharp edges are rendered.

The process works as follows. The base model runs the full denoising schedule from pure noise ($t = T$) down to some intermediate noise level ($t = t_{\text{switch}}$). At that point, the partially denoised latent is handed to the refiner, which completes the remaining steps from $t_{\text{switch}}$ down to $t = 0$. The switch point is a hyperparameter — typically around $t_{\text{switch}} = 0.2T$ to $0.3T$, meaning the base handles roughly 70-80% of the denoising and the refiner handles the final 20-30%.
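In practice this handoff is commonly run as an "ensemble of experts" with the Hugging Face diffusers library; the sketch below assumes a recent diffusers version in which the SDXL pipelines accept `denoising_end` / `denoising_start` fractions (argument names may differ across versions):

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a close-up portrait photograph, natural light"
switch = 0.8  # base handles the first 80% of the schedule, refiner the rest

# Base: stop at the switch point and hand over latents, not decoded pixels.
latents = base(prompt=prompt, num_inference_steps=40,
               denoising_end=switch, output_type="latent").images

# Refiner: resume from the same point and finish the low-noise steps.
image = refiner(prompt=prompt, num_inference_steps=40,
                denoising_start=switch, image=latents).images[0]
```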

Why does this help? Different noise levels require different skills. At high noise levels ($t$ near $T$), the model makes global decisions: overall layout, object placement, colour palette. At low noise levels ($t$ near 0), the model refines local details: sharpness, texture, fine structures. A single model must be good at both, which is a challenging multi-task learning problem. Splitting the work across two specialised models lets each focus on what it does best.

Alternatively, the refiner can operate in an SDEdit (Meng et al., 2022) style: take the fully denoised output of the base model, add a controlled amount of noise back to it (to noise level $t_{\text{edit}}$), then denoise again with the refiner. The re-noising step is:

$$\mathbf{z}_{\text{noised}} = \sqrt{\bar{\alpha}_{t_{\text{edit}}}} \, \mathbf{z}_{\text{base}} + \sqrt{1 - \bar{\alpha}_{t_{\text{edit}}}} \, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

When $t_{\text{edit}}$ is small (say 0.1T), only fine details are perturbed and regenerated — the overall composition is preserved. When $t_{\text{edit}}$ is large (say 0.5T), the refiner has freedom to substantially alter the image. This gives practitioners a controllable knob: small $t_{\text{edit}}$ for detail enhancement, large $t_{\text{edit}}$ for more aggressive refinement.
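A minimal sketch of this re-noising step, assuming $\bar{\alpha}_{t_{\text{edit}}}$ has already been looked up from the noise schedule at the chosen $t_{\text{edit}}$:

```python
import torch

def sdedit_renoise(z_base: torch.Tensor, alpha_bar: float) -> torch.Tensor:
    """Apply the forward-noising formula above to a clean latent."""
    eps = torch.randn_like(z_base)
    return (alpha_bar ** 0.5) * z_base + ((1.0 - alpha_bar) ** 0.5) * eps

z_base = torch.randn(1, 4, 128, 128)   # stand-in for the base model's latent
# Small t_edit -> alpha_bar close to 1 -> only fine detail is perturbed.
z_detail = sdedit_renoise(z_base, alpha_bar=0.9)
# Larger t_edit -> smaller alpha_bar -> the refiner may alter composition.
z_aggressive = sdedit_renoise(z_base, alpha_bar=0.5)
```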

This two-model design is the simplest form of a cascaded diffusion architecture: multiple diffusion models chained together, each responsible for a different level of detail. SDXL's cascade is minimal (same resolution, two stages), but the principle scales to much deeper cascades.

Imagen: The Text Encoder Matters Most

Google's Imagen (Saharia et al., 2022) took cascading to its logical extreme with a three-stage pipeline, and in the process revealed a surprising finding: scaling the text encoder improves image quality more than scaling the diffusion model.

Imagen uses T5-XXL (4.6B parameters) as its text encoder — a frozen, pre-trained language model far larger than CLIP. The authors ran controlled experiments: holding the diffusion model fixed and scaling the text encoder from T5-Small (60M) to T5-XXL (4.6B) improved FID scores and human preference ratings dramatically. Scaling the diffusion model by the same factor helped much less. The explanation is intuitive: the diffusion model can only generate what it understands from the text conditioning. A richer, more nuanced text representation gives the diffusion model better instructions to follow.

The architecture is a three-stage cascade, where each stage is a separate diffusion model trained independently:

  • Stage 1 — Base model: generates 64x64 images conditioned on T5-XXL text embeddings. This is where all the semantic content is decided: layout, objects, relationships, style.
  • Stage 2 — Super-resolution 64 to 256: takes the 64x64 output and upscales it to 256x256, adding medium-scale detail like object boundaries and surface textures.
  • Stage 3 — Super-resolution 256 to 1024: takes the 256x256 output and upscales to the final 1024x1024, adding fine-grained detail like hair strands, text rendering, and material properties.

Why is this cascade so much cheaper than generating 1024x1024 directly? The cost of the diffusion process scales roughly with the number of pixels (since the U-Net must process feature maps at that spatial resolution for each denoising step). Compare the pixel counts:

$$\text{Direct: } 1024^2 = 1{,}048{,}576 \text{ pixels}$$
$$\text{Cascade: } 64^2 + 256^2 + 1024^2 = 4{,}096 + 65{,}536 + 1{,}048{,}576 = 1{,}118{,}208 \text{ pixels}$$

At first glance the total pixel count looks similar — the final stage still processes 1024x1024. But the key insight is that the number of denoising steps differs dramatically across stages. The base model (64x64) requires many steps to build the full semantic composition from noise — typically 50-100 steps. The super-resolution models need far fewer steps (20-30) because they start from a structured image, not pure noise. The expensive semantic reasoning happens at 64x64 ($4{,}096$ pixels), not at 1024x1024 ($1{,}048{,}576$ pixels). That's a 256x reduction in spatial cost for the hardest part of generation.
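A back-of-the-envelope comparison of "pixels processed × denoising steps" makes the saving concrete (the step counts below are illustrative values within the ranges quoted above, not measured numbers):

```python
# Direct generation: every step runs at full resolution.
direct = 1024**2 * 100                       # 100 steps at 1024x1024

# Cascade: many steps only at 64x64, few steps at the larger resolutions.
cascade = 64**2 * 100 + 256**2 * 30 + 1024**2 * 20

print(f"direct:  {direct:,}")                # 104,857,600
print(f"cascade: {cascade:,}")               # 23,347,200  (~4.5x cheaper)
```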

💡 The cascade principle generalises: any time you can split generation into "decide what to generate" (cheap, low-res) and "add detail" (conditioned on existing structure), you save compute. The super-resolution stages are much simpler models because they don't need to solve the harder problem of composition from scratch — they just need to hallucinate plausible high-frequency detail consistent with the low-resolution input.

Each super-resolution model in Imagen is conditioned on both the text embedding and the low-resolution image from the previous stage. The low-resolution image is upsampled (via bilinear interpolation) to the target resolution and concatenated channel-wise with the noisy input:

$$\mathbf{x}_{\text{input}} = [\mathbf{z}_t; \; \text{upsample}(\mathbf{x}_{\text{low-res}})] \in \mathbb{R}^{H \times W \times (C + C_{\text{low}})}$$

where $\mathbf{z}_t$ is the noisy target at timestep $t$, and $\mathbf{x}_{\text{low-res}}$ is the output from the previous cascade stage. The model learns to use the low-resolution structure as a guide while filling in the missing high-frequency detail. Imagen also applies noise augmentation to the low-resolution conditioning image during training: adding Gaussian noise at a random level $s$ (drawn from a schedule). This prevents the super-resolution model from becoming brittle — if it trains only on clean low-res inputs, any imperfections in the base model's output at inference time will cause cascading errors.
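Here is a simplified sketch of how a super-resolution stage's input might be assembled; note that Imagen's actual noise augmentation uses the diffusion forward process at a random level $s$ and also feeds $s$ to the model as conditioning, which this sketch reduces to a single additive-noise knob:

```python
import torch
import torch.nn.functional as F

def sr_model_input(z_t: torch.Tensor, x_low: torch.Tensor,
                   aug_level: float) -> torch.Tensor:
    """Noisy target concatenated channel-wise with the upsampled low-res image."""
    # Noise augmentation: corrupt the conditioning image so the SR model
    # tolerates imperfect base-model outputs at inference time.
    x_low = x_low + aug_level * torch.randn_like(x_low)
    # Bilinear upsampling of the conditioning to the target resolution.
    x_low_up = F.interpolate(x_low, size=z_t.shape[-2:],
                             mode="bilinear", align_corners=False)
    return torch.cat([z_t, x_low_up], dim=1)       # concat along channels

z_t = torch.randn(1, 3, 256, 256)     # noisy 256x256 target at timestep t
x_low = torch.randn(1, 3, 64, 64)     # 64x64 output of the previous stage
x_in = sr_model_input(z_t, x_low, aug_level=0.1)
print(x_in.shape)                     # torch.Size([1, 6, 256, 256])
```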

Conditioning Tricks That Punch Above Their Weight

SDXL's micro-conditioning (size and crop metadata) is part of a broader pattern in diffusion model design: turning cheap metadata into powerful conditioning signals. These tricks require minimal architectural changes but meaningfully improve generation quality.

The aesthetic score conditioning technique, used in models like Stable Diffusion 2.1 and DeepFloyd IF, adds a scalar quality rating to the conditioning. During training, each image is scored by a pre-trained aesthetic predictor (typically a linear probe on top of CLIP embeddings). The score is Fourier-encoded and added to the timestep embedding, just like size and crop conditioning. At inference, setting the aesthetic score to a high value (e.g. 7.0 on a 1-10 scale) biases the model toward generating images that match the visual qualities — sharpness, composition, colour harmony — associated with high-rated training images.

All of these conditioning signals share the same mathematical form. Given a scalar or vector metadata value $m$, encode it via Fourier features and project it into the model's hidden dimension:

$$e_m = \text{Linear}\bigl(\text{SiLU}(\text{Linear}(\text{fourier}(m)))\bigr)$$

The Fourier encoding maps the scalar into a high-dimensional sinusoidal representation (the same technique used for diffusion timesteps), and the two-layer MLP with SiLU activation projects it into the model's embedding space. This is then added to the timestep embedding $e_t$ and injected into every residual block via adaptive normalisation. The key insight: any piece of metadata that correlates with image properties can be turned into a conditioning signal this way. The model learns to disentangle these factors during training, and at inference time we can set each factor independently.
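Using the same `fourier` and `make_mlp` helpers sketched earlier for micro-conditioning, an aesthetic-score branch would be only a few more lines (again an illustrative sketch, not any particular model's implementation):

```python
# Reuses fourier() and make_mlp() from the micro-conditioning sketch above.
aesthetic_mlp = make_mlp(256, embed_dim)

def aesthetic_embedding(score: torch.Tensor) -> torch.Tensor:
    """e_m = Linear(SiLU(Linear(fourier(m)))) for a scalar aesthetic score."""
    return aesthetic_mlp(fourier(score))      # added to e_t like size/crop

e_aesthetic = aesthetic_embedding(torch.tensor([7.0]))  # request "high quality"
```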

Consider what happens at the extremes of aesthetic conditioning. At the minimum score ($\sim$1.0), the model generates images resembling the lowest-quality training data: blurry, poorly composed, washed-out. At the maximum score ($\sim$10.0), quality improves dramatically, but the model may also reduce diversity — it gravitates toward a narrower set of "conventionally attractive" compositions. There's a tradeoff between quality and diversity that mirrors what we see with classifier-free guidance strength.

💡 The lesson from SDXL's conditioning innovations is that you don't always need a bigger model — sometimes you just need better metadata. Telling the model what resolution the source image was, where it was cropped, and how aesthetically pleasing it is are essentially free signals that meaningfully reduce the ambiguity the model must resolve on its own.

The Limits of U-Net Architectures

SDXL pushed the U-Net architecture to its practical limit. At 3.5B parameters, the model is already expensive to train and run inference on. But more importantly, the U-Net has fundamental architectural constraints that no amount of scaling can overcome.

U-Nets are built primarily from convolutional layers, which have two key inductive biases: locality (each output pixel depends only on a small spatial neighbourhood defined by the kernel size) and translation equivariance (a pattern is recognised the same way regardless of where it appears). These biases are excellent for spatial data — they're why CNNs dominated computer vision for a decade. But they also limit the model's ability to reason about global structure.

Consider generating an image of "a person holding a mirror reflecting their face". The model must ensure that the reflection is spatially consistent with the person — the face in the mirror must match the face being reflected, positioned correctly given the mirror's angle. Convolutional layers with 3x3 kernels can only propagate information across the full image by stacking many layers (the receptive field grows linearly with depth). Self-attention layers, which SDXL includes only at its lower-resolution levels (the 64x64 and 32x32 feature maps when generating at 1024x1024), allow global reasoning, but they're computationally expensive:

$$\text{Self-attention cost} = \mathcal{O}(n^2 \cdot d)$$

where $n$ is the number of spatial tokens (pixels or patches) and $d$ is the embedding dimension. For a 64x64 feature map, $n = 4{,}096$ and $n^2 = 16{,}777{,}216$ — already expensive. For a 128x128 feature map, $n = 16{,}384$ and $n^2 \approx 268$ million. This quadratic scaling is why SDXL only applies attention at the lower-resolution levels of the U-Net, not at full resolution. The highest-resolution layers (the 128x128 feature maps in latent space) use only convolutions, meaning they have no mechanism for global reasoning at that scale.
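The quadratic blow-up is easy to verify directly; this tiny loop prints the attention score matrix size for a few feature map resolutions:

```python
# Size of the n x n attention score matrix for square feature maps of side s.
for s in (16, 32, 64, 128):
    n = s * s
    print(f"{s:>3}x{s:<3}  n = {n:>6,}   n^2 = {n * n:>13,}")
```

At 128x128 the score matrix alone has roughly 268 million entries per attention head, which is why full-resolution attention in the U-Net is impractical.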

The U-Net's encoder-decoder structure with skip connections also imposes a fixed multi-scale hierarchy. Information flows down through the encoder (halving spatial resolution at each level), across the bottleneck, and back up through the decoder (doubling resolution at each level), with skip connections carrying high-resolution features from encoder to decoder. This is a strong architectural prior that works well for many tasks, but it's rigid: the number of scales, the resolution at each scale, and the flow of information are all fixed at architecture design time.

These limitations motivated a fundamental architectural question: what if we replaced the U-Net entirely with a transformer? Transformers treat the image as a sequence of patches and apply self-attention across all of them at every layer — global reasoning everywhere, not just at the bottleneck. The quadratic attention cost is still there, but techniques like FlashAttention and patch-based tokenisation make it manageable. The next article covers this shift: the Diffusion Transformer (DiT), which replaces the U-Net with a plain vision transformer and unlocks a new scaling regime for image and video generation.

Quiz

Test your understanding of SDXL and cascaded diffusion architectures.

Why does Stable Diffusion 1.5 produce repeated patterns when generating 1024x1024 images?

What was Imagen's most surprising finding about scaling?

In SDXL, what is the purpose of crop conditioning ($c_{\text{top}}, c_{\text{left}}$)?

Why is Imagen's three-stage cascade cheaper than generating 1024x1024 directly, even though the final stage still processes 1024x1024?