Why Do Image Models Ignore Parts of Your Prompt?

You type "a green cat wearing a red hat sitting on a blue chair" into Stable Diffusion and get back a red cat on a green chair with no hat. The objects are there, the colours are there, but the bindings are wrong — the model assigned colours to the wrong objects, dropped an attribute entirely, and scrambled the spatial arrangement. This is not a random failure. It happens systematically because the diffusion model does not understand your prompt the way you do. It sees a bag of visual concepts (cat, green, red, hat, chair, blue) and combines them in ways that are statistically plausible but semantically wrong.

This is the text-image alignment problem: the gap between what the user describes and what the model generates. It manifests in several specific failure modes: wrong attribute binding (the red hat becomes a red cat), wrong spatial relationships ("above" becomes "next to"), wrong object count ("three apples" yields two or five), and garbled text rendering ("a sign saying HELLO" produces nonsense characters). All of these stem from the same root cause: the text encoder compresses the prompt into a representation that loses compositional structure.

Three distinct strategies emerged to fix this:

  • Better text encoders: use a text encoder that actually understands language structure (T5-XXL instead of CLIP). This is the Imagen approach.
  • Better training data: rewrite every caption in the training set so the model learns from precise, detailed descriptions instead of noisy alt-text. This is the DALL-E 3 approach.
  • Better architecture: change how text features and image features interact inside the denoiser (MMDiT joint attention in SD3/Flux, covered in the previous article).

This article covers the first two — the data-centric approach of DALL-E 3 and the model-centric approach of Imagen — because they represent fundamentally different philosophies that both proved remarkably effective.

DALL-E 3: Fix the Training Data

The DALL-E 3 team at OpenAI started from a simple observation: image-text pairs scraped from the internet are terrible. The "caption" for a stunning photo of a sunset over the Grand Canyon might be IMG_2847.jpg. A detailed product photo might have alt-text saying "image". A painting of a woman in a red dress standing in a field of sunflowers might be captioned "pretty pic" (Betker et al., 2023). When you train a diffusion model on these pairs, the model learns that vague descriptions map to detailed images. So when a user later provides a detailed, specific prompt, the model has no training signal for how to faithfully follow it — it was never trained on precise captions.

DALL-E 3's solution is elegant: if the internet's captions are bad, rewrite all of them. The approach has three steps:

Step 1: Train a captioning model. OpenAI trained a specialised vision-language model to produce highly descriptive, accurate captions for any image. This captioner doesn't just say "a cat" — it says "a green tabby cat wearing a small red knitted hat, sitting on a blue wooden chair in a sunlit kitchen, with a white tile floor and a window in the background." It describes colours, positions, spatial relationships, materials, lighting, and text visible in the image.

Step 2: Recaption the entire training dataset. Run the captioner on every image in the training corpus — billions of images — replacing the original noisy alt-text with the captioner's detailed description. This produces a new dataset of (image, detailed_caption) pairs.

Step 3: Train the diffusion model on the recaptioned data. The diffusion model now sees only precise, detailed captions during training. It never encounters "IMG_2847.jpg" or "pretty pic". Every training example teaches it the relationship between a specific description and a specific visual.
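
To make the pipeline concrete, here is a minimal sketch of the recaptioning loop. The captioner itself is the hard part and was never released, so caption_image below is a hypothetical stand-in for any sufficiently strong vision-language model:

def recaption_dataset(dataset, caption_image):
    """Sketch of DALL-E 3's recaptioning step (Betker et al., 2023).

    dataset:       iterable of (image, alt_text) pairs
    caption_image: hypothetical stand-in for OpenAI's descriptive
                   captioner, which is not publicly available
    """
    recaptioned = []
    for image, alt_text in dataset:
        detailed = caption_image(image)  # e.g. "a green tabby cat wearing..."
        recaptioned.append((image, detailed))  # noisy alt_text is discarded
    return recaptioned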

💡 Why does this work so well? Consider what happens during training. When every caption says exactly which object has which colour, the model gets gradient signal to bind attributes correctly. When every caption describes spatial relationships ("the cat is ON the chair, not beside it"), the model learns to respect those prepositions. The compositional skills emerge from compositional training data — the model can only learn what the captions teach it.

The result is dramatically better prompt adherence across all the failure modes we listed:

  • Spatial relationships: "above", "behind", "between" are correctly rendered because the recaptioner always describes spatial layout.
  • Attribute binding: the correct colour, size, and material attach to the correct object because captions always pair attributes with their objects.
  • Counting: "exactly three apples" works better because the captioner counts objects.
  • Text rendering: text in images improved because the captioner transcribes visible text, giving the model training signal for character shapes.

Critically, this is a data solution, not an architecture solution. The DALL-E 3 model architecture is relatively standard — a U-Net denoiser conditioned on CLIP text embeddings via cross-attention, with a T5 encoder added for richer text understanding. The magic is entirely in the training data quality. This insight — that a mediocre model trained on excellent data can outperform an excellent model trained on mediocre data — has proven to be one of the most impactful ideas in generative AI.

One practical implication: at inference time, DALL-E 3 in ChatGPT uses LLM-mediated prompting. The user's short prompt ("a cat on a chair") is first rewritten by GPT-4 into a long, detailed description before being sent to the diffusion model. This bridges the gap between how humans naturally write prompts (short and vague) and how the recaptioned model expects them (long and detailed). The LLM acts as a translator between human intent and the level of specificity the model was trained on.
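
A minimal sketch of that translation step; the instruction text and the llm callable here are illustrative assumptions, not OpenAI's actual system prompt or API:

UPSAMPLE_INSTRUCTION = (
    "Rewrite the user's image request as one detailed paragraph. "
    "Name every object with its colour, size, and material, make every "
    "spatial relationship explicit, and preserve the user's intent."
)

def upsample_prompt(user_prompt, llm):
    """LLM-mediated prompt rewriting; llm is a hypothetical callable
    wrapping any instruction-following chat model."""
    return llm(system=UPSAMPLE_INSTRUCTION, user=user_prompt)

# "a cat on a chair" might come back as:
# "A grey tabby cat sits squarely on the seat of a blue wooden chair
#  in a sunlit kitchen, white tile floor below, a bright window behind."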

Imagen: Scale the Text Encoder

Imagen (Saharia et al., 2022) from Google Brain took a completely different approach to the same problem. Instead of fixing the training data, Imagen asked: what if the text encoder itself were much more powerful? What if it could actually parse the compositional structure of a prompt?

The key experiment was a head-to-head comparison: scale the diffusion model (the U-Net denoiser) vs. scale the text encoder. The result was unambiguous — scaling the text encoder improved image quality and text-image alignment far more than scaling the denoiser. Specifically, replacing CLIP's text encoder (63M parameters) with Google's T5-XXL (4.6B parameters) dramatically improved both FID (image quality) and CLIP score (text-image alignment), while proportionally increasing the U-Net size yielded much smaller gains. The text encoder was the bottleneck.

Why does a bigger text encoder help so much? The answer lies in how CLIP and T5 were trained. CLIP was trained on image-text pairs using a contrastive objective: learn to match images with their captions and reject mismatched pairs. This produces a model that recognises that "a panda" and a photo of a panda go together, but its language understanding is shallow. CLIP encodes prompts as if they were image search queries — it recognises visual concepts but struggles with compositional structure.

T5, by contrast, was trained on massive text-only corpora with diverse language tasks: translation, summarisation, question answering, sentence completion. It learned rich syntactic and semantic representations. Consider the prompt "a panda making latte art." CLIP sees something like a bag of concepts: {panda, making, latte, art}. T5 parses it as a structured representation: SUBJECT=panda, ACTION=making, OBJECT=latte art. That structural understanding propagates through cross-attention into the denoiser, producing an image where the panda is actually performing the action rather than just existing near a latte.

💡 This is why the CLIP text encoder struggles with negation ("a scene with no elephants" often produces elephants) and attribute binding ("a red cube and a blue sphere" might produce a blue cube). CLIP was trained to detect whether concepts are present, not to understand their relationships. T5, trained on language tasks requiring compositional understanding, handles these cases far better.
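
This insensitivity is easy to observe directly. The sketch below (using the Hugging Face transformers library; the exact similarity value varies by checkpoint) embeds two prompts that differ only in which attribute binds to which object:

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a red cube and a blue sphere",
           "a blue cube and a red sphere"]  # same words, swapped bindings
inputs = processor(text=prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

# A cosine similarity near 1.0 means CLIP barely distinguishes
# which attribute belongs to which object.
print(f"CLIP text similarity: {(emb[0] @ emb[1]).item():.3f}")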

The Imagen architecture is a three-stage cascade:

  • Frozen T5-XXL text encoder: the pretrained T5-XXL is used as-is, with no fine-tuning. Its weights are frozen during diffusion training. This is important: it means the text representations are general-purpose language understanding, not adapted to image generation. The diffusion model learns to interpret T5's rich text embeddings (see the sketch after this list).
  • Base diffusion model (64x64): a U-Net denoiser that generates a 64x64 image conditioned on T5 text embeddings via cross-attention. This is where text-image alignment is established.
  • Two super-resolution diffusion models (64 to 256 to 1024): separate U-Net denoisers that upscale the image in two steps. Each is also conditioned on T5 text embeddings, so the text guides both composition and fine detail.
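
Reproducing the first stage of this setup is straightforward with the Hugging Face transformers library. The sketch below uses the small t5-small checkpoint so it runs anywhere; Imagen itself uses T5-XXL:

import torch
from transformers import AutoTokenizer, T5EncoderModel

# Imagen uses T5-XXL (4.6B); t5-small keeps this sketch lightweight.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.requires_grad_(False)  # frozen: no gradients flow into the LM
encoder.eval()

tokens = tokenizer("a panda making latte art", return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**tokens).last_hidden_state

# One embedding per token; the denoiser cross-attends to this sequence.
print(text_embeddings.shape)  # e.g. torch.Size([1, 7, 512]) for t5-small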

Imagen also introduced dynamic thresholding, a technique for preventing colour saturation at high classifier-free guidance scales. In standard diffusion sampling, we compute a predicted clean image $\hat{x}_0$ at each denoising step. With high guidance (large $w$), the predicted pixel values can overshoot the valid range $[-1, 1]$, causing washed-out or oversaturated images. The naive fix is static thresholding — clip to $[-1, 1]$ — but this loses information and produces flat grey regions.

Dynamic thresholding instead selects a clipping threshold $s$ based on a percentile $p$ (typically $p = 99.5\%$) of the absolute values in $\hat{x}_0$ at that step:

$$s = \max\!\left(1,\; \text{percentile}\!\left(|\hat{x}_0|,\, p\right)\right)$$

Then the predicted image is clipped and rescaled:

$$\hat{x}_0 \leftarrow \text{clip}\!\left(\hat{x}_0,\, -s,\, s\right) \;/\; s$$

Let's walk through what this does at the boundaries. When the guidance scale is low and $\hat{x}_0$ values are well-behaved (all within $[-1, 1]$), the percentile will be at most 1, so $s = \max(1, \text{something} \leq 1) = 1$. Clipping to $[-1, 1]$ and dividing by 1 changes nothing — dynamic thresholding is invisible. When guidance is high and some values overshoot to, say, $[-2.5, 2.5]$, the 99.5th percentile might be $s = 2.3$. We clip to $[-2.3, 2.3]$ and then divide by 2.3, mapping the range back to approximately $[-1, 1]$. The key insight: instead of hard-clipping all overshooting values to exactly $\pm 1$ (which destroys relative differences between them), we preserve the relative structure of the prediction by rescaling. Bright regions stay brighter than dim regions instead of all being crushed to the same saturated value.

The $\max(1, \ldots)$ ensures we never amplify values. If all predictions are already within $[-1, 1]$, the threshold stays at 1. Without this floor, $s$ could be less than 1 (say, 0.3), and dividing by 0.3 would expand the range, artificially boosting contrast.

import numpy as np

def dynamic_threshold(x0_pred, percentile=99.5):
    """Dynamic thresholding from Imagen (Saharia et al., 2022)."""
    s = max(1.0, np.percentile(np.abs(x0_pred), percentile))
    x0_clipped = np.clip(x0_pred, -s, s) / s
    return x0_clipped, s

# Simulate predicted x0 at different guidance scales
np.random.seed(42)
base = np.random.randn(64, 64) * 0.2  # base prediction, well-behaved (comfortably within [-1, 1])

print("Dynamic thresholding at different guidance scales")
print("=" * 62)
print(f"{'Guidance w':<12} {'Max |x0|':<12} {'Threshold s':<14} {'Max after':<12}")
print("-" * 62)

for w in [1.0, 3.0, 7.5, 15.0, 30.0]:
    x0 = base * w  # higher guidance -> larger values
    x0_fixed, s = dynamic_threshold(x0, percentile=99.5)
    print(f"{w:<12.1f} {np.max(np.abs(x0)):<12.2f} {s:<14.2f} {np.max(np.abs(x0_fixed)):<12.4f}")

print()
print("At w=1.0: s=1.0 (floor), no change needed")
print("At w=30.0: s adapts to the large values, rescaling instead of hard-clipping")

Two Philosophies, One Goal

DALL-E 3 and Imagen represent two fundamentally different strategies for the same problem. Which lever do you pull to improve text understanding?

  • DALL-E 3 (data-centric): the model is fine — the training data is the problem. Fix the captions, and a standard architecture will learn compositional generation.
  • Imagen (model-centric): the data is fine — the text encoder is too weak to represent compositional meaning. Use a language model that actually understands language, and the denoiser will receive richer conditioning signal.

Both approaches work. And they are complementary — nothing prevents you from using detailed recaptioned data with a T5 text encoder. In fact, that is exactly what happened. Stable Diffusion 3 (Esser et al., 2024) and Flux (Black Forest Labs, 2024) combine both insights: they use T5-XXL as a text encoder (Imagen's lesson) and train on high-quality, detailed captions (DALL-E 3's lesson), plus the MMDiT joint-attention architecture (the architectural lesson). The convergence of all three strategies explains why SD3 and Flux showed such a dramatic leap in prompt adherence over their predecessors.

Imagen 3 (Google DeepMind, 2024) likely combines both approaches as well, though published architectural details are sparse. What is known is that it shows further improvements in text rendering, spatial reasoning, and fine-grained detail — all hallmarks of the combined data + encoder strategy.

The interplay between the two approaches can be understood quantitatively. Consider the conditioning signal that reaches the denoiser. In a cross-attention layer, the image tokens attend to text tokens to extract conditioning information. The quality of this signal depends on two factors: (1) whether the text encoder captured the right compositional structure, and (2) whether that structure was reinforced during training by matching captions.

To make this concrete, think about the prompt "a red cube to the left of a blue sphere." We can roughly model the effective information reaching the denoiser as:

$$I_{\text{effective}} = I_{\text{encoder}}(\text{prompt}) \;\cdot\; P_{\text{training}}(\text{faithful rendering} \mid \text{caption quality})$$

The first term is the encoder's ability to represent compositional structure. If the encoder collapses "red cube to the left of blue sphere" into an unstructured bag of concepts, $I_{\text{encoder}}$ is low regardless of data quality. The second term is the probability that the model learned to faithfully render structured descriptions, which depends on how often the training captions included such structure. If every training caption is "nice image," the model has zero signal for binding "red" to "cube" — so $P_{\text{training}}$ is near zero regardless of encoder quality.

Imagen maximises the first term. DALL-E 3 maximises the second. SD3 and Flux maximise both. This explains the empirical observation that combining approaches yields improvements that are more than additive — a strong encoder on rich captions is multiplicatively better than either alone.
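
A toy calculation makes the multiplicative behaviour visible. The quality scores below are invented for illustration, not measured quantities:

# Toy illustration of the multiplicative model above; all numbers invented.
encoder_quality = {"CLIP": 0.4, "T5-XXL": 0.9}           # I_encoder
caption_quality = {"alt-text": 0.3, "recaptioned": 0.9}  # P_training

for enc, i_enc in encoder_quality.items():
    for cap, p_train in caption_quality.items():
        print(f"{enc:8s} + {cap:12s}: I_effective = {i_enc * p_train:.2f}")

# Upgrading either factor alone gives a 2-3x improvement;
# upgrading both (0.4*0.3 -> 0.9*0.9) gives a ~7x improvement.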

Midjourney: The Black Box

No article on commercial image generation would be complete without mentioning Midjourney, which has arguably been the most widely used image generation service since 2022. But unlike DALL-E 3 and Imagen, Midjourney publishes almost no architectural details. There is no paper, no model card, and no public description of the training data, the text encoder, or the denoiser architecture.

What is known (or can be reasonably inferred):

  • Diffusion-based: the iterative refinement visible in the generation process is consistent with diffusion or flow-matching.
  • Aesthetic emphasis: Midjourney's key differentiator has been consistent aesthetic quality. This likely stems from careful curation of training data biased toward professional photography and art, possibly combined with aesthetic reward models during training or post-processing.
  • Rapid iteration: Midjourney v6 (December 2023) showed a dramatic leap in prompt adherence and text rendering — improvements consistent with adopting the same lessons as DALL-E 3 (better captions) and Imagen/SD3 (stronger text encoders), though this is speculation.

We mention Midjourney for completeness, but the lack of published research means we cannot teach its specific innovations. The lesson for a technical audience is that productionisation matters: curation, default settings, post-processing, and user-experience design can make a system feel dramatically better even without novel architecture. Much of what users perceive as "better generation" is actually better prompt engineering done on their behalf, better default negative prompts, and aesthetic filtering — the same LLM-mediated prompting idea that DALL-E 3 uses.

Quiz

Test your understanding of the text understanding approaches in DALL-E 3 and Imagen.

What is DALL-E 3's primary innovation for improving prompt adherence?

In the Imagen paper, what did scaling experiments reveal about the bottleneck for text-to-image quality?

Why does T5-XXL produce better text conditioning than CLIP's text encoder for diffusion models?

What does dynamic thresholding in Imagen prevent, and how does it work?