Text Isn't Enough
A text prompt like "a sunset over mountains" gives you a sunset over mountains, but which sunset? Which mountains? From what camera angle? With what depth layout? In whose painting style? Text is a high-level, semantic control signal — it specifies what to generate but gives almost no control over how to generate it. Every time you run the same prompt, the model invents a new composition, pose, perspective, and colour palette. For creative professionals, this randomness is a problem: an illustrator needs a character in a specific pose, an architect needs a building rendered from a specific depth map, and a brand designer needs outputs that match a specific visual style.
How do you specify exact pose, edge structure, depth layout, colour palette, or the style of a specific artist? The image-generation community developed several complementary control mechanisms, each targeting a different axis of control. This article covers the four most important ones: ControlNet (structural conditions like edges and depth), IP-Adapter (image-based style and reference conditioning), LoRA (lightweight model customisation), and Textual Inversion (teaching new concepts through embeddings). Together, they form a modular control stack that can be mixed, matched, and composed.
ControlNet: Adding Structural Conditions
The most direct way to control image structure is to provide an explicit spatial condition: an edge map, a depth map, a pose skeleton, a segmentation mask. ControlNet (Zhang et al., 2023) introduced a clean architectural pattern for injecting such conditions into a pretrained diffusion model without destroying its learned capabilities.
The key idea: clone the encoder half of the U-Net, train the clone on (condition, image) pairs, and inject its outputs back into the original frozen model via zero-initialised convolutions. The original U-Net's weights are completely frozen; they never change during ControlNet training. The cloned encoder (the "trainable copy") processes the spatial condition (e.g., a Canny edge map) and produces feature maps at each resolution level. These feature maps are then added to the frozen U-Net's skip connections, giving the denoiser structural guidance at every scale.
Why clone the encoder rather than train a new network from scratch? Because the cloned encoder starts with all the pretrained features the base model already learned — texture detectors, edge responses, semantic groupings. Training from scratch would take far longer and require far more data to rediscover these features. The clone reuses them immediately and only needs to learn how to map from the new condition modality (edges, depth, pose) to the existing feature space.
The critical architectural detail is the zero-conv, a $1 \times 1$ convolution whose weights and biases are both initialised to zero:

$$y = Wx + b, \qquad W = 0, \; b = 0 \text{ at initialisation}$$
Let's check what happens at the boundaries. At the start of training, $W = 0$ and $b = 0$, so the output $y = 0$ for any input $x$. This means the ControlNet contributes exactly nothing to the frozen U-Net at initialisation — the base model's behaviour is perfectly preserved from step zero. As training progresses, the zero-conv weights gradually learn non-zero values, and the ControlNet's influence smoothly increases from zero. If the weights somehow grew to very large values, the ControlNet features would dominate the skip connections and overwhelm the base model's own features, but in practice gradient descent finds a balanced regime because the training loss penalises both ignoring the condition and destroying image quality.
The condition can be almost any spatial signal that aligns with the output image resolution:
- Canny edge maps: binary edges extracted from a reference image. Controls the outline structure.
- Depth maps: per-pixel depth estimated by models like MiDaS. Controls the 3D layout and perspective.
- OpenPose skeletons: keypoint-based human body and hand poses. Controls character posture.
- Segmentation maps: semantic labels (sky, ground, building) that control the spatial arrangement of object categories.
- Normal maps: surface orientation vectors. Controls the 3D surface geometry and lighting response.
Each condition type requires a separately trained ControlNet model, since the mapping from Canny edges to image features is very different from the mapping from depth maps to image features. However, multiple ControlNets can be stacked at inference time — for example, depth + pose simultaneously — by simply adding the feature contributions from each ControlNet to the same skip connections. The outputs are additive, so they compose naturally.
# ControlNet's zero-conv: output is exactly zero at initialisation
import random
# Simulate a 1x1 conv with zero-initialised weights and bias
# For a real conv: y = W * x + b, with W=0 and b=0
W = 0.0 # weight initialised to zero
b = 0.0 # bias initialised to zero
# Input feature values (arbitrary)
inputs = [random.uniform(-5, 5) for _ in range(6)]
print("Zero-conv at initialisation (W=0, b=0):")
print(f" Inputs: {[f'{x:.2f}' for x in inputs]}")
outputs = [W * x + b for x in inputs]
print(f" Outputs: {[f'{y:.2f}' for y in outputs]}")
print(" => ControlNet contributes NOTHING to the frozen U-Net")
print()
# After some training, W and b become non-zero
W_trained = 0.35
b_trained = 0.02
print(f"After training (W={W_trained}, b={b_trained}):")
outputs_trained = [W_trained * x + b_trained for x in inputs]
print(f" Inputs: {[f'{x:.2f}' for x in inputs]}")
print(f" Outputs: {[f'{y:.2f}' for y in outputs_trained]}")
print(" => ControlNet now contributes meaningful features")
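The stacking behaviour described above is just as easy to make concrete: each ControlNet's features add into the same skip connection, so composing depth + pose is a sum. The sketch below uses toy 1-D "feature maps" and made-up per-ControlNet conditioning scales; real ControlNets produce multi-resolution 4-D tensors.

# Stacking ControlNets: each one's features ADD into the same skip connection
base_skip = [0.8, -0.2, 0.5, 1.1]    # frozen U-Net skip features (toy values)
depth_feats = [0.1, 0.3, -0.1, 0.0]  # depth ControlNet contribution (toy)
pose_feats = [-0.2, 0.0, 0.2, 0.1]   # pose ControlNet contribution (toy)

# Hypothetical per-ControlNet conditioning scales, as exposed in common UIs
depth_scale, pose_scale = 1.0, 0.8

combined = [s + depth_scale * de + pose_scale * po
            for s, de, po in zip(base_skip, depth_feats, pose_feats)]
print("Base skip features:     ", base_skip)
print("With depth + pose added:", [f"{x:.2f}" for x in combined])

Because the contributions are purely additive, dialling either scale to zero removes that ControlNet's influence without affecting the other.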
IP-Adapter: Image Prompt Conditioning
ControlNet gives structural control, but what if you want to control style? Try describing the visual style of a specific painting, the exact appearance of a face, or the precise colour palette of a photograph in text. It is extremely difficult. Text is a lossy channel for visual information: describing a style that took an artist years to develop in a few words inevitably loses most of the nuance.
IP-Adapter (Ye et al., 2023) solves this by conditioning on an image instead of (or in addition to) text. Give it a reference painting and it transfers the style. Give it a face photo and it preserves the identity. The reference image speaks directly in the visual domain, bypassing the lossy text bottleneck entirely.
How does it work architecturally? A pretrained CLIP image encoder extracts features from the reference image, producing a sequence of image tokens. These tokens are then injected into the diffusion model via a separate cross-attention layer that runs in parallel to the existing text cross-attention. This is the key design choice: the text pathway and the image-prompt pathway are decoupled, each with its own cross-attention keys and values.
In the standard text-conditioned diffusion model, each cross-attention layer computes:

$$\text{Attn}_{\text{text}} = \text{softmax}\!\left(\frac{Q K_{\text{text}}^\top}{\sqrt{d}}\right) V_{\text{text}}$$

where $Q$ comes from the noisy image features (queries), and $K_{\text{text}}, V_{\text{text}}$ come from the text encoder output. IP-Adapter adds a parallel cross-attention with its own learned projection weights:

$$\text{Attn}_{\text{ref}} = \text{softmax}\!\left(\frac{Q K_{\text{ref}}^\top}{\sqrt{d}}\right) V_{\text{ref}}$$

where $K_{\text{ref}}, V_{\text{ref}}$ are projected from the CLIP image features of the reference image. The two outputs are then combined with a weighting parameter:

$$\text{output} = \text{Attn}_{\text{text}} + \lambda \cdot \text{Attn}_{\text{ref}}$$
Let's check the boundaries of $\lambda$. When $\lambda = 0$, the IP-Adapter contribution vanishes entirely and the model behaves exactly as the original text-conditioned model — the reference image has no effect. When $\lambda = 1$, the image-prompt features are weighted equally with the text features. As $\lambda$ grows beyond 1, the reference image increasingly dominates: the output will match the reference style or identity more closely, but may start ignoring the text prompt. In practice, $\lambda \in [0.5, 1.0]$ gives a good balance between text controllability and reference fidelity.
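The boundary behaviour is easy to verify numerically. Below is a minimal pure-Python sketch of the decoupled cross-attention: the query, key, and value vectors are made-up toy numbers, and real models use multi-head attention over full token sequences, but the combination rule is the one above.

# Decoupled cross-attention: output = Attn(Q, K_text, V_text) + lambda * Attn(Q, K_ref, V_ref)
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, keys, values):
    # scaled dot-product attention for a single query vector
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += w * vi
    return out

q = [0.2, -0.1, 0.4, 0.3]                                  # toy image-feature query
K_text = [[0.5, 0.1, 0.0, 0.2], [0.0, 0.3, 0.1, 0.4]]      # toy text keys
V_text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]      # toy text values
K_ref = [[0.1, 0.4, 0.2, 0.0], [0.3, 0.0, 0.5, 0.1]]       # toy CLIP-image keys
V_ref = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]       # toy CLIP-image values

attn_text = attention(q, K_text, V_text)
attn_ref = attention(q, K_ref, V_ref)

print("lambda sweep: output = attn_text + lambda * attn_ref")
for lam in [0.0, 0.5, 1.0, 2.0]:
    combined = [t + lam * r for t, r in zip(attn_text, attn_ref)]
    print(f"  lambda={lam:.1f}: {[f'{x:.2f}' for x in combined]}")
print("  lambda=0.0 reproduces the pure text-conditioned output exactly")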
Because the text and IP-Adapter pathways are decoupled, they control different aspects of the output without interfering with each other. The text prompt still controls the scene (what to generate), while the reference image controls appearance (how it looks). Common use cases include:
- Style transfer: use a painting as the reference image. The output follows the text prompt's content but renders in the painting's style.
- Face consistency: use a face photograph as the reference. The output preserves the person's identity across different scenes described by text.
- Object preservation: use a product photo as the reference. The output places that specific product in new contexts.
LoRA for Diffusion: Lightweight Customisation
ControlNet and IP-Adapter control structure and style at inference time, but what if you want the model itself to permanently learn a new style, a specific character, or a novel concept? This is where LoRA (Low-Rank Adaptation) enters: the same technique covered in the fine-tuning track (see the LoRA article), now applied to diffusion U-Nets and DiTs instead of language models.
The idea is identical: freeze the base model's weight matrix $W_0$ and train two small low-rank matrices $A$ and $B$ such that the effective weight becomes $W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$:

$$W = W_0 + \Delta W = W_0 + BA$$
Since $B$ is initialised to zero, the product $BA = 0$ at the start of training, so the model's output is unchanged at step zero — the same "start as identity" principle we saw in ControlNet's zero-conv. The rank $r$ is typically between 4 and 32 for diffusion models, targeting the attention layers (query, key, value, and output projections) in the U-Net or DiT. Because $r$ is so small compared to $d$ (which can be 1024 or more), a LoRA adds only a tiny fraction of new parameters.
For diffusion models, three types of LoRA have become especially common:
- Style LoRAs: trained on 20-50 images in a specific artistic style (anime, watercolour, pixel art, a specific illustrator's style). After training, any prompt generates in that style. The LoRA has learned the colour palettes, brushstroke patterns, and compositional preferences that define the style.
- Character LoRAs: trained on 10-20 images of a specific character (fictional or real). The model learns the character's consistent visual features — face shape, hair, clothing, proportions — and can render them in new poses and scenes.
- Concept LoRAs: trained on images of a specific object, product, or location. Teaches the model a new visual concept it has never seen: your specific product, your specific building, your specific pet.
LoRA files are tiny: typically 5 to 200 MB, compared to several gigabytes for the base model. This small size has an important consequence: LoRAs can be stacked and combined. You can apply a style LoRA and a character LoRA simultaneously, each weighted by a scalar. The effective weight becomes:

$$W = W_0 + \sum_i w_i B_i A_i$$
where $w_i$ controls each LoRA's influence. When all $w_i = 0$, we recover the base model exactly. As any $w_i$ increases, that LoRA's learned behaviour becomes stronger. In practice, $w_i \in [0.5, 1.0]$ for each LoRA works well; values much above 1.0 tend to oversaturate the style and degrade image quality.
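A small numerical sketch of both boundary behaviours, using tiny matrices with made-up values: when every $w_i = 0$ (or $B = 0$ at initialisation) the update vanishes and the base weight is untouched, and stacked LoRAs simply add as a weighted sum.

# Weighted LoRA stacking on a toy 3x3 weight matrix (made-up values).
# Real LoRAs act on d x d attention projections with d ~ 1024.

def matmul(X, Y):
    # naive matrix multiply: (n x k) @ (k x m) -> (n x m)
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d = 3
W0 = [[1.0, 0.2, 0.0], [0.0, 1.0, 0.3], [0.1, 0.0, 1.0]]  # frozen base weight

# Style LoRA (rank 1): B is d x r, A is r x d
B_style = [[0.5], [0.0], [0.2]]
A_style = [[0.1, 0.4, 0.0]]
# Character LoRA (rank 1)
B_char = [[0.0], [0.3], [0.1]]
A_char = [[0.2, 0.0, 0.5]]

def effective_weight(w_style, w_char):
    dW_style = matmul(B_style, A_style)
    dW_char = matmul(B_char, A_char)
    return [[W0[i][j] + w_style * dW_style[i][j] + w_char * dW_char[i][j]
             for j in range(d)] for i in range(d)]

print("w_style=0, w_char=0 recovers the base model exactly:",
      effective_weight(0.0, 0.0) == W0)
print("w_style=0.8, w_char=0.7 blends both adapters:")
for row in effective_weight(0.8, 0.7):
    print(" ", [f"{x:.2f}" for x in row])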
This composability and small file size created a thriving ecosystem. Platforms like CivitAI host thousands of community-trained LoRAs — for styles, characters, concepts, poses, lighting setups — all composable with any compatible base model. A user can download a base model like SDXL, then layer on a style LoRA, a character LoRA, and a lighting LoRA, each made by a different creator, and combine them in a single generation.
# LoRA parameter count vs full model — diffusion U-Net example
d = 1024 # typical hidden dimension in SDXL U-Net attention layers
num_layers = 70 # approximate number of attention projections (Q, K, V, Out across blocks)
print("LoRA parameter counts for a diffusion model")
print("=" * 55)
print(f"Hidden dimension d = {d}")
print(f"Number of target layers = {num_layers}")
print(f"Full parameter count per layer = d * d = {d*d:,}")
print(f"Full params (all layers) = {d*d*num_layers:,}")
print()
print(f"{'Rank r':<10} {'Params/layer':<16} {'Total LoRA':<16} {'% of full':<10}")
print("-" * 55)
for r in [4, 8, 16, 32]:
    per_layer = 2 * d * r  # A is r x d, B is d x r
    total = per_layer * num_layers
    pct = total / (d * d * num_layers) * 100
    print(f"{r:<10} {per_layer:<16,} {total:<16,} {pct:<10.2f}%")
print()
print("Even r=32 is <7% of the full model parameters")
Textual Inversion: Teaching New Words
What if we want something even simpler than LoRA — a way to teach the model a new concept without changing any model weights at all? Textual Inversion (Gal et al., 2022) does exactly this. Instead of modifying the denoiser, it learns a single new text embedding — a vector in the text encoder's embedding space — that represents a visual concept.
The setup: you have 3-5 images of a concept you want the model to learn (say, your pet dog). You introduce a new placeholder token $v^*$ into the text encoder's vocabulary and initialise its embedding randomly. Then you train only that embedding vector $v^* \in \mathbb{R}^{d_{\text{text}}}$ while keeping the entire diffusion model and the rest of the text encoder frozen. The training objective is the same standard diffusion loss — predict the noise added to images of your concept — but the only free parameter being optimised is the single embedding vector $v^*$.
$$\min_{v^*} \; \mathbb{E}_{x, \epsilon, t}\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c_{\text{text}}(v^*)) \right\rVert^2 \right]$$

Here $c_{\text{text}}(v^*)$ is the text conditioning that includes the learned token $v^*$. Only $v^*$ receives gradients; everything else is frozen. After training, the model treats $v^*$ as an ordinary word. You can write prompts like "a painting of $v^*$ in the style of Van Gogh" or "$v^*$ sitting on a beach at sunset", and the model composes $v^*$ with the rest of the prompt just as it would compose any two words.
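The mechanics can be illustrated with a toy stand-in: a frozen linear "model" plays the role of the frozen denoiser and text encoder, and gradient descent updates only one embedding vector. Everything here (the linear map, the dimensions, the learning rate) is a made-up analogue, not the actual diffusion objective.

# Toy analogue of Textual Inversion: optimise ONE vector against a frozen model
import random

random.seed(0)
d = 8  # toy embedding dimension (real: 768 or 1024)

# Frozen "model": a fixed random linear map that is never updated
M = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

def frozen_model(v):
    return [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]

# Target output produced by a hidden "true" concept embedding
v_true = [random.gauss(0, 1) for _ in range(d)]
target = frozen_model(v_true)

# The ONLY trainable parameter: the new token embedding v*
v_star = [0.0] * d
lr = 0.01
for step in range(2001):
    pred = frozen_model(v_star)
    err = [p - t for p, t in zip(pred, target)]
    # gradient of 0.5 * ||M v - target||^2 w.r.t. v is M^T err
    grad = [sum(M[i][j] * err[i] for i in range(d)) for j in range(d)]
    v_star = [vj - lr * g for vj, g in zip(v_star, grad)]
    if step % 500 == 0:
        loss = 0.5 * sum(e * e for e in err)
        print(f"step {step:4d}  loss {loss:.6f}")
print("Only v* was updated; the frozen 'model' M never changed.")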
The advantage is extreme simplicity: the learned artifact is a single vector (typically 768 or 1024 floats, just a few kilobytes). It's trivially shareable, composable with any prompt, and cannot break the model because it modifies nothing. Multiple textual inversions can coexist — each just adds one new embedding to the vocabulary.
The disadvantage is equally clear: a single embedding vector has limited expressiveness. It must compress everything about a concept — shape, colour, texture, identity — into one point in embedding space. For simple concepts (a specific texture, a colour palette), this works well. For complex concepts with many distinguishing features (a detailed character, a nuanced artistic style), a single vector is not enough, and LoRA (which modifies thousands of parameters across the denoiser) will capture far more detail.
# Textual Inversion vs LoRA: what gets trained?
d_text = 768 # CLIP text embedding dimension (SD 1.5)
d_model = 1024 # U-Net hidden dimension
num_lora_layers = 70
lora_rank = 8
# Textual inversion: ONE embedding vector
ti_params = d_text
ti_bytes = ti_params * 4 # float32
# LoRA: low-rank matrices across many layers
lora_params = 2 * d_model * lora_rank * num_lora_layers
lora_bytes = lora_params * 4
# Full model (approximate for SD 1.5 U-Net)
full_params = 860_000_000
full_bytes = full_params * 4
print("Parameter comparison")
print("=" * 50)
print(f"{'Method':<22} {'Parameters':<16} {'File size':<14}")
print("-" * 50)
print(f"{'Textual Inversion':<22} {ti_params:<16,} {ti_bytes / 1024:.1f} KB")
print(f"{'LoRA (r=8)':<22} {lora_params:<16,} {lora_bytes / (1024**2):.1f} MB")
print(f"{'Full model':<22} {full_params:<16,} {full_bytes / (1024**3):.1f} GB")
print()
print("Textual Inversion: extreme simplicity, one vector, a few KB")
print("LoRA: much more expressive, but still <1% of the full model")
Combining Controls
The real power of these techniques is that they are composable. Each one controls a different axis of the generation process, and they can be stacked without interfering with each other:
- Text prompt: controls the scene content ("a woman standing in a garden").
- ControlNet: controls the spatial structure (the pose skeleton specifies her exact posture, the depth map defines the garden layout).
- IP-Adapter: controls the visual style (a reference painting sets the colour palette and brushstroke style).
- LoRA: controls the character (a character LoRA ensures the woman has a consistent, specific appearance across all generations).
- Textual Inversion: controls a specific concept (a learned token $v^*$ representing a specific flower variety that fills the garden).
Why does this composability work? Because each mechanism operates at a different point in the architecture:
- Textual Inversion modifies the input to the text encoder (one new embedding in the vocabulary).
- IP-Adapter adds a parallel cross-attention pathway (separate from text cross-attention).
- ControlNet adds to the U-Net's skip connections (structural features at each resolution).
- LoRA modifies the attention weights themselves (low-rank additive updates).
Since they modify different parts of the network, their effects are largely orthogonal. This modularity is why the Stable Diffusion ecosystem became so rich: the base model is a foundation, and the community builds an ever-growing library of control modules that plug into it. A single generation can combine a base model, two ControlNets (depth + pose), an IP-Adapter reference, a style LoRA, a character LoRA, and a textual inversion token — each contributed by a different creator, each controlling a different aspect of the final image.
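As a concrete illustration, here is a hedged sketch of what such a stack looks like with the Hugging Face diffusers library. The library choice is ours, not the article's; the model IDs, file names, tokens, and weights below are illustrative placeholders, and exact APIs vary across diffusers versions.

# Sketch: composing all four control mechanisms with Hugging Face diffusers.
# All file names and adapter weights are placeholders for this illustration.
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Preprocessed condition and reference images (hypothetical local files)
depth_map = Image.open("depth.png")
pose_map = Image.open("pose.png")
reference_painting = Image.open("style_reference.png")

# Two stacked ControlNets: depth + pose
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth"),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose"),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets
)

# IP-Adapter: the parallel image-prompt cross-attention; scale plays the role of lambda
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)

# Style + character LoRAs, each with its own weight w_i
pipe.load_lora_weights("style_lora.safetensors", adapter_name="style")
pipe.load_lora_weights("character_lora.safetensors", adapter_name="character")
pipe.set_adapters(["style", "character"], adapter_weights=[0.8, 0.7])

# Textual inversion: one learned token added to the vocabulary
pipe.load_textual_inversion("my_concept.bin", token="<my-concept>")

image = pipe(
    "a woman standing in a garden full of <my-concept>",
    image=[depth_map, pose_map],         # one condition image per ControlNet
    ip_adapter_image=reference_painting,
).images[0]
image.save("composed.png")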
Quiz
Test your understanding of diffusion model control and customisation techniques.
Why does ControlNet initialise its convolution layers with zero weights and zero biases (zero-conv)?
In IP-Adapter, what does the weighting parameter $\lambda$ control?
What is the main limitation of Textual Inversion compared to LoRA for learning a new visual concept?
Why can ControlNet, IP-Adapter, LoRA, and Textual Inversion be composed in a single generation without interfering with each other?