Why Are Most Weight Updates Redundant?

In the previous article we saw that full fine-tuning updates every single parameter in the model. For a 7-billion-parameter LLM, that means computing and storing gradients, optimizer states, and weight updates for all 7 billion numbers. But here's the question that opens the door to LoRA: do we actually need all of those updates? When a pre-trained model adapts to a new task, are all 7 billion dimensions of change contributing useful information, or is most of that update just noise?

The answer turns out to be striking. Aghajanyan et al. (2020) studied the intrinsic dimensionality of fine-tuning and found that the effective change during adaptation lives in a remarkably low-dimensional subspace. Their key finding: a model with 100 million parameters might only need around 200 dimensions of change to adapt to a new task. That's not 200 million, or even 200 thousand — just 200. Even though gradient descent updates millions of parameters during fine-tuning, the meaningful part of that update can be captured in a tiny fraction of the full parameter space.

Think about what this means in matrix terms. If we call the weight update $\Delta W$ (the difference between the fine-tuned weight matrix and the original pre-trained weight matrix), this finding tells us that $\Delta W$ has low rank. The rank of a matrix measures how many linearly independent "directions" it encodes. A weight matrix with shape $4096 \times 4096$ could in principle have rank 4096 (a full-rank matrix, where every direction is unique). But the intrinsic dimensionality hypothesis says that in practice, $\Delta W$ might have an effective rank of 8, 16, or 32 — the remaining thousands of directions are redundant or noise.

If $\Delta W$ is genuinely low-rank, then we should be able to decompose it into a product of two much smaller matrices without losing any meaningful information. And that's precisely the insight behind LoRA (Low-Rank Adaptation): instead of learning the full $\Delta W$ and hoping it ends up low-rank, we force it to be low-rank from the start by parameterizing it as the product of two thin matrices. The result: we get nearly the same adapted model, but train orders of magnitude fewer parameters.
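To see the rank constraint concretely, here is a small PyTorch sketch (dimensions chosen arbitrarily for illustration): the product of a $d \times r$ and an $r \times k$ matrix always has rank at most $r$, no matter what values the factors hold.

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 48, 4

B = torch.randn(d, r)   # thin up-projection factor
A = torch.randn(r, k)   # thin down-projection factor
delta_W = B @ A         # same shape as a full d x k update

print(delta_W.shape)                           # torch.Size([64, 48])
print(int(torch.linalg.matrix_rank(delta_W)))  # 4 — never more than r
print(d * k, "vs", r * (d + k))                # 3072 vs 448
```

The full matrix has $64 \times 48 = 3072$ entries, but it is entirely determined by the $448$ entries of its two factors — the same compression LoRA exploits.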

💡 The intrinsic dimensionality finding doesn't just apply to small models. Aghajanyan et al. showed that larger pre-trained models actually have lower intrinsic dimensionality for downstream tasks, meaning they are even better candidates for low-rank adaptation. The better the pre-training, the less you need to change.

The LoRA Decomposition

The LoRA paper (Hu et al., 2021) introduces a clean, elegant idea. Instead of updating a pre-trained weight matrix $W_0$ directly (which would require storing and optimizing a full $d \times k$ matrix of updates), we freeze $W_0$ completely and add a low-rank bypass alongside it. The forward pass becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

Every symbol in this equation matters, so let's unpack each one carefully.

$x \in \mathbb{R}^k$ is the input vector to this layer — it could be a token embedding or the output of a previous layer. Its dimension is $k$, matching the input dimension of the weight matrix.

$h \in \mathbb{R}^d$ is the output of the layer after the LoRA modification. In the original model, $h = W_0 x$. With LoRA, the output is $h = W_0 x + \frac{\alpha}{r} BA x$ — the base model's computation plus a low-rank correction term.

$W_0 \in \mathbb{R}^{d \times k}$ is the frozen pre-trained weight matrix, where $d$ is the output dimension and $k$ is the input dimension. "Frozen" means these weights do not receive gradient updates during fine-tuning. They are loaded from the pre-trained checkpoint and left untouched. The term $W_0 x$ is just the original model's computation — exactly what the layer would have done before any fine-tuning.

$A \in \mathbb{R}^{r \times k}$ is the down-projection matrix. It takes the input $x$ (a vector of dimension $k$) and compresses it down to $r$ dimensions. You can think of $A$ as asking: "of the $k$ input features, which $r$ linear combinations are most relevant for the task adaptation?" This is the bottleneck — all the information about the update must flow through just $r$ channels.

$B \in \mathbb{R}^{d \times r}$ is the up-projection matrix. It takes the compressed $r$-dimensional representation and projects it back up to $d$ dimensions, producing an output that can be added to $W_0 x$. Together, $A$ compresses and $B$ expands, forming a bottleneck architecture similar to an autoencoder.

$r$ is the rank — the single most important hyperparameter in LoRA. It controls how expressive the low-rank update can be. Typical values are $r \in \{4, 8, 16, 32, 64\}$. The product $BA \in \mathbb{R}^{d \times k}$ has the same shape as $W_0$, but its rank is at most $r$. Since $r \ll \min(d, k)$, this product can only represent a thin slice of all possible $d \times k$ matrices — exactly the low-dimensional subspace we believe the update lives in.

$\alpha$ is a scaling constant (typically set once and left fixed). The ratio $\alpha / r$ controls the magnitude of the LoRA update relative to the base model. Why divide by $r$? Because without this normalization, doubling $r$ would roughly double the magnitude of the update $BA x$, forcing you to also halve the learning rate to compensate. The $\alpha / r$ scaling decouples capacity (controlled by $r$) from magnitude (controlled by $\alpha$ and the learning rate), so you can sweep over $r$ without retuning the learning rate each time.
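A toy numerical check illustrates why the division by $r$ matters. This sketch uses all-ones matrices (an artificial choice, so the $r$-dependence is exact rather than statistical): each entry of $BA$ is then a sum of $r$ ones, so the raw update magnitude grows linearly with $r$, and $\alpha / r$ cancels that growth.

```python
import torch

d = k = 8
x = torch.ones(k)
alpha = 16.0

# With all-ones factors, every entry of BA is a sum of r ones, so the raw
# update magnitude grows linearly with r; the alpha/r factor cancels it.
for r in [2, 4, 8]:
    B = torch.ones(d, r)
    A = torch.ones(r, k)
    raw = (B @ A @ x).norm().item()
    scaled = (alpha / r) * raw
    print(f"r={r}: ||BAx|| = {raw:.1f}, ||(alpha/r) BAx|| = {scaled:.1f}")
```

The unscaled norm doubles every time $r$ doubles, while the $\alpha/r$-scaled norm stays at the same value for every $r$ — exactly the decoupling described above.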

💡 A common convention is to set $\alpha = r$ (so $\alpha / r = 1$) or $\alpha = 2r$. When $\alpha = r$, the scaling factor drops out entirely and the update is simply $BA x$. The HuggingFace PEFT library defaults to $\alpha = 8$.

Now let's walk through the boundary cases to build intuition for what $r$ actually controls.

When $r = \min(d, k)$: the product $BA$ can represent any $d \times k$ matrix (it is no longer constrained to be low-rank). In this limit, LoRA is equivalent to full fine-tuning — the adapter has enough capacity to express any possible weight update. Of course, this also means we've gained nothing in parameter efficiency; we'd be training just as many parameters as the original weight matrix.

When $r = 1$: $A$ is a $1 \times k$ row vector and $B$ is a $d \times 1$ column vector. Their product $BA$ is a $d \times k$ matrix of rank exactly 1 — an outer product of two vectors. This is the most compressed possible LoRA adapter: the entire weight update for this layer is defined by just $d + k$ parameters. It can only shift the output in a single direction. This is extremely constrained, but for tasks that only need a minor adaptation (like adjusting bias in a classification head), rank 1 can sometimes suffice.

When $\alpha / r$ is very large: the LoRA update dominates the forward pass. The term $\frac{\alpha}{r} BA x$ overwhelms $W_0 x$, so the model effectively ignores its pre-trained knowledge. This defeats the purpose of LoRA — we want to make a small, targeted adjustment to the base model, not overwrite it.

When $\alpha / r \to 0$: the LoRA update vanishes and the model behaves exactly like the frozen base model. No adaptation happens regardless of how $A$ and $B$ are trained. In practice, $\alpha / r$ is set to a moderate value (often around 1–2) so the adapter contributes meaningfully without dominating.

The parameter savings are easy to compute. For a single weight matrix $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning requires learning $d \times k$ parameters. LoRA replaces this with $A$ ($r \times k$ parameters) plus $B$ ($d \times r$ parameters):

$$\text{LoRA params} = r \times (d + k)$$

Compare this to the $d \times k$ parameters of full fine-tuning. The ratio is:

$$\text{Compression ratio} = \frac{d \times k}{r \times (d + k)}$$

For a concrete example: take a typical attention projection in a large transformer where $d = k = 4096$ and we choose $r = 16$. Full fine-tuning requires $4096 \times 4096 = 16{,}777{,}216$ trainable parameters for this single matrix. LoRA requires $16 \times (4096 + 4096) = 131{,}072$ parameters — that's a $128\times$ reduction. For a model with hundreds of such matrices, the savings are enormous.
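The arithmetic is easy to reproduce. Here is a small helper (a sketch, not part of any library) for sweeping $r$ at the dimensions used above:

```python
def lora_compression(d: int, k: int, r: int):
    """Full vs LoRA parameter counts, and their ratio, for one weight matrix."""
    full = d * k
    lora = r * (d + k)
    return full, lora, full / lora

for r in [4, 8, 16, 64]:
    full, lora, ratio = lora_compression(4096, 4096, r)
    print(f"r={r:3d}: full={full:,}  lora={lora:,}  ratio={ratio:.0f}x")
```

At $r = 16$ this reproduces the $128\times$ reduction; halving $r$ doubles the ratio, since LoRA's cost is linear in $r$ while the full matrix cost is fixed.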

📌 The parameter count formula $r \times (d + k)$ is per matrix. When people report "LoRA uses 0.1% of parameters", they mean the total across all adapted layers divided by the total model size. A single layer's ratio depends on $r$ and the layer dimensions.

Initialization and Training

A natural question: how should we initialize $A$ and $B$? The answer reveals one of LoRA's cleverest design decisions.

Matrix $B$ is initialized to zero: every entry is 0. Matrix $A$ is initialized from a Gaussian distribution: $A \sim \mathcal{N}(0, \sigma^2)$ (typically Kaiming initialization). The order matters: because $B = 0$, the product $\Delta W = BA = 0 \cdot A = 0$ regardless of what $A$ contains. At the very start of training, the LoRA update contributes nothing. The model begins exactly where pre-training left off — no random perturbation, no initial shock.

Why is this so important? Consider the alternative: if both $A$ and $B$ were initialized randomly, then $\Delta W = BA$ would be a random matrix from the very first forward pass. The model's outputs would be corrupted before training has had a single chance to adjust anything. Loss would spike, and the first several gradient steps would be wasted just recovering from the random initialization noise. By starting from $\Delta W = 0$, LoRA guarantees that the first forward pass produces the same output as the original pre-trained model. Gradients from the very first batch are meaningful, and training is stable from step one.

💡 You might wonder: why zero $B$ and randomize $A$, rather than the reverse? Either choice gives $\Delta W = 0$ at init. The LoRA paper chose $B = 0$ and $A \sim \mathcal{N}(0, \sigma^2)$, but the reverse (or even both Gaussian with a learnable scalar starting at 0) would also work. What matters is that $\Delta W = 0$ at initialization.
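A quick check (pure PyTorch, toy dimensions) confirms that either ordering yields a zero update at initialization:

```python
import torch

d, k, r = 32, 16, 4

# Paper's choice: Gaussian A, zero B
A, B = torch.randn(r, k), torch.zeros(d, r)
print((B @ A).abs().max().item())   # 0.0

# The reverse choice gives a zero product just as well
A, B = torch.zeros(r, k), torch.randn(d, r)
print((B @ A).abs().max().item())   # 0.0
```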

During training, only $A$ and $B$ receive gradient updates. The pre-trained weight $W_0$ is frozen: no gradients are computed for it, and no optimizer states (momentum, variance for Adam) are allocated for it. This is where the memory savings come from. Recall from the previous article that full fine-tuning with Adam requires roughly 16 bytes per parameter (4 for the weight, 4 for the gradient, 4 for the first moment, 4 for the second moment). For a 7B model, that's about 112 GB just for the trainable state.

With LoRA, the frozen $W_0$ weights still occupy memory (about 2 bytes per parameter in half-precision), but they need no gradient buffer or optimizer states. Only the tiny $A$ and $B$ matrices need the full 16-byte-per-parameter treatment. If LoRA parameters are 0.1% of the total, the optimizer state drops from ~84 GB (for the 7B model's 12 bytes of gradient + momentum + variance per param) to under 100 MB. The base model weights still require ~14 GB in fp16, but the overall memory footprint is dramatically reduced compared to full fine-tuning.
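A back-of-the-envelope calculation makes the gap concrete. This sketch ignores activation memory (which both methods need) and assumes the adapter parameters get the full 16-byte fp32 treatment:

```python
total = 7e9   # base model parameters
frac = 0.001  # LoRA trainable fraction (0.1%)

# Full fine-tuning with Adam in fp32: weight + grad + 2 moments = 16 B/param
full_ft_gb = total * 16 / 1e9

# LoRA: fp16 frozen base (2 B/param) + 16 B/param for the adapters only
lora_gb = (total * 2 + total * frac * 16) / 1e9

print(f"Full fine-tuning: ~{full_ft_gb:.0f} GB")  # ~112 GB
print(f"LoRA:             ~{lora_gb:.1f} GB")     # ~14.1 GB
```

Nearly all of LoRA's footprint is the frozen fp16 base weights; the trainable state itself is on the order of 100 MB.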

Let's implement a minimal LoRA layer from scratch to see how this works in code. The class below wraps a frozen nn.Linear layer, adds trainable $A$ and $B$ matrices, and implements the LoRA forward pass:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen base weight and a trainable low-rank adapter."""
    def __init__(self, base_layer: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base_layer = base_layer
        self.r = r
        self.alpha = alpha

        d, k = base_layer.out_features, base_layer.in_features

        # Freeze the base weight — no gradients, no optimizer states
        self.base_layer.weight.requires_grad = False
        if self.base_layer.bias is not None:
            self.base_layer.bias.requires_grad = False

        # A: down-projection (r x k), Gaussian init
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        # B: up-projection (d x r), zero init
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x):
        # Original frozen computation
        base_out = self.base_layer(x)
        # Low-rank bypass: (alpha/r) * x @ A^T @ B^T
        lora_out = (self.alpha / self.r) * (x @ self.A.T @ self.B.T)
        return base_out + lora_out

# Create a base linear layer (e.g., d=512, k=256)
base = nn.Linear(256, 512, bias=False)
lora = LoRALinear(base, r=8, alpha=8.0)

# Count parameters
total = sum(p.numel() for p in lora.parameters())
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
frozen = total - trainable

print(f"Base weight shape: {base.weight.shape}  (d=512, k=256)")
print(f"A shape: {lora.A.shape}  (r=8, k=256)")
print(f"B shape: {lora.B.shape}  (d=512, r=8)")
print(f"\nTotal parameters:     {total:,}")
print(f"Frozen (base W0):     {frozen:,}")
print(f"Trainable (A + B):    {trainable:,}")
print(f"Compression ratio:    {frozen / trainable:.1f}x")

# Verify that delta W = BA = 0 at initialization
delta_W = (lora.alpha / lora.r) * lora.B @ lora.A
print(f"\nDelta W at init (should be all zeros): max|BA| = {delta_W.abs().max().item():.6f}")

The output confirms that only $A$ and $B$ are trainable (6,144 parameters total), the base weight is frozen (131,072 parameters), and $\Delta W = BA = 0$ at initialization. In a real training loop, loss.backward() would compute gradients only for $A$ and $B$, and the optimizer would only allocate momentum and variance buffers for those 6,144 parameters.
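The gradient-flow claims can be verified directly. The sketch below is self-contained (a frozen linear layer plus bare $A$ and $B$ parameters, mirroring the class above with $\alpha / r = 1$) and also exposes a subtlety of the zero init:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(16, 32, bias=False)
base.weight.requires_grad = False            # frozen W0

A = nn.Parameter(torch.randn(4, 16) * 0.01)  # r=4, Gaussian init
B = nn.Parameter(torch.zeros(32, 4))         # zero init

x = torch.randn(8, 16)
loss = (base(x) + x @ A.T @ B.T).sum()       # alpha/r = 1 for simplicity
loss.backward()

print(base.weight.grad)        # None — frozen weights accumulate no gradient
print(B.grad.abs().sum() > 0)  # tensor(True) — B learns from step one
# Because B = 0, the gradient of A is exactly zero on the very first step
# (dL/dA flows through B); A starts moving once B becomes nonzero.
print(A.grad.abs().max().item())  # 0.0
```

Note the asymmetry: with the paper's init, $B$ receives a nonzero gradient immediately while $A$'s first-step gradient is exactly zero, since $\partial L / \partial A$ is proportional to $B$.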

Which Layers to Adapt?

LoRA can be applied to any linear layer in a transformer, but in practice, which layers should we target? Not all layers contribute equally to task adaptation, and targeting more layers means more trainable parameters, so there's a meaningful trade-off to consider.

The original LoRA paper experimented primarily with the attention projection matrices in each transformer block. A standard multi-head attention layer has four linear projections: $W_Q$ (queries), $W_K$ (keys), $W_V$ (values), and $W_O$ (output projection). Hu et al. found that adapting just $W_Q$ and $W_V$ was sufficient to match full fine-tuning performance on many NLP benchmarks, while leaving $W_K$ and $W_O$ frozen.

However, subsequent work and empirical practice have moved toward a broader set of targets. The HuggingFace PEFT library, for instance, commonly targets all linear layers in the attention block ($W_Q$, $W_K$, $W_V$, $W_O$) as well as the MLP layers (the feed-forward sub-layers, often called gate_proj, up_proj, and down_proj in LLaMA-style architectures). The general rule of thumb that has emerged:

💡 Targeting all linear layers with a lower rank (e.g., $r = 8$ across every layer) often outperforms targeting fewer layers with a higher rank (e.g., $r = 64$ on just $W_Q$ and $W_V$), even when both configurations use roughly the same total number of trainable parameters. Spreading the adapter budget across more layers lets the model make small adjustments everywhere, rather than large adjustments in a few places.

To make this concrete, consider a single transformer block in a LLaMA-7B-style model where $d_{\text{model}} = 4096$ and $d_{\text{ffn}} = 11008$. The following breakdown shows how the trainable parameter count grows as you add more target modules:

  • $W_Q, W_V$ only (original LoRA, $r = 16$): $2 \times 16 \times (4096 + 4096) = 262{,}144$ params per block
  • All attention ($W_Q, W_K, W_V, W_O$, $r = 16$): $4 \times 16 \times (4096 + 4096) = 524{,}288$ params per block
  • All attention + MLP ($r = 16$): attention contributes 524,288 and MLP adds $3 \times 16 \times (4096 + 11008) = 724{,}992$, totaling $1{,}249{,}280$ params per block

For a 32-layer model, the "all linear" configuration at $r = 16$ gives about 40M trainable parameters — still less than 0.6% of the 7B total. In practice, the choice often comes down to your compute budget and the complexity of the task: simple classification tasks may only need $W_Q$ and $W_V$ adaptation, while instruction-following or code generation typically benefits from adapting all linear layers.
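These counts can be reproduced in a few lines, using the layer shapes assumed above for a LLaMA-7B-style block:

```python
def lora_params(shapes, r):
    """Trainable LoRA parameters for a list of (d, k) layer shapes."""
    return sum(r * (d + k) for d, k in shapes)

d_model, d_ffn, r, n_layers = 4096, 11008, 16, 32
attn = [(d_model, d_model)] * 4                               # W_Q, W_K, W_V, W_O
mlp = [(d_ffn, d_model), (d_ffn, d_model), (d_model, d_ffn)]  # gate, up, down

per_block = lora_params(attn + mlp, r)
total = per_block * n_layers
print(f"{per_block:,} per block")         # 1,249,280
print(f"{total:,} total")                 # 39,976,960
print(f"{100 * total / 7e9:.2f}% of 7B")  # 0.57%
```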

Once you've chosen which layers to adapt and trained your LoRA adapters, the next question is what happens at inference time. Does the extra low-rank bypass slow things down? The answer is one of LoRA's most compelling features.

Merging: Zero-Cost Inference

One of the most elegant properties of LoRA is what happens at inference time. During training, the forward pass computes $W_0 x + \frac{\alpha}{r} BA x$ — two matrix multiplications instead of one, which adds some latency. But once training is finished, we can merge the adapter directly into the base weights:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} B A$$

After merging, the model is a single weight matrix $W_{\text{merged}}$ with exactly the same shape as $W_0$. The forward pass reverts to a single matrix multiplication: $h = W_{\text{merged}} \, x$. There is zero additional inference latency — no adapter overhead, no extra computation, no separate adapter pathway. The deployed model is architecturally identical to the original base model. You could hand someone the merged weights and they would have no way to tell that LoRA was used during training.

This is a significant advantage over other parameter-efficient methods like adapters (which insert extra layers that add latency) or prompt tuning (which prepends extra tokens that increase sequence length). LoRA's merge property means the efficiency gains are purely at training time, with no cost at inference.
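The merge identity is easy to verify numerically. In this sketch, random matrices stand in for trained weights at toy dimensions:

```python
import torch

torch.manual_seed(0)
d, k, r, alpha = 32, 16, 4, 8.0

W0 = torch.randn(d, k)
B = torch.randn(d, r)   # pretend training has made B nonzero
A = torch.randn(r, k)
x = torch.randn(k)

# Unmerged: two matmul paths per forward pass
unmerged = W0 @ x + (alpha / r) * (B @ A @ x)

# Merged: fold the adapter into W0 once, then a single matmul from then on
W_merged = W0 + (alpha / r) * (B @ A)
merged = W_merged @ x

print(torch.allclose(unmerged, merged, atol=1e-4))  # True
```

The two outputs agree up to floating-point rounding, which is why merging changes nothing about the model's behavior — only its compute graph.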

But merging isn't always what you want. In a multi-task serving scenario, you might have one base model and dozens of LoRA adapters — one fine-tuned for customer-support tone, another for legal document summarization, another for code generation. Instead of deploying a separate merged model for each task (which would multiply your GPU memory by the number of tasks), you can keep a single copy of the base model in memory and swap LoRA adapters per request. Loading an adapter means loading just the small $A$ and $B$ matrices, which at $r = 16$ for all layers might only be 20–40 MB — trivial compared to the multi-gigabyte base model.

💡 Frameworks like vLLM and LoRAX can serve hundreds of LoRA adapters on a single base model, dynamically routing each request to the appropriate adapter. This makes LoRA not just a training efficiency trick, but a deployment architecture.

The code below demonstrates merging with HuggingFace PEFT. After training, a single call to merge_and_unload() folds the adapters into the base weights and returns a standard model:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model + trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
peft_model = PeftModel.from_pretrained(base_model, "path/to/lora-adapter")

# Merge adapter into base weights: W_merged = W0 + (alpha/r) * B @ A
merged_model = peft_model.merge_and_unload()

# merged_model is now a standard transformers model — no adapter overhead
# Save it as a regular model checkpoint
merged_model.save_pretrained("path/to/merged-model")

# Inference is identical to the base model: single matrix multiply per layer
outputs = merged_model.generate(input_ids, max_new_tokens=100)

After merge_and_unload(), the PEFT wrapper is removed entirely. The returned model is a plain transformers model with merged weights, ready for deployment with no PEFT dependency required at serving time.

LoRA in Practice with HuggingFace PEFT

Now let's put it all together with a realistic configuration using the HuggingFace PEFT library (Mangrulkar et al., 2022). PEFT handles all the plumbing we built from scratch above: it freezes the base model, injects LoRA matrices into the specified layers, and makes sure only the adapter parameters are updated during training. The key configuration object is LoraConfig:

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# 1. Load a pre-trained model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank — capacity of the adapter
    lora_alpha=32,                 # Scaling: alpha/r = 32/16 = 2.0
    target_modules=[               # Which layers get LoRA adapters
        "q_proj", "k_proj",
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,             # Dropout on the LoRA path for regularisation
    bias="none",                   # Don't train biases
    task_type=TaskType.CAUSAL_LM,  # Tells PEFT this is autoregressive LM
)

# 3. Wrap the model with LoRA adapters
peft_model = get_peft_model(model, lora_config)

# 4. Inspect trainable parameters
peft_model.print_trainable_parameters()
# => trainable params: 39,976,960 || all params: 6,778,957,824 || trainable%: 0.5897

Let's walk through each parameter in LoraConfig:

  • r=16: the rank. Higher = more capacity = more trainable parameters. Start with 8 or 16 for most tasks; go up to 64 if you see underfitting.
  • lora_alpha=32: the numerator of the $\alpha / r$ scaling factor. With $r = 16$, this gives $\alpha / r = 2.0$. A common recipe is $\alpha = 2r$.
  • target_modules=[...]: which layers receive LoRA adapters. The list above targets all linear layers in a LLaMA block. You can also pass "all-linear" as a shorthand in recent PEFT versions.
  • lora_dropout=0.05: applies dropout to the LoRA path during training. This acts as regularisation, preventing the low-rank adapter from overfitting to small datasets. Set to 0 if you have abundant data.
  • bias="none": controls whether bias terms are also trained. "none" means all biases stay frozen. You can set "all" or "lora_only" to additionally train biases.

Once the model is wrapped, training works exactly like standard PyTorch training. The optimizer only sees the LoRA parameters (since everything else has requires_grad=False), so memory usage is drastically reduced. You can pass the wrapped model directly to HuggingFace's Trainer or SFTTrainer from the trl library:

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,            # LoRA often uses a higher LR than full fine-tuning
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

# Save only the adapter weights (tiny checkpoint: ~80 MB vs ~14 GB for full model)
peft_model.save_pretrained("./lora-adapter")

Notice the learning rate: 2e-4 is significantly higher than what you'd use for full fine-tuning (typically 1e-5 to 5e-5). LoRA adapters benefit from higher learning rates because the low-rank parameterization constrains the update space, making it harder to overfit even with aggressive steps. And when you save the adapter with save_pretrained(), only the $A$ and $B$ matrices are saved — typically 20–80 MB compared to the 14 GB base model. You can share adapters on the HuggingFace Hub and anyone with the base model can apply them.

📌 LoRA adapters are tied to the specific base model they were trained on. A LoRA adapter trained on LLaMA-2-7B cannot be applied to Mistral-7B, even though both have the same architecture and size. The adapter captures a delta relative to specific base weights, so the base must match exactly.

Quiz

Test your understanding of LoRA mechanics and design choices.

Why is matrix $B$ initialized to zero rather than randomly?

For a weight matrix $W_0 \in \mathbb{R}^{4096 \times 4096}$ with LoRA rank $r = 8$, how many trainable parameters does the LoRA adapter add?

What does the scaling factor $\alpha / r$ in the LoRA forward pass accomplish?

Why does LoRA add zero inference latency after merging?