Why Are Most Weight Updates Redundant?

In the previous article we saw that full fine-tuning updates every single parameter in the model. For a 7-billion-parameter LLM, that means computing and storing gradients, optimizer states, and weight updates for all 7 billion numbers. But here is the question that opens the door to LoRA: do we actually need all of those updates? When a pre-trained model adapts to a new task, are all 7 billion dimensions of change contributing useful information, or is most of that update just noise?

The answer turns out to be striking. Aghajanyan et al. (2020) studied the intrinsic dimensionality of fine-tuning and found that the effective change during adaptation lives in a remarkably low-dimensional subspace. Their key finding: a model with 100 million parameters might only need around 200 dimensions of change to adapt to a new task. That's not 200 million, or even 200 thousand — just 200. Even though gradient descent updates millions of parameters during fine-tuning, the meaningful part of that update can be captured in a tiny fraction of the full parameter space.

Think about what this means in matrix terms. If we call the weight update $\Delta W$ (the difference between the fine-tuned weight matrix and the original pre-trained weight matrix), this finding tells us that $\Delta W$ has low rank. The rank of a matrix measures how many linearly independent "directions" it encodes. A weight matrix with shape $4096 \times 4096$ could in principle have rank 4096 (a full-rank matrix, where every direction is unique). But the intrinsic dimensionality hypothesis says that in practice, $\Delta W$ might have an effective rank of 8, 16, or 32 — the remaining thousands of directions are redundant or noise.

If $\Delta W$ is genuinely low-rank, then we should be able to decompose it into a product of two much smaller matrices without losing any meaningful information. And that's precisely the insight behind LoRA (Low-Rank Adaptation): instead of learning the full $\Delta W$ and hoping it ends up low-rank, we force it to be low-rank from the start by parameterizing it as the product of two thin matrices. The result: we get nearly the same adapted model, but train orders of magnitude fewer parameters.
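This decomposition idea is easy to see numerically. The sketch below (arbitrary small dimensions, not from the paper) builds a genuinely rank-$r$ update matrix and recovers an exact thin factorization $BA$ with a truncated SVD:

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 48, 4

# Build a delta_W that is genuinely rank-r: a product of two thin random matrices
delta_W = torch.randn(d, r) @ torch.randn(r, k)

# A truncated SVD recovers a rank-r factorization B @ A with no information loss
U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
B = U[:, :r] * S[:r]   # d x r, singular values folded into the columns
A = Vh[:r, :]          # r x k

err = (delta_W - B @ A).abs().max().item()
print(f"Full matrix entries: {d * k}, factored entries: {r * (d + k)}")
print(f"Max reconstruction error: {err:.2e}")
```

Storing 448 numbers instead of 3,072 loses nothing here, because the matrix truly had rank 4 to begin with. LoRA bets that fine-tuning updates behave the same way.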

💡 The intrinsic dimensionality finding doesn't just apply to small models. Aghajanyan et al. showed that larger pre-trained models actually have lower intrinsic dimensionality for downstream tasks, meaning they are even better candidates for low-rank adaptation. The better the pre-training, the less you need to change.

The LoRA Decomposition

The LoRA paper (Hu et al., 2021) introduces a clean, elegant idea. Instead of updating a pre-trained weight matrix $W_0$ directly (which would require storing and optimizing a full $d \times k$ matrix of updates), we freeze $W_0$ completely and add a low-rank bypass alongside it. The forward pass becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

Every symbol in this equation matters, so let's unpack each one carefully.

$x \in \mathbb{R}^k$ is the input vector to this layer — it could be a token embedding or the output of a previous layer. Its dimension is $k$, matching the input dimension of the weight matrix.

$h \in \mathbb{R}^d$ is the output of the layer after the LoRA modification. In the original model, $h = W_0 x$. With LoRA, the output is $h = W_0 x + \frac{\alpha}{r} BA x$ — the base model's computation plus a low-rank correction term.

$W_0 \in \mathbb{R}^{d \times k}$ is the frozen pre-trained weight matrix, where $d$ is the output dimension and $k$ is the input dimension. "Frozen" means these weights do not receive gradient updates during fine-tuning. They are loaded from the pre-trained checkpoint and left untouched. The term $W_0 x$ is just the original model's computation — exactly what the layer would have done before any fine-tuning.

$A \in \mathbb{R}^{r \times k}$ is the down-projection matrix. It takes the input $x$ (a vector of dimension $k$) and compresses it down to $r$ dimensions. You can think of $A$ as asking: "of the $k$ input features, which $r$ linear combinations are most relevant for the task adaptation?" This is the bottleneck — all the information about the update must flow through just $r$ channels.

$B \in \mathbb{R}^{d \times r}$ is the up-projection matrix. It takes the compressed $r$-dimensional representation and projects it back up to $d$ dimensions, producing an output that can be added to $W_0 x$. Together, $A$ compresses and $B$ expands, forming a bottleneck architecture similar to an autoencoder.

$r$ is the rank — the single most important hyperparameter in LoRA. It controls how expressive the low-rank update can be. Typical values are $r \in \{4, 8, 16, 32, 64\}$. The product $BA \in \mathbb{R}^{d \times k}$ has the same shape as $W_0$, but its rank is at most $r$. Since $r \ll \min(d, k)$, this product can only represent a thin slice of all possible $d \times k$ matrices — exactly the low-dimensional subspace we believe the update lives in.

$\alpha$ is a scaling constant (typically set once and left fixed). The ratio $\alpha / r$ controls the magnitude of the LoRA update relative to the base model. Why divide by $r$? Because without this normalization, doubling $r$ would roughly double the magnitude of the update $BA x$, forcing you to also halve the learning rate to compensate. The $\alpha / r$ scaling decouples capacity (controlled by $r$) from magnitude (controlled by $\alpha$ and the learning rate), so you can sweep over $r$ without retuning the learning rate each time.

💡 A common convention is to set $\alpha = r$ (so $\alpha / r = 1$) or $\alpha = 2r$. When $\alpha = r$, the scaling factor drops out entirely and the update is simply $BA x$. The HuggingFace PEFT library defaults to $\alpha = 8$.

Now let's walk through the boundary cases to build intuition for what $r$ actually controls.

When $r = \min(d, k)$: the product $BA$ can represent any $d \times k$ matrix (it is no longer constrained to be low-rank). In this limit, LoRA is equivalent to full fine-tuning — the adapter has enough capacity to express any possible weight update. Of course, this also means we've gained nothing in parameter efficiency; we'd be training just as many parameters as the original weight matrix.

When $r = 1$: $A$ is a $1 \times k$ row vector and $B$ is a $d \times 1$ column vector. Their product $BA$ is a $d \times k$ matrix of rank exactly 1 — an outer product of two vectors. This is the most compressed possible LoRA adapter: the entire weight update for this layer is defined by just $d + k$ parameters. It can only shift the output in a single direction. This is extremely constrained, but for tasks that only need a minor adaptation (like adjusting bias in a classification head), rank 1 can sometimes suffice.

When $\alpha / r$ is very large: the LoRA update dominates the forward pass. The term $\frac{\alpha}{r} BA x$ overwhelms $W_0 x$, so the model effectively ignores its pre-trained knowledge. This defeats the purpose of LoRA — we want to make a small, targeted adjustment to the base model, not overwrite it.

When $\alpha / r \to 0$: the LoRA update vanishes and the model behaves exactly like the frozen base model. No adaptation happens regardless of how $A$ and $B$ are trained. In practice, $\alpha / r$ is set to a moderate value (often around 1–2) so the adapter contributes meaningfully without dominating.
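The rank boundary cases can be checked directly: whatever $A$ and $B$ contain, the rank of $BA$ never exceeds $r$. A small sketch with arbitrary dimensions:

```python
import torch

torch.manual_seed(0)
d, k = 32, 24

ranks = {}
for r in (1, 4, min(d, k)):
    B = torch.randn(d, r)   # up-projection, d x r
    A = torch.randn(r, k)   # down-projection, r x k
    ranks[r] = torch.linalg.matrix_rank(B @ A).item()
    print(f"r = {r:2d}: BA is {d}x{k}, rank(BA) = {ranks[r]}, params = {r * (d + k)}")
```

At $r = 1$ the product is a rank-1 outer product; at $r = \min(d, k)$ it is (almost surely, for random matrices) full rank, matching the "equivalent to full fine-tuning" limit described above.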

The parameter savings are easy to compute. For a single weight matrix $W_0 \in \mathbb{R}^{d \times k}$, full fine-tuning requires learning $d \times k$ parameters. LoRA replaces this with $A$ ($r \times k$ parameters) plus $B$ ($d \times r$ parameters):

$$\text{LoRA params} = r \times (d + k)$$

Compare this to the $d \times k$ parameters of full fine-tuning. The ratio is:

$$\text{Compression ratio} = \frac{d \times k}{r \times (d + k)}$$

For a concrete example: take a typical attention projection in a large transformer where $d = k = 4096$ and we choose $r = 16$. Full fine-tuning requires $4096 \times 4096 = 16{,}777{,}216$ trainable parameters for this single matrix. LoRA requires $16 \times (4096 + 4096) = 131{,}072$ parameters — that's a $128\times$ reduction. For a model with hundreds of such matrices, the savings are enormous.
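The two formulas translate into a pair of one-line helpers (hypothetical names, for illustration) that reproduce the numbers above:

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted d x k weight matrix."""
    return r * (d + k)

def compression_ratio(d: int, k: int, r: int) -> float:
    """Full fine-tuning params divided by LoRA params for one matrix."""
    return (d * k) / lora_param_count(d, k, r)

# The 4096 x 4096 attention projection from the example, at r = 16
d = k = 4096
print(f"Full fine-tuning: {d * k:,} params")
print(f"LoRA (r=16):      {lora_param_count(d, k, 16):,} params")
print(f"Reduction:        {compression_ratio(d, k, 16):.0f}x")
```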

📌 The parameter-count formula $r \times (d + k)$ is per matrix. When people report "LoRA uses 0.1% of parameters", they mean the total across all adapted layers divided by the total model size. A single layer's ratio depends on $r$ and the layer dimensions.

Initialization and Training

A natural question: how should we initialize $A$ and $B$? The answer reveals one of LoRA's cleverest design decisions.

Matrix $B$ is initialized to zero: every entry is 0. Matrix $A$ is initialized from a Gaussian distribution: $A \sim \mathcal{N}(0, \sigma^2)$ (typically Kaiming initialization). The order matters: because $B = 0$, the product $\Delta W = BA = 0$ regardless of what $A$ contains. At the very start of training, the LoRA update contributes nothing. The model begins exactly where pre-training left off — no random perturbation, no initial shock.

Why does this matter so much? Consider the alternative: if both $A$ and $B$ were initialized randomly, then $\Delta W = BA$ would be a random matrix from the very first forward pass. The model's outputs would be corrupted before training has had a single chance to adjust anything. Loss would spike, and the first several gradient steps would be wasted just recovering from the random initialization noise. By starting from $\Delta W = 0$, LoRA guarantees that the first forward pass produces the same output as the original pre-trained model. Gradients from the very first batch are meaningful, and training is stable from step one.

💡 You might wonder: why zero $B$ and randomize $A$, rather than the reverse? Either choice gives $\Delta W = 0$ at init. The LoRA paper chose $B = 0$ and $A \sim \mathcal{N}(0, \sigma^2)$, but the reverse (or even both Gaussian with a learnable scalar starting at 0) would also work. What matters is that $\Delta W = 0$ at initialization.

During training, only $A$ and $B$ receive gradient updates. The pre-trained weight $W_0$ is frozen: no gradients are computed for it, and no optimizer states (momentum, variance for Adam) are allocated for it. This is where the memory savings come from. Recall from the previous article that full fine-tuning with Adam requires roughly 16 bytes per parameter (4 for the weight, 4 for the gradient, 4 for the first moment, 4 for the second moment). For a 7B model, that's about 112 GB just for the trainable state.

With LoRA, the frozen $W_0$ weights still occupy memory (about 2 bytes per parameter in half-precision), but they need no gradient buffer or optimizer states. Only the tiny $A$ and $B$ matrices need the full 16-byte-per-parameter treatment. If LoRA parameters are 0.1% of the total, the optimizer state drops from ~84 GB (for the 7B model's 12 bytes of gradient + momentum + variance per param) to under 100 MB. The base model weights still require ~14 GB in fp16, but the overall memory footprint is dramatically reduced compared to full fine-tuning.
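These numbers follow from back-of-envelope arithmetic. A sketch, assuming a 7B model, Adam with fp32 training state (16 bytes per trainable parameter), fp16 frozen weights, and adapters at 0.1% of the model:

```python
params = 7e9             # total model parameters
lora_fraction = 0.001    # assumption: adapters are 0.1% of the model

# Full fine-tuning: weight + gradient + two Adam moments, 4 bytes each
full_ft_gb = params * 16 / 1e9
# Gradient + Adam moments alone (12 bytes/param), the part LoRA eliminates
grads_moments_gb = params * 12 / 1e9

# LoRA: frozen fp16 base weights, full training state only for the adapters
base_fp16_gb = params * 2 / 1e9
adapter_state_mb = params * lora_fraction * 12 / 1e6

print(f"Full fine-tuning state:    ~{full_ft_gb:.0f} GB")
print(f"  grads + Adam moments:    ~{grads_moments_gb:.0f} GB")
print(f"LoRA frozen base (fp16):   ~{base_fp16_gb:.0f} GB")
print(f"LoRA grads + Adam moments: ~{adapter_state_mb:.0f} MB")
```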

Let's implement a minimal LoRA layer from scratch to see how this works in code. The class below wraps a frozen nn.Linear layer, adds trainable $A$ and $B$ matrices, and implements the LoRA forward pass:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen base weight and a trainable low-rank adapter."""
    def __init__(self, base_layer: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base_layer = base_layer
        self.r = r
        self.alpha = alpha

        d, k = base_layer.out_features, base_layer.in_features

        # Freeze the base weight — no gradients, no optimizer states
        self.base_layer.weight.requires_grad = False
        if self.base_layer.bias is not None:
            self.base_layer.bias.requires_grad = False

        # A: down-projection (r x k), Gaussian init
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        # B: up-projection (d x r), zero init
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x):
        # Original frozen computation
        base_out = self.base_layer(x)
        # Low-rank bypass: (alpha/r) * x @ A^T @ B^T
        lora_out = (self.alpha / self.r) * (x @ self.A.T @ self.B.T)
        return base_out + lora_out

# Create a base linear layer (e.g., d=512, k=256)
base = nn.Linear(256, 512, bias=False)
lora = LoRALinear(base, r=8, alpha=8.0)

# Count parameters
total = sum(p.numel() for p in lora.parameters())
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
frozen = total - trainable

print(f"Base weight shape: {base.weight.shape}  (d=512, k=256)")
print(f"A shape: {lora.A.shape}  (r=8, k=256)")
print(f"B shape: {lora.B.shape}  (d=512, r=8)")
print(f"\nTotal parameters:     {total:,}")
print(f"Frozen (base W0):     {frozen:,}")
print(f"Trainable (A + B):    {trainable:,}")
print(f"Compression ratio:    {frozen / trainable:.1f}x")

# Verify that delta W = BA = 0 at initialization
delta_W = (lora.alpha / lora.r) * lora.B @ lora.A
print(f"\nDelta W at init (should be all zeros): max|BA| = {delta_W.abs().max().item():.6f}")

The output confirms that only $A$ and $B$ are trainable (6,144 parameters total), the base weight is frozen (131,072 parameters), and $\Delta W = BA = 0$ at initialization. In a real training loop, loss.backward() would compute gradients only for $A$ and $B$, and the optimizer would only allocate momentum and variance buffers for those 6,144 parameters.

Which Layers to Adapt?

LoRA can be applied to any linear layer in a transformer, but in practice, which layers should we target? Not all layers contribute equally to task adaptation, and targeting more layers means more trainable parameters, so there's a meaningful trade-off to consider.

The original LoRA paper experimented primarily with the attention projection matrices in each transformer block. A standard multi-head attention layer has four linear projections: $W_Q$ (queries), $W_K$ (keys), $W_V$ (values), and $W_O$ (output projection). Hu et al. found that adapting just $W_Q$ and $W_V$ was sufficient to match full fine-tuning performance on many NLP benchmarks, while leaving $W_K$ and $W_O$ frozen.

However, subsequent work and empirical practice have moved toward a broader set of targets. The HuggingFace PEFT library, for instance, commonly targets all linear layers in the attention block ($W_Q$, $W_K$, $W_V$, $W_O$) as well as the MLP layers (the feed-forward sub-layers, often called gate_proj, up_proj, and down_proj in LLaMA-style architectures). The general rule of thumb that has emerged:

💡 Targeting all linear layers with a lower rank (e.g., $r = 8$ across every layer) often outperforms targeting fewer layers with a higher rank (e.g., $r = 64$ on just $W_Q$ and $W_V$), even when both configurations use roughly the same total number of trainable parameters. Spreading the adapter budget across more layers lets the model make small adjustments everywhere, rather than large adjustments in a few places.

To make this concrete, consider a single transformer block in a LLaMA-7B-style model where $d_{\text{model}} = 4096$ and $d_{\text{ffn}} = 11008$. The following breakdown shows how the trainable-parameter count grows as you add more target modules:

  • $W_Q, W_V$ only (original LoRA, $r = 16$): $2 \times 16 \times (4096 + 4096) = 262{,}144$ params per block
  • All attention ($W_Q, W_K, W_V, W_O$, $r = 16$): $4 \times 16 \times (4096 + 4096) = 524{,}288$ params per block
  • All attention + MLP ($r = 16$): attention contributes 524,288 and MLP adds $3 \times 16 \times (4096 + 11008) = 724{,}992$, totaling $1{,}249{,}280$ params per block

For a 32-layer model, the "all linear" configuration at $r = 16$ gives about 40M trainable parameters — still less than 0.6% of the 7B total. In practice, the choice often comes down to your compute budget and the complexity of the task: simple classification tasks may only need $W_Q$ and $W_V$ adaptation, while instruction following or code generation typically benefits from adapting all linear layers.
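The per-block arithmetic above is easy to sanity-check in a few lines (dimensions as given for the LLaMA-7B-style block; `lora_params` is a hypothetical helper):

```python
d_model, d_ffn, r, n_layers = 4096, 11008, 16, 32

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted d x k matrix."""
    return r * (d + k)

# Attention projections are d_model x d_model; the LLaMA-style MLP has three
# matrices connecting d_model and d_ffn (gate_proj, up_proj, down_proj)
qv_only  = 2 * lora_params(d_model, d_model, r)
attn_all = 4 * lora_params(d_model, d_model, r)
mlp      = 3 * lora_params(d_ffn, d_model, r)
per_block = attn_all + mlp

print(f"W_Q, W_V only:     {qv_only:,} per block")
print(f"All attention:     {attn_all:,} per block")
print(f"Attention + MLP:   {per_block:,} per block")
print(f"Whole model (x{n_layers}): {per_block * n_layers:,} trainable params")
```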

Once you've chosen which layers to adapt and trained your LoRA adapters, the next question is what happens at inference time. Does the extra low-rank bypass slow things down? The answer is one of LoRA's most compelling features.

Merging: Inference at No Extra Cost

One of the most elegant properties of LoRA is what happens at inference time. During training, the forward pass computes $W_0 x + \frac{\alpha}{r} BA x$ — two matrix multiplications instead of one, which adds some latency. But once training is finished, we can merge the adapter directly into the base weights:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} B A$$

After merging, the model is a single weight matrix $W_{\text{merged}}$ with exactly the same shape as $W_0$. The forward pass reverts to a single matrix multiplication: $h = W_{\text{merged}} \, x$. There is zero additional inference latency — no adapter overhead, no extra computation, no separate adapter pathway. The deployed model is architecturally identical to the original base model. You could hand someone the merged weights and they would have no way to tell that LoRA was used during training.
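The merge identity is a one-line consequence of distributivity, and easy to verify numerically. A minimal sketch with arbitrary small dimensions and a pretend-trained (nonzero) $B$:

```python
import torch

torch.manual_seed(0)
d, k, r, alpha = 512, 256, 8, 16.0

W0 = torch.randn(d, k)          # frozen base weight
A = torch.randn(r, k) * 0.01    # adapter matrices, as if after training
B = torch.randn(d, r)
x = torch.randn(k)

# Training-time forward: base path plus low-rank bypass (two extra matmuls)
h_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))

# Merged forward: fold the adapter into the base weight, then one matmul
W_merged = W0 + (alpha / r) * B @ A
h_merged = W_merged @ x

print(f"Outputs match: {torch.allclose(h_adapter, h_merged, atol=1e-3)}")
```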

This is a significant advantage over other parameter-efficient methods like adapters (which insert extra layers that add latency) or prompt tuning (which prepends extra tokens that increase sequence length). LoRA's merge property means the efficiency gains are purely at training time, with no cost at inference.

But merging isn't always what you want. In a multi-task serving scenario, you might have one base model and dozens of LoRA adapters — one fine-tuned for customer-support tone, another for legal document summarization, another for code generation. Instead of deploying a separate merged model for each task (which would multiply your GPU memory by the number of tasks), you can keep a single copy of the base model in memory and swap LoRA adapters per request. Loading an adapter means loading just the small $A$ and $B$ matrices, which at $r = 16$ for all layers might only be 20–40 MB — trivial compared to the multi-gigabyte base model.

💡 Frameworks like vLLM and LoRAX can serve hundreds of LoRA adapters on a single base model, dynamically routing each request to the appropriate adapter. This makes LoRA not just a training efficiency trick, but a deployment architecture.

The code below demonstrates merging with HuggingFace PEFT. After training, a single call to merge_and_unload() folds the adapters into the base weights and returns a standard model:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model + trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
peft_model = PeftModel.from_pretrained(base_model, "path/to/lora-adapter")

# Merge adapter into base weights: W_merged = W0 + (alpha/r) * B @ A
merged_model = peft_model.merge_and_unload()

# merged_model is now a standard transformers model — no adapter overhead
# Save it as a regular model checkpoint
merged_model.save_pretrained("path/to/merged-model")

# Inference is identical to the base model: single matrix multiply per layer
outputs = merged_model.generate(input_ids, max_new_tokens=100)

After merge_and_unload(), the PEFT wrapper is removed entirely. The returned model is a plain transformers model with merged weights, ready for deployment with no PEFT dependency required at serving time.

LoRA in Practice with HuggingFace PEFT

Now let's put it all together with a realistic configuration using the HuggingFace PEFT library (Mangrulkar et al., 2022). PEFT handles all the plumbing we built from scratch above: it freezes the base model, injects LoRA matrices into the specified layers, and makes sure only the adapter parameters are updated during training. The key configuration object is LoraConfig :

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# 1. Load a pre-trained model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank — capacity of the adapter
    lora_alpha=32,                 # Scaling: alpha/r = 32/16 = 2.0
    target_modules=[               # Which layers get LoRA adapters
        "q_proj", "k_proj",
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,             # Dropout on the LoRA path for regularization
    bias="none",                   # Don't train biases
    task_type=TaskType.CAUSAL_LM,  # Tells PEFT this is autoregressive LM
)

# 3. Wrap the model with LoRA adapters
peft_model = get_peft_model(model, lora_config)

# 4. Inspect trainable parameters
peft_model.print_trainable_parameters()
# => trainable params: 39,976,960 || all params: 6,778,957,824 || trainable%: 0.5897

Let's walk through each parameter in LoraConfig:

  • r=16: the rank. Higher = more capacity = more trainable parameters. Start with 8 or 16 for most tasks; go up to 64 if you see underfitting.
  • lora_alpha=32: the numerator of the $\alpha / r$ scaling factor. With $r = 16$, this gives $\alpha / r = 2.0$. A common recipe is $\alpha = 2r$.
  • target_modules=[...]: which layers receive LoRA adapters. The list above targets all linear layers in a LLaMA block. You can also pass "all-linear" as a shorthand in recent PEFT versions.
  • lora_dropout=0.05: applies dropout to the LoRA path during training. This acts as regularization, preventing the low-rank adapter from overfitting to small datasets. Set to 0 if you have abundant data.
  • bias="none": controls whether bias terms are also trained. "none" means all biases stay frozen. You can set "all" or "lora_only" to additionally train biases.

Once the model is wrapped, training works exactly like standard PyTorch training. The optimizer only sees the LoRA parameters (since everything else has requires_grad=False ), so memory usage is drastically reduced. You can pass the wrapped model directly to HuggingFace's Trainer or SFTTrainer from the trl library:

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,            # LoRA often uses a higher LR than full fine-tuning
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

# Save only the adapter weights (tiny checkpoint: ~80 MB vs ~14 GB for full model)
peft_model.save_pretrained("./lora-adapter")

Notice the learning rate: 2e-4 is significantly higher than what you'd use for full fine-tuning (typically 1e-5 to 5e-5). LoRA adapters benefit from higher learning rates because the low-rank parameterization constrains the update space, making it harder to overfit even with aggressive steps. And when you save the adapter with save_pretrained(), only the $A$ and $B$ matrices are saved — typically 20–80 MB compared to the 14 GB base model. You can share adapters on the HuggingFace Hub and anyone with the base model can apply them.
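The claim that the optimizer only sees LoRA parameters can be checked in plain PyTorch. The sketch below is a stand-in, not actual PEFT: a frozen "base" layer plus a small trainable layer, with the optimizer given only the trainable parameters:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a PEFT-wrapped model: a frozen "base" layer plus a trainable one
model = nn.Sequential(
    nn.Linear(256, 512, bias=False),   # frozen, like W0
    nn.Linear(512, 10, bias=False),    # trainable, like a LoRA adapter
)
model[0].weight.requires_grad = False

# The optimizer is handed only the trainable parameters
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4)

# One training step: gradients and Adam state exist only for trainable params
loss = model(torch.randn(4, 256)).sum()
loss.backward()
optimizer.step()

print(f"Frozen weight got a gradient: {model[0].weight.grad is not None}")
print(f"Params with Adam state: {len(optimizer.state)} of {sum(1 for _ in model.parameters())}")
```

After the step, the frozen weight has no gradient buffer and no entry in the optimizer's state dict — exactly the mechanism that shrinks LoRA's training memory.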

📌 LoRA adapters are tied to the specific base model they were trained on. A LoRA adapter trained on LLaMA-2-7B cannot be applied to Mistral-7B, even though the two models are similar in architecture and size. The adapter captures a delta relative to specific base weights, so the base must match exactly.

Quiz

Test your understanding of LoRA mechanics and design choices.

Why is matrix $B$ initialized to zero rather than randomly?

For a weight matrix $W_0 \in \mathbb{R}^{4096 \times 4096}$ with LoRA rank $r = 8$, how many trainable parameters does the LoRA adapter add?

What does the scaling factor $\alpha / r$ in the LoRA forward pass accomplish?

Why does LoRA add zero inference latency after merging?