What If You Could Combine Fine-Tuned Models Without Retraining?

Imagine you've fine-tuned one LLaMA model to write Python code and another to answer medical questions. Each model is excellent at its specialty, but you want a single model that handles both. The obvious approach is to train a third model on a combined dataset, but that means collecting and curating all the data again, running another expensive training job, and hoping the model doesn't lose one skill while learning the other. What if you could just combine the weights of the two finished models and get a single model that inherits both capabilities?

That is exactly what model merging does: it takes the parameter tensors of two or more fine-tuned models and combines them into one set of weights, with no additional training, no gradient computation, and no GPU time beyond loading and saving the checkpoints. The merged model is ready for inference immediately.

This sounds like it shouldn't work. Neural network loss landscapes are famously non-convex, so averaging two random models should land you on a high-loss ridge. But there is a key constraint that makes merging viable: all models being merged must share the same architecture and, ideally, the same base model. When two models are fine-tuned from the same pre-trained checkpoint, they start at the same point in weight space and move relatively short distances during fine-tuning. The loss landscape between them turns out to be surprisingly smooth, forming a low-loss "valley" rather than a high-loss ridge. Interpolating between them stays inside that valley.

This observation was formalised in the Model Soups paper (Wortsman et al., 2022), which showed that averaging the weights of multiple fine-tuned variants of the same base model consistently improves accuracy on held-out data, without increasing inference time or model size. The intuition is that each fine-tuned model overfits slightly to different patterns in its training data, and averaging washes out the idiosyncratic noise while preserving the shared, generalizable signal, analogous to how ensembling predictions works but applied to weights instead.

💡 Model merging has become especially popular in the open-source community. Platforms like Hugging Face host hundreds of merged LLaMA variants, and some of the top models on community leaderboards are merges, not independently trained models. The appeal is clear: you get ensemble-like benefits at the cost of a single model at inference time.

Linear Interpolation (LERP)

The simplest way to merge two models is a weighted average of their parameters. Given model A with weights $\theta_A$ and model B with weights $\theta_B$, we compute:

$$\theta_{\text{merged}} = (1 - \lambda) \cdot \theta_A + \lambda \cdot \theta_B$$

Here $\theta_A$ and $\theta_B$ represent the full set of parameters (every weight in every layer) of models A and B respectively, and $\lambda \in [0, 1]$ is a blending coefficient that controls how much each model contributes. The operation is applied element-wise: for every single number in every weight matrix and bias vector, we take the weighted average of the corresponding numbers from A and B.

Let's verify the boundary cases. When $\lambda = 0$, we get $(1 - 0) \cdot \theta_A + 0 \cdot \theta_B = \theta_A$, so the merged model is exactly model A. When $\lambda = 1$, we get $0 \cdot \theta_A + 1 \cdot \theta_B = \theta_B$, so we recover model B exactly. At $\lambda = 0.5$, we get an equal average of both models. As $\lambda$ increases from 0 to 1, we smoothly slide along a straight line in weight space from A to B.

Why does this work when both models share a base? Fine-tuning from the same pre-trained checkpoint means both models start at the same point and move in similar (often nearly parallel) directions. The straight line between them stays in a low-loss region. But if A and B were trained from different random initialisations (different base models), the interpolation path may cross a high-loss ridge, because the two models live in entirely different basins of the loss landscape, and the merged weights end up in a region neither model explored during training.

LERP extends naturally to $K$ models with a set of non-negative weights that sum to one:

$$\theta_{\text{merged}} = \sum_{i=1}^{K} w_i \cdot \theta_i, \quad \text{where } \sum_{i=1}^{K} w_i = 1$$

Each $w_i$ controls how much model $i$ contributes. Equal weighting ($w_i = 1/K$ for all $i$) is the simplest choice and often a surprisingly strong baseline, because it equally washes out each model's idiosyncratic overfitting.
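The $K$-model average is a one-liner on raw weight lists. Here is a minimal sketch, assuming three made-up five-element "models" standing in for real checkpoints:

```python
# Uniform "model soup": element-wise weighted average of K fine-tuned variants.
# These toy weight lists are invented stand-ins for full model checkpoints.
models = [
    [1.5, 2.3, 2.8, 4.6, 5.1],
    [0.8, 1.7, 3.5, 3.9, 5.8],
    [1.2, 2.1, 3.1, 4.2, 5.3],
]
K = len(models)
weights = [1 / K] * K                      # equal weighting: w_i = 1/K

soup = [sum(w * theta[j] for w, theta in zip(weights, models))
        for j in range(len(models[0]))]

print("soup:", [f"{x:.3f}" for x in soup])
```

Unequal contributions just mean a different `weights` list, as long as the entries stay non-negative and sum to one; each merged value always lands inside the range spanned by the corresponding weights of the input models.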

A more principled approach is task arithmetic (Ilharco et al., 2023). Instead of averaging the full weights, we first extract the task vector for each model: the change from the base model caused by fine-tuning.

$$\tau_A = \theta_A - \theta_{\text{base}}$$

The task vector $\tau_A$ captures exactly what fine-tuning changed. It's a vector in weight space with the same dimensionality as $\theta_A$, where each element records how much that particular weight moved during training. We then build the merged model by starting from the base and adding scaled task vectors:

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda_A \cdot \tau_A + \lambda_B \cdot \tau_B$$

Here $\lambda_A$ and $\lambda_B$ are independent scaling coefficients (they don't need to sum to 1). This formulation has a powerful advantage over plain averaging: you can also subtract task vectors to remove capabilities. Setting $\lambda_A$ to a negative value pushes the model away from task A behavior. For example, if $\tau_{\text{toxic}}$ captures the direction in weight space that makes a model produce toxic outputs, then $\theta_{\text{base}} - \lambda \cdot \tau_{\text{toxic}}$ yields a model that is steered away from that behavior. The larger the magnitude of $\lambda$, the stronger the repulsion. This is a remarkably cheap form of model editing: no retraining, no RLHF, just vector arithmetic on the weights.

💡 Task arithmetic only works well when the task vectors are relatively small compared to the base weights (which is typically the case with fine-tuning). If the models diverge too far from the base, the linear approximation breaks down and merging quality degrades.

The code below demonstrates both LERP and task arithmetic on small weight arrays, showing how the merged values sit between the originals and how subtraction pushes weights in the opposite direction.

import json, js

# Simulate small "models" as weight arrays
base   = [1.0, 2.0, 3.0, 4.0, 5.0]
modelA = [1.5, 2.3, 2.8, 4.6, 5.1]   # fine-tuned for task A
modelB = [0.8, 1.7, 3.5, 3.9, 5.8]   # fine-tuned for task B

# ---- LERP: weighted average ----
lam = 0.5
lerp = [(1 - lam) * a + lam * b for a, b in zip(modelA, modelB)]

# ---- Task arithmetic ----
tauA = [a - b for a, b in zip(modelA, base)]  # task vector A
tauB = [a - b for a, b in zip(modelB, base)]  # task vector B

# Add both task vectors (lambda=0.5 each)
merged_add = [b + 0.5 * ta + 0.5 * tb for b, ta, tb in zip(base, tauA, tauB)]

# Subtract task vector A (remove task A skill)
merged_sub = [b - 0.5 * ta for b, ta in zip(base, tauA)]

rows = []
for i in range(len(base)):
    rows.append([
        f"w[{i}]",
        f"{base[i]:.2f}",
        f"{modelA[i]:.2f}",
        f"{modelB[i]:.2f}",
        f"{tauA[i]:+.2f}",
        f"{tauB[i]:+.2f}",
        f"{lerp[i]:.2f}",
        f"{merged_add[i]:.2f}",
        f"{merged_sub[i]:.2f}"
    ])

js.window.py_table_data = json.dumps({
    "headers": ["Param", "Base", "Model A", "Model B",
                "tau_A", "tau_B", "LERP(0.5)", "Add(0.5,0.5)", "Sub(-0.5*A)"],
    "rows": rows
})

print("LERP: weighted average of A and B (lambda=0.5)")
print("Add:  base + 0.5*tauA + 0.5*tauB")
print("Sub:  base - 0.5*tauA (pushes AWAY from task A)")
print()
print("Notice: LERP and Add give similar but not identical results.")
print("Sub pushes w[0] from 1.0 to 0.75 (opposite direction from A's 1.5).")

SLERP: Interpolation on a Hypersphere

Linear interpolation has a subtle problem: it doesn't preserve the magnitude (norm) of the weight vectors. When two vectors point in different directions, the straight line between them cuts through the interior of the hypersphere, meaning the interpolated vector is shorter than either endpoint. In high-dimensional weight space, this can cause a "dip" in the effective scale of the weights at the midpoint, potentially degrading model quality.

Spherical Linear Interpolation (SLERP) solves this by interpolating along the surface of the hypersphere, following the shortest arc (geodesic) between the two vectors rather than the straight chord. The formula is:

$$\theta_{\text{merged}} = \frac{\sin((1-\lambda)\,\Omega)}{\sin \Omega} \cdot \theta_A + \frac{\sin(\lambda\,\Omega)}{\sin \Omega} \cdot \theta_B$$

where $\Omega$ is the angle between the two weight vectors:

$$\Omega = \arccos\!\left(\frac{\theta_A \cdot \theta_B}{\|\theta_A\|\;\|\theta_B\|}\right)$$

Let's break down every piece. $\theta_A \cdot \theta_B$ is the dot product of the two weight vectors, $\|\theta_A\|$ and $\|\theta_B\|$ are their Euclidean norms (lengths), and the ratio is the cosine of the angle between them. The $\arccos$ converts this cosine into the actual angle $\Omega$ in radians. The two $\sin$ terms in the interpolation formula are carefully constructed so that the result follows the great-circle arc on the sphere's surface rather than cutting through its interior.

Check the boundaries. When $\lambda = 0$: the first coefficient becomes $\sin(\Omega)/\sin(\Omega) = 1$ and the second becomes $\sin(0)/\sin(\Omega) = 0$, so we get $\theta_A$ exactly. When $\lambda = 1$: the first coefficient is $\sin(0)/\sin(\Omega) = 0$ and the second is $\sin(\Omega)/\sin(\Omega) = 1$, so we get $\theta_B$ exactly, same endpoints as LERP. When $\Omega \to 0$ (the models are nearly identical, so the angle between their weight vectors is tiny), $\sin(x) \approx x$ for small $x$, and the SLERP formula reduces to $(1-\lambda) \cdot \theta_A + \lambda \cdot \theta_B$, which is exactly LERP. So SLERP gracefully degrades to LERP when the two models are close, and only differs meaningfully when they diverge in direction.

When $\Omega = \pi/2$ (vectors are orthogonal), SLERP gives a "balanced" arc path that maintains the full magnitude at every point along the interpolation, whereas LERP at the midpoint would produce a vector with magnitude $\|\theta_A\|/\sqrt{2}$, roughly a 30% reduction. For large language models where the scale of weight matrices affects layer activations, this magnitude preservation matters.

In practice, SLERP is not applied to the entire model as a single flattened vector. Instead, it is applied per-layer or per-tensor: each weight matrix in the model is treated as its own vector and interpolated independently. This respects the fact that different layers may have different scales and different angular separations between the two models.

The code below computes both LERP and SLERP on simple 2D vectors so we can see how the arc path differs from the straight line. Notice how SLERP maintains a constant distance from the origin (constant norm) at every interpolation step, while LERP dips inward at the midpoint.

import math, json, js

def norm(v):
    return math.sqrt(sum(x*x for x in v))

def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def slerp(a, b, t):
    cos_omega = dot(a, b) / (norm(a) * norm(b))
    cos_omega = max(-1.0, min(1.0, cos_omega))  # numerical clamp
    omega = math.acos(cos_omega)
    if abs(omega) < 1e-8:  # nearly identical => fall back to LERP
        return [(1-t)*x + t*y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    w1 = math.sin((1-t) * omega) / sin_omega
    w2 = math.sin(t * omega) / sin_omega
    return [w1*x + w2*y for x, y in zip(a, b)]

def lerp(a, b, t):
    return [(1-t)*x + t*y for x, y in zip(a, b)]

# Two 2D vectors at ~60 degrees apart
vecA = [3.0, 1.0]
vecB = [1.0, 3.0]

steps = [0.0, 0.25, 0.5, 0.75, 1.0]
rows = []
for t in steps:
    sl = slerp(vecA, vecB, t)
    lr = lerp(vecA, vecB, t)
    rows.append([
        f"{t:.2f}",
        f"({lr[0]:.3f}, {lr[1]:.3f})",
        f"{norm(lr):.3f}",
        f"({sl[0]:.3f}, {sl[1]:.3f})",
        f"{norm(sl):.3f}"
    ])

angle = math.degrees(math.acos(dot(vecA, vecB) / (norm(vecA) * norm(vecB))))
js.window.py_table_data = json.dumps({
    "headers": ["lambda", "LERP result", "LERP norm", "SLERP result", "SLERP norm"],
    "rows": rows
})

print(f"Vector A: {vecA}, norm = {norm(vecA):.3f}")
print(f"Vector B: {vecB}, norm = {norm(vecB):.3f}")
print(f"Angle between them: {angle:.1f} degrees")
print()
print("Both norms are {0:.3f} at endpoints.".format(norm(vecA)))
print("At lambda=0.5: LERP norm dips, SLERP norm stays constant.")
💡 SLERP can only merge exactly two models at a time (it interpolates between two points on a sphere). To merge three or more models, you either chain pairwise SLERP operations (merge A and B first, then merge the result with C) or switch to a different method like TIES or DARE that natively supports multiple models.

TIES-Merging: Handling Interference

When merging three or more models, a new problem emerges: interference. Model A might want a particular weight to increase (positive task vector), while model B wants the same weight to decrease (negative task vector). Averaging them cancels out both changes, and the merged model ends up at the base value, as if neither model had learned anything for that parameter. With many models, this destructive interference can silently erase the most important updates.

TIES-Merging (Yadav et al., 2023) addresses this directly. The name stands for Trim, Elect Sign, and Merge, and each step targets a specific failure mode of naive averaging.

Step 1: Trim. Most parameters change only slightly during fine-tuning. These small-magnitude updates are more noise than signal, and including them in the merge adds variance without meaningful information. TIES removes all task-vector entries whose absolute value falls below a threshold $k$, keeping only the top fraction by magnitude. Typically, practitioners keep the top 20% (i.e., $k$ is set to the 80th percentile of absolute values). At the boundaries: if $k = 0$, nothing is trimmed and we get a plain average. If $k$ is so large that everything is trimmed, we recover the base model unchanged.

Step 2: Elect Sign. After trimming, for each remaining parameter position, we look at the signs of the surviving task-vector entries across all models and take a majority vote. If two out of three models want weight $j$ to increase (positive sign) and one wants it to decrease, the elected sign for position $j$ is positive. This resolves the interference: instead of averaging conflicting directions and getting zero, we commit to the direction that most models agree on.

Step 3: Merge. Finally, for each parameter, we average only the task-vector values that agree with the elected sign. Any model whose task vector disagrees at that position is excluded from the average for that parameter. The result is a merged task vector that preserves the intentional, consensus-driven updates and discards the conflicting noise.

The final merged model is constructed by adding this cleaned task vector back to the base:

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \cdot \tau_{\text{TIES}}$$

where $\tau_{\text{TIES}}$ is the merged task vector after trimming, sign election, and selective averaging, and $\lambda$ is a global scaling factor (typically between 0.5 and 1.0) that controls how strongly the merged updates are applied.

Why does this three-step procedure work better than simple averaging? Trimming removes the vast majority of parameters (80% are zeroed out), which eliminates the noise that would dilute the merge. Sign election resolves the directional conflicts that would otherwise cancel out meaningful updates. And selective averaging ensures that the surviving values all point in the same direction, so they reinforce rather than interfere with each other. Each step addresses a specific failure mode, and together they produce significantly cleaner merges, especially as the number of models increases beyond two.
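The three steps can be sketched on toy task vectors. This is a minimal illustration, not the per-tensor production algorithm: the arrays are invented, and the trim keeps the top 60% (rather than the usual 20%) so the tiny vectors survive.

```python
# Toy TIES-Merging sketch: Trim, Elect Sign, Merge (illustrative values only).
taus = [
    [ 0.50, -0.02,  0.30, -0.40,  0.01],   # task vector from model 1
    [ 0.45,  0.03, -0.35, -0.38,  0.02],   # task vector from model 2
    [-0.40,  0.01,  0.32, -0.42, -0.01],   # task vector from model 3
]

# Step 1: Trim -- zero out the smallest-magnitude entries of each vector.
def trim(tau, keep=0.6):  # keep the top 60% by magnitude in this toy example
    threshold = sorted(abs(x) for x in tau)[int(len(tau) * (1 - keep))]
    return [x if abs(x) >= threshold else 0.0 for x in tau]

trimmed = [trim(t) for t in taus]

# Step 2: Elect Sign -- per position, pick the sign with more total mass
# (the sign of the sum equals the sign of the heavier side).
elected = [1 if sum(t[j] for t in trimmed) >= 0 else -1
           for j in range(len(taus[0]))]

# Step 3: Merge -- average only the entries agreeing with the elected sign.
merged = []
for j, s in enumerate(elected):
    vals = [t[j] for t in trimmed if t[j] * s > 0]
    merged.append(sum(vals) / len(vals) if vals else 0.0)

print("trimmed:", trimmed)
print("elected:", elected)
print("merged: ", [f"{x:+.3f}" for x in merged])
```

Notice position 2: models 1 and 3 want the weight up while model 2 wants it down, so naive averaging would nearly cancel; sign election commits to the majority direction and averages only the agreeing values.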

💡 The 80% trim ratio may seem aggressive, but recall from article 3 (LoRA) that fine-tuning updates have very low intrinsic dimensionality. Most individual parameter changes are negligible. Trimming them doesn't lose meaningful information; it removes the noise floor that would otherwise dilute the merge.

DARE: Random Pruning Before Merging

TIES uses a deterministic threshold to prune small updates. DARE (Drop And REscale) takes a radically different approach: instead of trimming by magnitude, it randomly drops a large fraction of task-vector entries and rescales the survivors to compensate. The paper (Yu et al., 2024) has one of the more memorable titles in the field: "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch."

The procedure for each model's task vector $\tau_i$ is:

$$\tilde{\tau}_i = \frac{1}{1-p} \cdot (m \odot \tau_i)$$

Let's break this down. $m$ is a binary mask of the same shape as $\tau_i$, where each entry is independently drawn: 1 with probability $(1-p)$ and 0 with probability $p$. The operator $\odot$ denotes element-wise (Hadamard) multiplication, so $m \odot \tau_i$ zeros out a random fraction $p$ of the task vector's entries. The factor $\frac{1}{1-p}$ rescales the surviving entries upward so that the expected sum of the pruned vector equals the sum of the original. If we drop 90% of entries ($p = 0.9$), the remaining 10% are each multiplied by $\frac{1}{1-0.9} = 10$.

This is conceptually identical to dropout, but applied to weight updates rather than activations. During training, dropout randomly silences neurons to prevent co-adaptation. DARE randomly silences weight changes to reduce interference between models during merging.

The boundary cases are instructive. When $p = 0$ (no dropping), the mask $m$ is all ones, the rescaling factor is $\frac{1}{1-0} = 1$, and we recover the original task vector exactly, which gives us a standard merge. When $p = 0.9$, we drop 90% of all fine-tuning updates and inflate the remaining 10% by a factor of 10. This sounds extreme, but the paper shows it works remarkably well. The reason connects back to the intrinsic dimensionality argument from the LoRA article (article 3 of this track): the meaningful component of fine-tuning lives in a very low-dimensional subspace. Roughly 90% of individual parameter updates are redundant, so randomly dropping 90% of them and rescaling the rest approximately preserves the net effect on model behaviour.
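Drop-and-rescale can be simulated in a few lines on an invented toy task vector. A single draw zeros most entries and inflates the survivors tenfold; averaging many draws shows the original vector is recovered in expectation:

```python
import random

random.seed(0)

p = 0.9                               # drop probability from the DARE paper
tau = [0.5, -0.2, 0.3, -0.4, 0.1]     # toy task vector (invented values)

def dare(tau, p, rng):
    # Drop each entry with probability p; rescale survivors by 1/(1-p).
    return [x / (1 - p) if rng.random() > p else 0.0 for x in tau]

one = dare(tau, p, random)
print("one DARE draw:", one)          # mostly zeros, survivors scaled ~10x

# Averaged over many draws, the pruned vector matches tau in expectation:
# E[m_j * tau_j / (1-p)] = (1-p) * tau_j / (1-p) = tau_j.
trials = 20_000
acc = [0.0] * len(tau)
for _ in range(trials):
    for j, v in enumerate(dare(tau, p, random)):
        acc[j] += v
mean = [a / trials for a in acc]
print("mean over draws:", [f"{m:+.3f}" for m in mean])
print("original tau:   ", tau)
```

Any single draw is a noisy, extremely sparse stand-in for the task vector; the bet DARE makes is that when many such sparse vectors from different models are summed, the noise averages out while the interference between models drops.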

After applying DARE to each model's task vector, the pruned-and-rescaled vectors $\tilde{\tau}_i$ are merged, often using TIES for the combination step. This gives us the best of both approaches: DARE reduces the density of updates (fewer non-zero entries means fewer chances for interference), and TIES resolves whatever conflicts remain. The merged model is then:

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_{i=1}^{K} \tilde{\tau}_i$$

where the sum runs over all $K$ models and $\lambda$ is a global scaling coefficient.

💡 The DARE paper reports that $p = 0.9$ (dropping 90%) is optimal across a range of tasks and model sizes. This is a striking empirical result: nine out of every ten parameter changes from fine-tuning can be discarded at random without meaningfully hurting the merged model's performance.

Merging in Practice with mergekit

All of the methods above, LERP, SLERP, task arithmetic, TIES, and DARE, are implemented in mergekit (Charles Goddard et al., 2024), the standard open-source toolkit for model merging. It takes a YAML configuration file that specifies which models to merge, which method to use, and what parameters to apply, then produces a merged checkpoint you can load directly into Hugging Face Transformers.

Here is a SLERP configuration for merging two models. SLERP is the recommended default for two-model merges because it preserves weight magnitudes:

# slerp_merge.yml
merge_method: slerp
slices:
  - sources:
      - model: model_A_path    # e.g. "username/llama-code"
        layer_range: [0, 32]
      - model: model_B_path    # e.g. "username/llama-medical"
        layer_range: [0, 32]
parameters:
  t:                           # interpolation parameter (lambda)
    - filter: self_attn        # attention layers: favour model A
      value: 0.4
    - filter: mlp              # MLP layers: favour model B
      value: 0.6
    - value: 0.5               # everything else: equal blend
base_model: model_A_path
dtype: float16

Note that mergekit lets you set different interpolation weights per layer type. This is powerful because different layers may carry different specialisations. Attention layers might encode more of one model's reasoning patterns while MLP layers store more factual knowledge, so weighting them differently can produce better merges than a uniform $\lambda$.

For merging three or more models, TIES with DARE is a strong default. Here is an example configuration:

# dare_ties_merge.yml
merge_method: dare_ties
base_model: base_model_path    # e.g. "meta-llama/Llama-2-7b-hf"
models:
  - model: model_code_path     # code fine-tune
    parameters:
      weight: 0.4              # contribution weight
      density: 0.1             # DARE: keep 10% of updates (p=0.9)
  - model: model_medical_path  # medical fine-tune
    parameters:
      weight: 0.3
      density: 0.1
  - model: model_chat_path     # chat fine-tune
    parameters:
      weight: 0.3
      density: 0.1
parameters:
  normalize: true              # normalise weights to sum to 1
dtype: float16

The density parameter controls DARE's drop rate: density: 0.1 means keep 10% of updates (drop 90%), matching the $p = 0.9$ recommendation from the DARE paper. The weight parameter sets each model's contribution to the final merge.

Running the merge is a single command:

mergekit-yaml dare_ties_merge.yml ./merged_model --cuda

The merged checkpoint appears in ./merged_model, ready to load with AutoModelForCausalLM.from_pretrained("./merged_model").

Here are practical tips distilled from the community's experience with model merging:

  • Always evaluate merged models. Merging can degrade quality in subtle ways. Run your standard benchmarks and a few manual test prompts before deploying. A merge that looks good on paper might hallucinate more or lose instruction-following ability.
  • SLERP for two models, TIES or DARE for three or more. SLERP's magnitude preservation makes it the best default for pairwise merges. For multi-model merges, TIES and DARE handle interference that LERP cannot.
  • Start with equal weights, then tune. Equal weighting is a surprisingly strong baseline. If one model is clearly stronger, increase its weight incrementally and evaluate after each change.
  • Same base model is critical. Merging models fine-tuned from different base architectures or even different training runs of the same architecture almost always fails. The shared initialisation is what makes the loss landscape smooth enough for interpolation to work.
  • Check the community. The Hugging Face Open LLM Leaderboard features many merged models, and practitioners share their merge configurations openly. Studying successful merges is one of the fastest ways to develop intuition for what works.

Model merging completes our survey of the fine-tuning toolbox. Together with full fine-tuning (article 2), parameter-efficient methods like LoRA and QLoRA (articles 3-5), data preparation strategies (article 6), and evaluation methods (article 9), you now have a comprehensive toolkit for adapting foundation models to any task. Merging adds a uniquely powerful capability: the ability to combine multiple specialised models into one, for free, without any additional training. In a field where GPU hours are expensive and fine-tuned models are abundant, that is a remarkably practical superpower.

Quiz

Test your understanding of model merging methods.

Why does linear interpolation (LERP) between two fine-tuned models work well when they share the same base model?

What advantage does SLERP have over LERP for model merging?

In TIES-Merging, what is the purpose of the "Elect Sign" step?

DARE randomly drops 90% of fine-tuning updates and rescales the remaining 10% by a factor of 10. Why does this aggressive pruning not destroy model quality?