What If You Could Combine Fine-tuned Models Without Retraining?

Imagine you've fine-tuned one LLaMA model to write Python code and another to answer medical questions. Each model is excellent at its specialty, but you want a single model that handles both. The obvious approach is to train a third model on a combined dataset, but that means collecting and curating all the data again, running another expensive training job, and hoping the model doesn't lose one skill while learning the other. What if you could just combine the weights of the two finished models and get a single model that inherits both capabilities?

That is exactly what model merging does: it takes the parameter tensors of two or more fine-tuned models and combines them into one set of weights, with no additional training, no gradient computation, and no GPU time beyond loading and saving the checkpoints. The merged model is ready for inference immediately.

This sounds like it shouldn't work. Neural network loss landscapes are famously non-convex, so averaging two random models should land you on a high-loss ridge. But there is a key constraint that makes merging viable: all models being merged must share the same architecture and, ideally, the same base model. When two models are fine-tuned from the same pre-trained checkpoint, they start at the same point in weight space and move relatively short distances during fine-tuning. The loss landscape between them turns out to be surprisingly smooth, forming a low-loss "valley" rather than a high-loss ridge. Interpolating between them stays inside that valley.

This observation was formalised in the Model Soups paper (Wortsman et al., 2022), which showed that averaging the weights of multiple fine-tuned variants of the same base model consistently improves accuracy on held-out data, without increasing inference time or model size. The intuition is that each fine-tuned model overfits slightly to different patterns in its training data, and averaging washes out the idiosyncratic noise while preserving the shared, generalizable signal, analogous to how ensembling predictions works but applied to weights instead.

💡 Model merging has become especially popular in the open-source community. Platforms like Hugging Face host hundreds of merged LLaMA variants, and some of the top models on community leaderboards are merges, not independently trained models. The appeal is clear: you get ensemble-like benefits at the cost of a single model at inference time.

Linear Interpolation (LERP)

The simplest way to merge two models is a weighted average of their parameters. Given model A with weights $\theta_A$ and model B with weights $\theta_B$, we compute:

$$\theta_{\text{merged}} = (1 - \lambda) \cdot \theta_A + \lambda \cdot \theta_B$$

Here $\theta_A$ and $\theta_B$ represent the full set of parameters (every weight in every layer) of models A and B respectively, and $\lambda \in [0, 1]$ is a blending coefficient that controls how much each model contributes. The operation is applied element-wise: for every single number in every weight matrix and bias vector, we take the weighted average of the corresponding numbers from A and B.

Let's check the boundary cases. When $\lambda = 0$, we get $(1 - 0) \cdot \theta_A + 0 \cdot \theta_B = \theta_A$, so the merged model is exactly model A. When $\lambda = 1$, we get $0 \cdot \theta_A + 1 \cdot \theta_B = \theta_B$, so we recover model B exactly. At $\lambda = 0.5$, we get an equal average of both models. As $\lambda$ increases from 0 to 1, we smoothly slide along a straight line in weight space from A to B.

Why does this work when both models share a base? Fine-tuning from the same pre-trained checkpoint means both models start at the same point and move in similar (often nearly parallel) directions. The straight line between them stays in a low-loss region. But if A and B were trained from different random initialisations (different base models), the interpolation path may cross a high-loss ridge, because the two models live in entirely different basins of the loss landscape, and the merged weights end up in a region neither model explored during training.

LERP extends naturally to $K$ models with a set of non-negative weights that sum to one:

$$\theta_{\text{merged}} = \sum_{i=1}^{K} w_i \cdot \theta_i, \quad \text{where } \sum_{i=1}^{K} w_i = 1$$

Each $w_i$ controls how much model $i$ contributes. Equal weighting ($w_i = 1/K$ for all $i$) is the simplest choice and often a surprisingly strong baseline, because it equally washes out each model's idiosyncratic overfitting.
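
To make the $K$-model average concrete, here is a minimal sketch on toy weight lists. The five-parameter "models" and the helper name lerp_k are invented for illustration:

```python
def lerp_k(models, weights):
    """Weighted average of K models' flattened weights; weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    n = len(models[0])
    return [sum(w * m[i] for w, m in zip(weights, models)) for i in range(n)]

# Three toy "models", each just five parameters
models = [
    [1.5, 2.3, 2.8, 4.6, 5.1],
    [0.8, 1.7, 3.5, 3.9, 5.8],
    [1.2, 2.0, 3.1, 4.2, 5.0],
]

equal = lerp_k(models, [1/3, 1/3, 1/3])   # equal-weight baseline
print([round(x, 3) for x in equal])
```

Equal weights correspond to the uniform-soup baseline; skewing the weights toward one model biases the merge toward its specialty.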

A more principled approach is task arithmetic (Ilharco et al., 2023). Instead of averaging the full weights, we first extract the task vector for each model: the change from the base model caused by fine-tuning.

$$\tau_A = \theta_A - \theta_{\text{base}}$$

The task vector $\tau_A$ captures exactly what fine-tuning changed. It's a vector in weight space with the same dimensionality as $\theta_A$, where each element records how much that particular weight moved during training. We then build the merged model by starting from the base and adding scaled task vectors:

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda_A \cdot \tau_A + \lambda_B \cdot \tau_B$$

Here $\lambda_A$ and $\lambda_B$ are independent scaling coefficients (they don't need to sum to 1). This formulation has a powerful advantage over plain averaging: you can also subtract task vectors to remove capabilities. Setting $\lambda_A$ to a negative value pushes the model away from task A behavior. For example, if $\tau_{\text{toxic}}$ captures the direction in weight space that makes a model produce toxic outputs, then $\theta_{\text{base}} - \lambda \cdot \tau_{\text{toxic}}$ yields a model that is steered away from that behavior. The larger the magnitude of $\lambda$, the stronger the repulsion. This is a remarkably cheap form of model editing: no retraining, no RLHF, just vector arithmetic on the weights.

💡 Task arithmetic only works well when the task vectors are relatively small compared to the base weights (which is typically the case with fine-tuning). If the models diverge too far from the base, the linear approximation breaks down and merging quality degrades.

The code below demonstrates both LERP and task arithmetic on small weight arrays, showing how the merged values sit between the originals and how subtraction pushes weights in the opposite direction.

import json, js

# Simulate small "models" as weight arrays
base   = [1.0, 2.0, 3.0, 4.0, 5.0]
modelA = [1.5, 2.3, 2.8, 4.6, 5.1]   # fine-tuned for task A
modelB = [0.8, 1.7, 3.5, 3.9, 5.8]   # fine-tuned for task B

# ---- LERP: weighted average ----
lam = 0.5
lerp = [(1 - lam) * a + lam * b for a, b in zip(modelA, modelB)]

# ---- Task arithmetic ----
tauA = [a - b for a, b in zip(modelA, base)]  # task vector A
tauB = [a - b for a, b in zip(modelB, base)]  # task vector B

# Add both task vectors (lambda=0.5 each)
merged_add = [b + 0.5 * ta + 0.5 * tb for b, ta, tb in zip(base, tauA, tauB)]

# Subtract task vector A (remove task A skill)
merged_sub = [b - 0.5 * ta for b, ta in zip(base, tauA)]

rows = []
for i in range(len(base)):
    rows.append([
        f"w[{i}]",
        f"{base[i]:.2f}",
        f"{modelA[i]:.2f}",
        f"{modelB[i]:.2f}",
        f"{tauA[i]:+.2f}",
        f"{tauB[i]:+.2f}",
        f"{lerp[i]:.2f}",
        f"{merged_add[i]:.2f}",
        f"{merged_sub[i]:.2f}"
    ])

js.window.py_table_data = json.dumps({
    "headers": ["Param", "Base", "Model A", "Model B",
                "tau_A", "tau_B", "LERP(0.5)", "Add(0.5,0.5)", "Sub(-0.5*A)"],
    "rows": rows
})

print("LERP: weighted average of A and B (lambda=0.5)")
print("Add:  base + 0.5*tauA + 0.5*tauB")
print("Sub:  base - 0.5*tauA (pushes AWAY from task A)")
print()
print("Notice: LERP and Add give similar but not identical results.")
print("Sub pushes w[0] from 1.0 to 0.75 (opposite direction from A's 1.5).")

SLERP: Interpolation on a Hypersphere

Linear interpolation has a subtle problem: it doesn't preserve the magnitude (norm) of the weight vectors. When two vectors point in different directions, the straight line between them cuts through the interior of the hypersphere, meaning the interpolated vector is shorter than either endpoint. In high-dimensional weight space, this can cause a "dip" in the effective scale of the weights at the midpoint, potentially degrading model quality.

Spherical Linear Interpolation (SLERP) solves this by interpolating along the surface of the hypersphere, following the shortest arc (geodesic) between the two vectors rather than the straight chord. The formula is:

$$\theta_{\text{merged}} = \frac{\sin((1-\lambda)\,\Omega)}{\sin \Omega} \cdot \theta_A + \frac{\sin(\lambda\,\Omega)}{\sin \Omega} \cdot \theta_B$$

where $\Omega$ is the angle between the two weight vectors:

$$\Omega = \arccos\!\left(\frac{\theta_A \cdot \theta_B}{\|\theta_A\|\;\|\theta_B\|}\right)$$

Let's unpack every piece. $\theta_A \cdot \theta_B$ is the dot product of the two weight vectors, $\|\theta_A\|$ and $\|\theta_B\|$ are their Euclidean norms (lengths), and the ratio is the cosine of the angle between them. The $\arccos$ converts this cosine into the actual angle $\Omega$ in radians. The two $\sin$ terms in the interpolation formula are carefully constructed so that the result follows the great-circle arc on the sphere's surface rather than cutting through its interior.

Check the boundaries. When $\lambda = 0$: the first coefficient becomes $\sin(\Omega)/\sin(\Omega) = 1$ and the second becomes $\sin(0)/\sin(\Omega) = 0$, so we get $\theta_A$ exactly. When $\lambda = 1$: the first coefficient is $\sin(0)/\sin(\Omega) = 0$ and the second is $\sin(\Omega)/\sin(\Omega) = 1$, so we get $\theta_B$ exactly, same endpoints as LERP. When $\Omega \to 0$ (the models are nearly identical, so the angle between their weight vectors is tiny), $\sin(x) \approx x$ for small $x$, and the SLERP formula reduces to $(1-\lambda) \cdot \theta_A + \lambda \cdot \theta_B$, which is exactly LERP. So SLERP gracefully degrades to LERP when the two models are close, and only differs meaningfully when they diverge in direction.

When $\Omega = \pi/2$ (the vectors are orthogonal) and the two vectors have equal norm, SLERP maintains that full magnitude at every point along the arc, whereas LERP at the midpoint produces a vector with magnitude $\|\theta_A\|/\sqrt{2}$, roughly a 30% reduction. For large language models, where the scale of weight matrices affects layer activations, this magnitude preservation matters.

In practice, SLERP is not applied to the entire model as a single flattened vector. Instead, it is applied per-layer or per-tensor: each weight matrix in the model is treated as its own vector and interpolated independently. This respects the fact that different layers may have different scales and different angular separations between the two models.

The code below computes both LERP and SLERP on simple 2D vectors so we can see how the arc path differs from the straight line. Notice how SLERP maintains a constant distance from the origin (constant norm) at every interpolation step, while LERP dips inward at the midpoint.

import math, json, js

def norm(v):
    return math.sqrt(sum(x*x for x in v))

def dot(a, b):
    return sum(x*y for x, y in zip(a, b))

def slerp(a, b, t):
    cos_omega = dot(a, b) / (norm(a) * norm(b))
    cos_omega = max(-1.0, min(1.0, cos_omega))  # numerical clamp
    omega = math.acos(cos_omega)
    if abs(omega) < 1e-8:  # nearly identical => fall back to LERP
        return [(1-t)*x + t*y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    w1 = math.sin((1-t) * omega) / sin_omega
    w2 = math.sin(t * omega) / sin_omega
    return [w1*x + w2*y for x, y in zip(a, b)]

def lerp(a, b, t):
    return [(1-t)*x + t*y for x, y in zip(a, b)]

# Two 2D vectors roughly 53 degrees apart (cosine similarity 0.6)
vecA = [3.0, 1.0]
vecB = [1.0, 3.0]

steps = [0.0, 0.25, 0.5, 0.75, 1.0]
rows = []
for t in steps:
    sl = slerp(vecA, vecB, t)
    lr = lerp(vecA, vecB, t)
    rows.append([
        f"{t:.2f}",
        f"({lr[0]:.3f}, {lr[1]:.3f})",
        f"{norm(lr):.3f}",
        f"({sl[0]:.3f}, {sl[1]:.3f})",
        f"{norm(sl):.3f}"
    ])

angle = math.degrees(math.acos(dot(vecA, vecB) / (norm(vecA) * norm(vecB))))
js.window.py_table_data = json.dumps({
    "headers": ["lambda", "LERP result", "LERP norm", "SLERP result", "SLERP norm"],
    "rows": rows
})

print(f"Vector A: {vecA}, norm = {norm(vecA):.3f}")
print(f"Vector B: {vecB}, norm = {norm(vecB):.3f}")
print(f"Angle between them: {angle:.1f} degrees")
print()
print("Both norms are {0:.3f} at endpoints.".format(norm(vecA)))
print("At lambda=0.5: LERP norm dips, SLERP norm stays constant.")
💡 SLERP can only merge exactly two models at a time (it interpolates between two points on a sphere). To merge three or more models, you either chain pairwise SLERP operations (merge A and B first, then merge the result with C) or switch to a different method like TIES or DARE that natively supports multiple models.
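
To illustrate the chaining idea, here is a sketch that merges three toy weight vectors with two pairwise SLERP calls. The t values are one plausible equal-contribution scheme (each model ends up contributing roughly one third), not a prescribed recipe:

```python
import math

def norm(v): return math.sqrt(sum(x * x for x in v))
def dot(a, b): return sum(x * y for x, y in zip(a, b))

def slerp(a, b, t):
    cos_omega = max(-1.0, min(1.0, dot(a, b) / (norm(a) * norm(b))))
    omega = math.acos(cos_omega)
    if abs(omega) < 1e-8:                      # near-parallel: fall back to LERP
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    return [math.sin((1 - t) * omega) / s * x + math.sin(t * omega) / s * y
            for x, y in zip(a, b)]

vecA, vecB, vecC = [3.0, 1.0], [1.0, 3.0], [2.0, 2.0]

ab  = slerp(vecA, vecB, 0.5)        # A and B contribute 1/2 each
abc = slerp(ab, vecC, 1.0 / 3.0)    # fold in C at t=1/3, so each of the
                                    # three models contributes ~1/3
print([round(x, 3) for x in abc])
```

Note that chained SLERP is order-dependent and only approximately balances the three models; that asymmetry is one reason multi-model merges usually reach for TIES or DARE instead.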

TIES-Merging: Handling Interference

When merging three or more models, a new problem emerges: interference. Model A might want a particular weight to increase (positive task vector), while model B wants the same weight to decrease (negative task vector). Averaging them cancels out both changes, and the merged model ends up at the base value, as if neither model had learned anything for that parameter. With many models, this destructive interference can silently erase the most important updates.

TIES-Merging (Yadav et al., 2023) addresses this directly. The name stands for Trim, Elect Sign, and Merge, and each step targets a specific failure mode of naive averaging.

Step 1: Trim. Most parameters change only slightly during fine-tuning. These small-magnitude updates are more noise than signal, and including them in the merge adds variance without meaningful information. TIES removes all task-vector entries whose absolute value falls below a threshold $k$, keeping only the top fraction by magnitude. Typically, practitioners keep the top 20% (i.e., $k$ is set to the 80th percentile of absolute values). At the boundaries: if $k = 0$, nothing is trimmed and we get a plain average. If $k$ is so large that everything is trimmed, we recover the base model unchanged.

Step 2: Elect Sign. After trimming, for each remaining parameter position, we look at the signs of the surviving task-vector entries across all models and take a majority vote. If two out of three models want weight $j$ to increase (positive sign) and one wants it to decrease, the elected sign for position $j$ is positive. This resolves the interference: instead of averaging conflicting directions and getting zero, we commit to the direction that most models agree on.

Step 3: Merge. Finally, for each parameter, we average only the task-vector values that agree with the elected sign. Any model whose task vector disagrees at that position is excluded from the average for that parameter. The result is a merged task vector that preserves the intentional, consensus-driven updates and discards the conflicting noise.

The final merged model is constructed by adding this cleaned task vector back to the base:

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \cdot \tau_{\text{TIES}}$$

where $\tau_{\text{TIES}}$ is the merged task vector after trimming, sign election, and selective averaging, and $\lambda$ is a global scaling factor (typically between 0.5 and 1.0) that controls how strongly the merged updates are applied.

Why does this three-step procedure work better than simple averaging? Trimming removes the vast majority of parameters (80% are zeroed out), which eliminates the noise that would dilute the merge. Sign election resolves the directional conflicts that would otherwise cancel out meaningful updates. And selective averaging ensures that the surviving values all point in the same direction, so they reinforce rather than interfere with each other. Each step addresses a specific failure mode, and together they produce significantly cleaner merges, especially as the number of models increases beyond two.

💡 The 80% trim ratio may seem aggressive, but recall from article 3 (LoRA) that fine-tuning updates have very low intrinsic dimensionality. Most individual parameter changes are negligible. Trimming them doesn't lose meaningful information; it removes the noise floor that would otherwise dilute the merge.
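
The three steps can be sketched on toy task vectors. This is an illustrative simplification, not the paper's reference implementation: trimming here keeps the top 40% of entries per vector so that something survives in a five-parameter example.

```python
base = [1.0, 2.0, 3.0, 4.0, 5.0]
# Task vectors from three hypothetical fine-tunes
taus = [[ 0.50, -0.02,  0.30,  0.01, -0.40],
        [ 0.45,  0.03, -0.25, -0.02, -0.35],
        [-0.40,  0.01,  0.35,  0.50, -0.30]]

def trim(tau, keep=0.4):
    """Step 1: zero out all but the largest-magnitude fraction of entries."""
    k = max(1, int(round(keep * len(tau))))
    cutoff = sorted((abs(x) for x in tau), reverse=True)[k - 1]
    return [x if abs(x) >= cutoff else 0.0 for x in tau]

trimmed = [trim(t) for t in taus]

# Step 2: elect a sign per position. Summing the trimmed values compares
# total positive mass against total negative mass.
elected = [1.0 if sum(t[j] for t in trimmed) >= 0 else -1.0
           for j in range(len(base))]

# Step 3: average only the entries that agree with the elected sign
merged_tau = []
for j in range(len(base)):
    agree = [t[j] for t in trimmed if t[j] != 0.0 and t[j] * elected[j] > 0]
    merged_tau.append(sum(agree) / len(agree) if agree else 0.0)

lam = 1.0
merged = [b + lam * mt for b, mt in zip(base, merged_tau)]
print("merged tau:", [round(x, 3) for x in merged_tau])
print("merged    :", [round(x, 3) for x in merged])
```

At position 0 the naive average of (0.50, 0.45, -0.40) would be diluted to about 0.18; sign election instead keeps the two agreeing updates and averages them to 0.475.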

DARE: Random Pruning Before Merging

TIES uses a deterministic threshold to prune small updates. DARE (Drop And REscale) takes a radically different approach: instead of trimming by magnitude, it randomly drops a large fraction of task-vector entries and rescales the survivors to compensate. The paper (Yu et al., 2024) has one of the more memorable titles in the field: "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch."

The procedure for each model's task vector $\tau_i$ is:

$$\tilde{\tau}_i = \frac{1}{1-p} \cdot (m \odot \tau_i)$$

Let's break this down. $m$ is a binary mask of the same shape as $\tau_i$, where each entry is independently drawn: 1 with probability $(1-p)$ and 0 with probability $p$. The operator $\odot$ denotes element-wise (Hadamard) multiplication, so $m \odot \tau_i$ zeros out a random fraction $p$ of the task vector's entries. The factor $\frac{1}{1-p}$ rescales the surviving entries upward so that the pruned vector equals the original in expectation. If we drop 90% of entries ($p = 0.9$), the remaining 10% are each multiplied by $\frac{1}{1-0.9} = 10$.

This is conceptually identical to dropout, but applied to weight updates rather than activations. During training, dropout randomly silences neurons to prevent co-adaptation. DARE randomly silences weight changes to reduce interference between models during merging.

The boundary cases are instructive. When $p = 0$ (no dropping), the mask $m$ is all ones, the rescaling factor is $\frac{1}{1-0} = 1$, and we recover the original task vector exactly, which gives us a standard merge. When $p = 0.9$, we drop 90% of all fine-tuning updates and inflate the remaining 10% by a factor of 10. This sounds extreme, but the paper shows it works remarkably well. The reason connects back to the intrinsic dimensionality argument from the LoRA article (article 3 of this track): the meaningful component of fine-tuning lives in a very low-dimensional subspace. Roughly 90% of individual parameter updates are redundant, so randomly dropping 90% of them and rescaling the rest approximately preserves the net effect on model behaviour.

After applying DARE to each model's task vector, the pruned-and-rescaled vectors $\tilde{\tau}_i$ are merged, often using TIES for the combination step. This gives us the best of both approaches: DARE reduces the density of updates (fewer non-zero entries means fewer chances for interference), and TIES resolves whatever conflicts remain. The merged model is then:

$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \sum_{i=1}^{K} \tilde{\tau}_i$$

where the sum runs over all $K$ models and $\lambda$ is a global scaling coefficient.

💡 The DARE paper reports that $p = 0.9$ (dropping 90%) is optimal across a range of tasks and model sizes. This is a striking empirical result: nine out of every ten parameter changes from fine-tuning can be discarded at random without meaningfully hurting the merged model's performance.
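
A DARE sketch on a toy task vector (the vector, the value of p, and the fixed seed are all arbitrary choices for the demo):

```python
import random

def dare(tau, p=0.9, seed=0):
    """Drop each entry with probability p; rescale survivors by 1/(1-p)."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    return [x * scale if rng.random() >= p else 0.0 for x in tau]

# 10,000 small, mixed-sign "fine-tuning updates"
tau = [0.01 * (i % 7 - 3) for i in range(10_000)]
pruned = dare(tau, p=0.9)

kept = sum(1 for x in pruned if x != 0.0)
print(f"kept {kept} of {len(tau)} entries (~{kept / len(tau):.1%})")
print(f"sum before: {sum(tau):+.3f}   sum after: {sum(pruned):+.3f}")
```

Roughly 10% of entries survive, each inflated tenfold, so aggregate statistics like the sum are preserved in expectation (the realised sum fluctuates from seed to seed).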

Merging in Practice with mergekit

All of the methods above (LERP, SLERP, task arithmetic, TIES, and DARE) are implemented in mergekit (Goddard et al., 2024), the standard open-source toolkit for model merging. It takes a YAML configuration file that specifies which models to merge, which method to use, and what parameters to apply, then produces a merged checkpoint you can load directly into Hugging Face Transformers.

Here is a SLERP configuration for merging two models. SLERP is the recommended default for two-model merges because it preserves weight magnitudes:

# slerp_merge.yml
merge_method: slerp
slices:
  - sources:
      - model: model_A_path    # e.g. "username/llama-code"
        layer_range: [0, 32]
      - model: model_B_path    # e.g. "username/llama-medical"
        layer_range: [0, 32]
parameters:
  t:                           # interpolation parameter (lambda)
    - filter: self_attn        # attention layers: favour model A
      value: 0.4
    - filter: mlp              # MLP layers: favour model B
      value: 0.6
    - value: 0.5               # everything else: equal blend
base_model: model_A_path
dtype: float16

Notice that mergekit lets you set different interpolation weights per layer type. This is powerful because different layers may carry different specialisations. Attention layers might encode more of one model's reasoning patterns while MLP layers store more factual knowledge, so weighting them differently can produce better merges than a uniform $\lambda$.

For merging three or more models, TIES with DARE is a strong default. Here is an example configuration:

# dare_ties_merge.yml
merge_method: dare_ties
base_model: base_model_path    # e.g. "meta-llama/Llama-2-7b-hf"
models:
  - model: model_code_path     # code fine-tune
    parameters:
      weight: 0.4              # contribution weight
      density: 0.1             # DARE: keep 10% of updates (p=0.9)
  - model: model_medical_path  # medical fine-tune
    parameters:
      weight: 0.3
      density: 0.1
  - model: model_chat_path     # chat fine-tune
    parameters:
      weight: 0.3
      density: 0.1
parameters:
  normalize: true              # normalise weights to sum to 1
dtype: float16

The density parameter controls DARE's drop rate: density: 0.1 means keep 10% of updates (drop 90%), matching the $p = 0.9$ recommendation from the DARE paper. The weight parameter sets each model's contribution to the final merge.

Running the merge is a single command:

mergekit-yaml dare_ties_merge.yml ./merged_model --cuda

The merged checkpoint appears in ./merged_model, ready to load with AutoModelForCausalLM.from_pretrained("./merged_model").

Here are practical tips distilled from the community's experience with model merging:

  • Always evaluate merged models. Merging can degrade quality in subtle ways. Run your standard benchmarks and a few manual test prompts before deploying. A merge that looks good on paper might hallucinate more or lose instruction-following ability.
  • SLERP for two models, TIES or DARE for three or more. SLERP's magnitude preservation makes it the best default for pairwise merges. For multi-model merges, TIES and DARE handle interference that LERP cannot.
  • Start with equal weights, then tune. Equal weighting is a surprisingly strong baseline. If one model is clearly stronger, increase its weight incrementally and evaluate after each change.
  • Same base model is critical. Merging models fine-tuned from different base architectures or even different training runs of the same architecture almost always fails. The shared initialisation is what makes the loss landscape smooth enough for interpolation to work.
  • Check the community. The Hugging Face Open LLM Leaderboard features many merged models, and practitioners share their merge configurations openly. Studying successful merges is one of the fastest ways to develop intuition for what works.

Model merging completes our survey of the fine-tuning toolbox. Together with full fine-tuning (article 2), parameter-efficient methods like LoRA and QLoRA (articles 3-5), data preparation strategies (article 6), and evaluation methods (article 9), you now have a comprehensive toolkit for adapting foundation models to any task. Merging adds a uniquely powerful capability: the ability to combine multiple specialised models into one, for free, without any additional training. In a field where GPU hours are expensive and fine-tuned models are abundant, that is a remarkably practical superpower.

Quiz

Test your understanding of model merging methods.

Why does linear interpolation (LERP) between two fine-tuned models work well when they share the same base model?

What advantage does SLERP have over LERP for model merging?

In TIES-Merging, what is the purpose of the "Elect Sign" step?

DARE randomly drops 90% of fine-tuning updates and rescales the remaining 10% by a factor of 10. Why does this aggressive pruning not destroy model quality?