
Merging Techniques Deep Dive

You've decided to merge your adapters rather than retrain. Now comes the technical question: how exactly do you combine model weights?

The simplest approach—averaging—often works surprisingly well. But when capabilities conflict or adapters are highly specialized, naive averaging degrades both. This lesson explores the major merging techniques: Linear, SLERP, TIES, and DARE. You'll understand not just what each does, but why it works and when to use it.

The Problem: Parameter Interference

Before diving into solutions, understand the problem they solve.

When you fine-tune two adapters on different tasks, they adjust overlapping parameters in different directions:

Base Model Weight (Layer 5, Neuron 42): 0.15

Persona Adapter Change: +0.08 → New value: 0.23
Agentic Adapter Change: -0.05 → New value: 0.10

What should the merged value be? Options:

  • Average: (0.23 + 0.10) / 2 = 0.165
  • Keep persona: 0.23
  • Keep agentic: 0.10
  • Something else?

Simple averaging gives 0.165, but this represents neither capability well. The persona adapter wanted to increase this weight; the agentic adapter wanted to decrease it. Averaging produces a compromise that satisfies neither.

This is parameter interference: overlapping modifications that cancel or distort each other when combined naively.

Why Interference Matters

Not all parameters experience interference:

| Interference Type | Frequency | Impact |
|---|---|---|
| No overlap | ~70% of parameters | Both changes preserved perfectly |
| Same direction | ~15% of parameters | Changes reinforce (good) |
| Opposite direction | ~15% of parameters | Changes cancel (bad) |

The ~15% of conflicting parameters can significantly harm capability. Merging techniques differ in how they handle this minority of problematic parameters.

Technique 1: Linear Interpolation (Baseline)

The simplest approach: weighted average of all parameters.

How It Works

# Conceptually:
merged_weight = alpha * adapter_1_weight + (1 - alpha) * adapter_2_weight

# Example with alpha = 0.5 (equal weighting):
merged = 0.5 * persona_value + 0.5 * agentic_value

For task vectors (differences from base model):

merged_delta = alpha * delta_1 + (1 - alpha) * delta_2
merged_model = base_model + merged_delta
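
To see this end to end, here is a minimal runnable sketch using PyTorch. The tensors are toy stand-ins for a single weight matrix; the values and alpha are illustrative, not taken from real checkpoints.

import torch

# Illustrative stand-ins for one weight matrix from each checkpoint.
base = torch.tensor([[0.15, -0.30], [0.42, 0.08]])
persona = torch.tensor([[0.23, -0.28], [0.40, 0.10]])   # base + persona delta
agentic = torch.tensor([[0.10, -0.35], [0.45, 0.08]])   # base + agentic delta

alpha = 0.5

# Merge in task-vector space: average the deltas, then add them back to base.
delta_persona = persona - base
delta_agentic = agentic - base
merged_delta = alpha * delta_persona + (1 - alpha) * delta_agentic
merged = base + merged_delta

# For linear merging (only), this is equivalent to averaging the full weights.
assert torch.allclose(merged, alpha * persona + (1 - alpha) * agentic)
print(merged)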

Strengths

| Strength | Why It Matters |
|---|---|
| Simple | Easy to understand, debug, implement |
| Fast | Single pass through parameters |
| Adjustable | Alpha controls the blend ratio |

Weaknesses

| Weakness | Impact |
|---|---|
| Ignores conflicts | Averaging destroys opposing changes |
| Dilutes strong signals | Important changes get reduced |
| No redundancy handling | Keeps unnecessary parameter changes |

When to Use

  • Quick baseline to see if merging is viable
  • Adapters trained on very similar data (few conflicts expected)
  • One adapter dominates (set alpha = 0.8 or higher)

MergeKit Configuration

merge_method: linear
slices:
  - sources:
      - model: ./persona_adapter
        layer_range: [0, 32]
        parameters:
          weight: 0.5
      - model: ./agentic_adapter
        layer_range: [0, 32]
        parameters:
          weight: 0.5
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16

Technique 2: SLERP (Spherical Linear Interpolation)

SLERP interpolates vectors along the surface of a sphere rather than in a straight line.

The Intuition

Imagine model weights as a point on a high-dimensional sphere. Linear interpolation moves directly between two points—but this path cuts through the interior of the sphere, reducing magnitude.

SLERP stays on the sphere's surface, preserving the "size" of the weight vector.

        Linear (straight chord, cuts through the interior)
  A ●───────────────────────● B
     \                     /
      `──.             .──'
          `───────────'
        SLERP (arc along the sphere's surface)

Why Magnitude Matters

Research suggests that weight magnitude encodes learned strength. Reducing magnitude through linear interpolation may weaken learned behaviors. SLERP preserves magnitude while smoothly transitioning between models.
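
Here is a minimal SLERP sketch over two flattened weight tensors, assuming PyTorch and the standard spherical-interpolation formula. It mirrors the geometry described above rather than MergeKit's internal code, and the tensor sizes are illustrative.

import torch

def slerp(v0: torch.Tensor, v1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    # Use unit directions to find the angle between the two weight vectors.
    v0_u = v0 / (v0.norm() + eps)
    v1_u = v1 / (v1.norm() + eps)
    dot = torch.clamp(torch.dot(v0_u, v1_u), -1.0, 1.0)
    omega = torch.arccos(dot)            # angle between the vectors
    if omega.abs() < 1e-4:               # nearly parallel: fall back to linear
        return (1 - t) * v0 + t * v1
    sin_omega = torch.sin(omega)
    # Interpolate along the arc; magnitude is preserved when the endpoints
    # have similar norms, unlike the straight-line chord.
    return (torch.sin((1 - t) * omega) / sin_omega) * v0 + (torch.sin(t * omega) / sin_omega) * v1

# Illustrative use on two flattened weight tensors.
persona_w = torch.randn(4096)
agentic_w = torch.randn(4096)
merged_w = slerp(persona_w, agentic_w, t=0.5)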

When SLERP Helps

SLERP matters most when:

  • Combining exactly two models
  • Models have similar training (same task, different random seeds)
  • You want smooth interpolation without magnitude loss

Strengths

| Strength | Why It Matters |
|---|---|
| Preserves magnitude | Learned strength maintained |
| Smooth interpolation | Continuous transition between models |
| Geometric properties | Respects high-dimensional structure |

Weaknesses

| Weakness | Impact |
|---|---|
| Only two models | Cannot merge 3+ models directly |
| Still averages conflicts | Same sign-conflict problem as linear |
| Computationally heavier | Requires trigonometric operations |

MergeKit Configuration

merge_method: slerp
slices:
  - sources:
      - model: ./persona_adapter
        layer_range: [0, 32]
      - model: ./agentic_adapter
        layer_range: [0, 32]
parameters:
  t: 0.5  # Interpolation factor (0 = first model, 1 = second)
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16

Technique 3: TIES-Merging (Trim, Elect Sign, Merge)

TIES (Trim, Elect Signs, and Merge) directly addresses parameter interference through a three-step process.

Step 1: Trim

Most fine-tuning changes are small—essentially noise. TIES trims parameters with small magnitude changes:

Before Trim:
Parameter 1: +0.002 (tiny change)
Parameter 2: +0.150 (significant change)
Parameter 3: -0.001 (tiny change)
Parameter 4: -0.200 (significant change)

After Trim (keep top 50%):
Parameter 1: 0 (trimmed)
Parameter 2: +0.150 (kept)
Parameter 3: 0 (trimmed)
Parameter 4: -0.200 (kept)

This focuses the merge on parameters that actually matter.
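
A small sketch of the trim step, assuming PyTorch; the toy deltas and the 50% density match the worked example above, with the threshold taken per tensor by magnitude.

import torch

# Toy per-parameter changes (deltas) from one adapter.
delta = torch.tensor([0.002, 0.150, -0.001, -0.200])

density = 0.5                                  # keep the top 50% by magnitude
k = max(1, int(density * delta.numel()))
threshold = delta.abs().topk(k).values.min()   # smallest magnitude that survives

trimmed = torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))
print(trimmed)   # tensor([ 0.0000,  0.1500,  0.0000, -0.2000])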

Step 2: Elect Signs

For parameters where adapters disagree on direction, TIES "votes" to elect a consensus sign:

Parameter 42:
Adapter 1: +0.08 (positive)
Adapter 2: -0.05 (negative)
Adapter 3: +0.12 (positive)

Elected sign: POSITIVE (2 vs 1)

The winning sign determines the final direction.
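
A sketch of sign election across three toy (already-trimmed) delta tensors, assuming PyTorch. The TIES paper elects the sign carrying the larger total magnitude, which reduces to the majority vote shown above when the magnitudes are comparable.

import torch

# Trimmed deltas from three adapters for the same five parameters.
deltas = torch.tensor([
    [ 0.08,  0.00, -0.10,  0.20, 0.00],   # adapter 1
    [-0.05,  0.15,  0.12,  0.18, 0.00],   # adapter 2
    [ 0.12,  0.00,  0.11, -0.02, 0.00],   # adapter 3
])

# Elect one sign per parameter: the sign whose deltas carry more total magnitude.
elected_sign = torch.sign(deltas.sum(dim=0))
print(elected_sign)   # tensor([1., 1., 1., 1., 0.])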

Step 3: Merge with Resolved Signs

After electing signs, TIES averages only the values that agree with the elected sign:

Parameter 42 (elected sign: POSITIVE):
Adapter 1: +0.08 → include
Adapter 2: -0.05 → exclude
Adapter 3: +0.12 → include

Merged value: (0.08 + 0.12) / 2 = 0.10
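
Continuing the sketch in PyTorch: only entries that agree with the elected sign are kept, then averaged over the adapters that actually contributed. The numbers match Parameter 42 above.

import torch

deltas = torch.tensor([
    [ 0.08],   # adapter 1, parameter 42
    [-0.05],   # adapter 2, parameter 42
    [ 0.12],   # adapter 3, parameter 42
])

elected_sign = torch.sign(deltas.sum(dim=0))    # +1 here

# Keep only entries whose sign matches the elected sign.
agrees = (torch.sign(deltas) == elected_sign) & (deltas != 0)
kept = torch.where(agrees, deltas, torch.zeros_like(deltas))

# Average over the adapters that contributed a matching-sign value.
count = agrees.sum(dim=0).clamp(min=1)
merged_delta = kept.sum(dim=0) / count
print(merged_delta)   # tensor([0.1000])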

Why TIES Works

By eliminating noise (trim) and resolving conflicts (elect), TIES produces cleaner merged parameters:

| Problem | TIES Solution |
|---|---|
| Noise in fine-tuning | Trimmed away |
| Sign conflicts | Resolved by voting |
| Magnitude dilution | Only agreed values averaged |

Strengths

| Strength | Why It Matters |
|---|---|
| Handles conflicts | Sign election resolves interference |
| Reduces noise | Trimming focuses on important changes |
| Scales to N models | Can merge 3+ adapters at once |

Weaknesses

| Weakness | Impact |
|---|---|
| Loses minority signals | Minority sign gets discarded |
| Requires density tuning | Trim percentage affects results |
| More complex | Harder to debug than linear |

MergeKit Configuration

merge_method: ties
slices:
  - sources:
      - model: ./persona_adapter
        layer_range: [0, 32]
      - model: ./agentic_adapter
        layer_range: [0, 32]
parameters:
  weight: 0.5   # Per-model weight (applied to each model here)
  density: 0.5  # Keep top 50% of changes (trim rest)
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16

Tuning Density

The density parameter controls how aggressively TIES trims:

| Density | Effect | Use When |
|---|---|---|
| 0.3 | Very aggressive trim | Adapters share many redundant changes |
| 0.5 | Balanced (default) | General purpose |
| 0.7 | Light trim | Adapters have distinct, important changes |
| 1.0 | No trim (sign election only) | You want conflict resolution but keep all values |

Technique 4: DARE (Drop And REscale)

DARE takes an even more aggressive approach: randomly drop most parameter changes, then rescale the survivors.

The Core Insight

Research found that fine-tuned model capabilities are surprisingly robust. You can drop 90% or even 99% of parameter changes and still preserve most functionality. The remaining 10% captures the essential learning.

How It Works

Step 1: Drop

Randomly set most delta parameters to zero:

Before Drop (drop_rate = 0.9):
delta_1: +0.08
delta_2: +0.15
delta_3: -0.05
delta_4: +0.20
delta_5: -0.12

After Drop (random selection):
delta_1: 0 (dropped)
delta_2: +0.15 (kept)
delta_3: 0 (dropped)
delta_4: 0 (dropped)
delta_5: -0.12 (kept)

Step 2: Rescale

Multiply surviving values to compensate for dropped values:

Rescale factor: 1 / (1 - drop_rate) = 1 / 0.1 = 10

Rescaled:
delta_2: +0.15 * 10 = +1.50
delta_5: -0.12 * 10 = -1.20

Why Rescaling Works

The intuition: if you keep 10% of changes, those changes need to do 10x the work to approximate the original effect. Rescaling maintains the expected magnitude of the aggregate change.
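
A minimal drop-and-rescale sketch, assuming PyTorch. Because the mask is sampled randomly, the surviving deltas differ from run to run, which is also why DARE merges vary between runs.

import torch

torch.manual_seed(0)                         # fix the randomness for the example

delta = torch.tensor([0.08, 0.15, -0.05, 0.20, -0.12])
drop_rate = 0.9                              # drop 90% of changes on average

# Step 1: Drop -- each delta survives independently with probability (1 - drop_rate).
keep_mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))

# Step 2: Rescale -- survivors are scaled by 1 / (1 - drop_rate) so the *expected*
# aggregate change matches the original delta.
dare_delta = delta * keep_mask / (1.0 - drop_rate)

print(keep_mask)
print(dare_delta)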

DARE + TIES

MergeKit offers dare_ties: DARE's dropping combined with TIES's sign election. This handles:

  • Redundancy (DARE's aggressive dropping)
  • Conflicts (TIES's sign election)

Strengths

| Strength | Why It Matters |
|---|---|
| Extreme compression | Capabilities survive with only a small fraction of changes kept |
| Reduces interference | Fewer non-zero parameters means fewer conflicts |
| Fast merging | Sparse operations are efficient |

Weaknesses

| Weakness | Impact |
|---|---|
| Random drops | Results vary run to run |
| Risk of capability loss | Overly aggressive dropping hurts performance |
| Requires tuning | Drop rate is critical |

MergeKit Configuration

merge_method: dare_ties
slices:
  - sources:
      - model: ./persona_adapter
        layer_range: [0, 32]
      - model: ./agentic_adapter
        layer_range: [0, 32]
parameters:
  weight: 0.5
  density: 0.3  # Keep only 30% of parameters (drop 70%)
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16

Technique Comparison Summary

| Technique | Core Mechanism | Best For | Handles Conflicts? | RAM Efficient? |
|---|---|---|---|---|
| Linear | Weighted average | Quick baseline | No | Yes |
| SLERP | Spherical interpolation | 2 similar models | No | Yes |
| TIES | Trim + sign election | Distinct capabilities | Yes | Yes |
| DARE | Drop + rescale | Aggressive compression | Partial | Yes |
| DARE-TIES | Drop + sign election | Best of both | Yes | Yes |

Choosing Your Technique: Decision Tree

How many models are you merging?
│
├── TWO
│     Are they similar (same task, different seeds)?
│     ├── YES → SLERP
│     └── NO  → TIES or DARE-TIES
│
└── THREE or more → TIES or DARE-TIES (SLERP not available)

For two distinct adapters (like persona + agentic):

  1. Start with TIES (density=0.5)
  2. Evaluate both capabilities
  3. If one capability dominates, adjust weights
  4. If both weak, try lower density or DARE-TIES
  5. Compare to linear baseline to ensure TIES helps

Applying to Task API Adapters

For your TaskMaster persona + agentic tool-calling merge:

Recommendation: TIES

Rationale:

  • Two distinct capabilities (persona style vs. structured output)
  • Likely parameter conflicts (both modify generation behavior)
  • Both capabilities are critical (can't sacrifice either)

Starting configuration:

merge_method: ties
slices:
  - sources:
      - model: ./task_api_persona_adapter
        layer_range: [0, 32]
      - model: ./task_api_agentic_adapter
        layer_range: [0, 32]
parameters:
  weight: 0.5
  density: 0.5
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16

If persona traits are weak post-merge, try:

  • Increase persona weight to 0.6
  • Decrease density to 0.3 (more aggressive trimming, which also reduces conflicts)

If tool-calling accuracy drops, try:

  • Increase agentic weight to 0.6
  • Try DARE-TIES (aggressive dropping may reduce interference with structured output)

Try With AI

Use your AI companion to deepen your understanding of these techniques.

Prompt 1: Visualize the Differences

I've read about Linear, SLERP, TIES, and DARE merging techniques.
Help me build intuition by walking through a concrete example:

Imagine two adapters that both modify the same 10 parameters.
Show me what happens to those 10 parameters under each technique:
1. Some parameters where adapters agree
2. Some where they disagree (opposite signs)
3. Some where one has a large change, the other small

Walk through step-by-step how each technique handles each case.
Use simple numbers I can follow mentally.

What you're learning: Concrete understanding—moving from abstract descriptions to intuitive grasp through worked examples.

Prompt 2: Diagnose My Merge Failure

I merged two adapters and the result is bad. Help me diagnose:

Merged model behavior:
- Persona traits are barely visible
- Tool-calling works but feels "generic"
- Neither capability is as strong as the individual adapters

My merge config:
merge_method: linear
weight: 0.5

What went wrong? What technique should I try instead?
Walk me through the reasoning, not just the answer.

What you're learning: Diagnostic reasoning—using technique knowledge to troubleshoot real merge failures.

Prompt 3: Update My Skill

After learning about TIES, SLERP, and DARE, my model-merging skill
from Lesson 0 feels incomplete. It has strategy selection guidance,
but it doesn't explain the *mechanisms*.

Help me write a new section for my skill called "Technique Mechanisms"
that explains in 5-6 sentences each:
1. Why TIES uses sign election (what problem it solves)
2. Why DARE's dropping works (the insight that enables it)
3. When SLERP beats linear (the magnitude preservation benefit)

Keep it brief—this is reference material, not a tutorial.

What you're learning: Skill refinement—integrating new knowledge into your reusable intelligence asset.

Safety Note

Merging techniques make assumptions about parameter distributions and interference patterns that may not hold for all models. The techniques were developed primarily for decoder-only language models (GPT, Llama). Results may vary for other architectures. Always validate merged models on representative test sets before deployment—technique selection affects model behavior in subtle ways.