
Capstone - Merge Task API Adapters

You've built two specialized adapters across Chapters 65-66: the TaskMaster persona adapter that gives your model a distinctive, encouraging voice, and the agentic adapter that enables reliable tool-calling. Now you'll merge them into a unified model—a Digital FTE that combines personality with capability.

This capstone is Layer 4: Spec-Driven Integration. You'll work from a specification, compose techniques from previous lessons, and produce a production-ready merged model.

The Specification

Task API Unified Model Specification

Intent: Merge persona and agentic adapters into a single model that maintains TaskMaster's voice while reliably executing tool calls—the complete Task API Digital FTE.

Success Criteria:

| Metric | Target | Measurement |
|--------|--------|-------------|
| Persona preservation | >85% | Voice consistency, tone, encouragement patterns |
| Agentic preservation | >90% | Tool selection, argument extraction, JSON validity |
| Combined capability | No regression | Both capabilities function when invoked together |
| Latency | <500 ms | p95 inference time |
| Memory | <8 GB | Inference RAM requirement |

Source Adapters:

  • ./adapters/task_api_persona (Chapter 65)
  • ./adapters/task_api_agentic (Chapter 66)
  • Base model: unsloth/Llama-3.2-3B-Instruct

Merge Strategy: TIES (primary), with DARE-TIES backup if conflicts detected

Non-Goals:

  • General conversational capability beyond Task API context
  • Support for models other than the base model used in training
  • Real-time streaming (batch inference is sufficient)

Phase 1: Adapter Analysis (15 minutes)

Load and Inspect Adapters

Before merging, understand what you're combining:

from safetensors import safe_open
from pathlib import Path
import numpy as np
import re

def analyze_adapter(adapter_path: str) -> dict:
    """Analyze adapter structure and weight distribution."""

    adapter_files = list(Path(adapter_path).glob("*.safetensors"))
    if not adapter_files:
        raise ValueError(f"No safetensors found in {adapter_path}")

    analysis = {
        "path": adapter_path,
        "layers": {},
        "total_params": 0,
        "weight_stats": []
    }

    for shard in adapter_files:
        with safe_open(str(shard), framework="pt", device="cpu") as f:
            for key in f.keys():
                tensor = f.get_tensor(key)
                params = tensor.numel()
                analysis["total_params"] += params

                # Track layer info: group by transformer layer index (e.g. "layers.0"),
                # falling back to the raw key for non-layer weights
                match = re.search(r"layers\.(\d+)\.", key)
                layer_name = f"layers.{match.group(1)}" if match else key
                if layer_name not in analysis["layers"]:
                    analysis["layers"][layer_name] = {"params": 0, "keys": []}
                analysis["layers"][layer_name]["params"] += params
                analysis["layers"][layer_name]["keys"].append(key)

                # Weight statistics
                analysis["weight_stats"].append({
                    "key": key,
                    "shape": list(tensor.shape),
                    "mean": float(tensor.mean()),
                    "std": float(tensor.std()),
                    "max": float(tensor.max()),
                    "min": float(tensor.min())
                })

    return analysis

# Analyze both adapters
persona_analysis = analyze_adapter("./adapters/task_api_persona")
agentic_analysis = analyze_adapter("./adapters/task_api_agentic")

print(f"Persona Adapter:")
print(f" Total parameters: {persona_analysis['total_params']:,}")
print(f" Layers: {len(persona_analysis['layers'])}")

print(f"\nAgentic Adapter:")
print(f" Total parameters: {agentic_analysis['total_params']:,}")
print(f" Layers: {len(agentic_analysis['layers'])}")

Output:

Persona Adapter:
Total parameters: 4,194,304
Layers: 28

Agentic Adapter:
Total parameters: 4,194,304
Layers: 28

Check Compatibility

def check_merge_compatibility(adapter_a: dict, adapter_b: dict) -> dict:
    """Verify adapters are compatible for merging."""

    compatibility = {
        "compatible": True,
        "issues": [],
        "warnings": []
    }

    # Check parameter count
    if adapter_a["total_params"] != adapter_b["total_params"]:
        compatibility["issues"].append(
            f"Parameter count mismatch: {adapter_a['total_params']} vs {adapter_b['total_params']}"
        )
        compatibility["compatible"] = False

    # Check layer structure
    if set(adapter_a["layers"].keys()) != set(adapter_b["layers"].keys()):
        compatibility["issues"].append("Layer structure mismatch")
        compatibility["compatible"] = False

    # Check for overlapping weight distributions (potential conflict)
    for stat_a in adapter_a["weight_stats"][:5]:  # Sample check
        for stat_b in adapter_b["weight_stats"][:5]:
            if stat_a["key"] == stat_b["key"]:
                # Check sign agreement
                if np.sign(stat_a["mean"]) != np.sign(stat_b["mean"]):
                    compatibility["warnings"].append(
                        f"Potential sign conflict: {stat_a['key']}"
                    )

    return compatibility

compat = check_merge_compatibility(persona_analysis, agentic_analysis)
print(f"Compatible: {compat['compatible']}")
if compat["issues"]:
    print(f"Issues: {compat['issues']}")
if compat["warnings"]:
    print(f"Warnings: {compat['warnings']}")

Output:

Compatible: True
Warnings: ['Potential sign conflict: model.layers.0.self_attn.q_proj.lora_A']

Sign conflicts are expected—that's why we use TIES.
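
To see how TIES handles these conflicts, here is a minimal NumPy sketch of its trim-elect-merge steps: trim low-magnitude deltas, elect a per-parameter sign by total magnitude, then average only the deltas that agree with the elected sign. This illustrates the algorithm; it is not mergekit's internal implementation.

import numpy as np

def ties_merge_sketch(deltas: list[np.ndarray], density: float = 0.5) -> np.ndarray:
    """Illustrative TIES merge of per-adapter delta tensors (not mergekit internals)."""
    trimmed = []
    for d in deltas:
        # Trim: keep only the top-`density` fraction of values by magnitude
        k = int(d.size * density)
        threshold = np.sort(np.abs(d).ravel())[-k] if k > 0 else np.inf
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))

    stacked = np.stack(trimmed)                       # shape: (n_adapters, *tensor_shape)
    # Elect: pick the sign with the larger total magnitude per parameter
    elected_sign = np.sign(np.sum(stacked, axis=0))
    # Merge: average only the deltas whose sign agrees with the elected sign
    agrees = np.sign(stacked) == elected_sign
    counts = np.maximum(agrees.sum(axis=0), 1)
    return np.where(agrees, stacked, 0.0).sum(axis=0) / counts

# Two toy deltas with a deliberate sign conflict in the first element
merged = ties_merge_sketch([np.array([0.8, -0.1]), np.array([-0.2, 0.5])])
print(merged)  # the conflicting element resolves toward the higher-magnitude (positive) delta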

Phase 2: Baseline Merge (15 minutes)

Create TIES Configuration

# merge_config.yaml
merge_method: ties
slices:
  - sources:
      - model: ./adapters/task_api_persona
        layer_range: [0, 28]
      - model: ./adapters/task_api_agentic
        layer_range: [0, 28]
parameters:
  weight: 0.5   # Equal weighting to start
  density: 0.5  # Default density
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16
tokenizer_source: base
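
A merge run takes a few minutes, so a quick pre-flight check that the config parses and the local adapter paths exist can save a retry. A minimal sketch using plain YAML parsing (this inspects the file directly and is not a mergekit API):

import yaml
from pathlib import Path

# Pre-flight check: parse the config and confirm local adapter paths exist
with open("merge_config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["merge_method"] in {"ties", "dare_ties"}, "unexpected merge method"
for source in cfg["slices"][0]["sources"]:
    path = source["model"]
    if path.startswith("./") and not Path(path).exists():
        raise FileNotFoundError(f"adapter path not found: {path}")
print("merge_config.yaml parses and adapter paths exist")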

Execute Merge

# Run merge with memory optimization
mergekit-yaml merge_config.yaml ./merged_v1 --lazy --low-cpu-mem

# Verify output
ls -la ./merged_v1/

Output:

Loading base model: unsloth/Llama-3.2-3B-Instruct
Loading adapter: ./adapters/task_api_persona
Loading adapter: ./adapters/task_api_agentic
Applying TIES merging...
Trimming with density=0.5
Resolving 1,245 sign conflicts
Merging remaining parameters
Saving to ./merged_v1
Merge complete in 3m 42s

total 6.1G
-rw-r--r-- 1 user user 2.0G model-00001-of-00003.safetensors
-rw-r--r-- 1 user user 2.0G model-00002-of-00003.safetensors
-rw-r--r-- 1 user user 2.1G model-00003-of-00003.safetensors
-rw-r--r-- 1 user user 654 config.json
-rw-r--r-- 1 user user 500K tokenizer.json
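
Before investing in the full evaluation suite, a quick smoke test confirms the merged checkpoint loads and generates. A minimal sketch with transformers (the prompt is illustrative; device_map="auto" assumes accelerate is installed, and you can drop it to load on CPU):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the freshly merged model and generate once as a sanity check
model = AutoModelForCausalLM.from_pretrained(
    "./merged_v1", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./merged_v1")

prompt = "Create a task to review the quarterly budget."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))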

Phase 3: Capability Evaluation (25 minutes)

Design Evaluation Suite

EVALUATION_SUITE = {
    "persona": {
        "description": "Evaluate TaskMaster voice preservation",
        "tests": [
            {
                "id": "persona_001",
                "input": "Hello!",
                "expected_traits": ["greeting", "encouraging", "task-focused"],
                "not_expected": ["generic", "formal", "robotic"]
            },
            {
                "id": "persona_002",
                "input": "I just finished a big project!",
                "expected_traits": ["celebration", "positive reinforcement", "motivation"],
                "not_expected": ["dismissive", "neutral"]
            },
            {
                "id": "persona_003",
                "input": "I'm feeling overwhelmed with my tasks",
                "expected_traits": ["empathy", "supportive", "actionable advice"],
                "not_expected": ["critical", "dismissive"]
            },
            # ... 50 persona trait tests
        ],
        "scoring": {
            "method": "trait_presence",
            "threshold": 0.85  # 85% trait match
        }
    },

    "agentic": {
        "description": "Evaluate tool-calling preservation",
        "tests": [
            {
                "id": "agentic_001",
                "input": "Create a task to review the budget",
                "expected_tool": "create_task",
                "expected_args": {"title": "review the budget"}
            },
            {
                "id": "agentic_002",
                "input": "What tasks do I have due this week?",
                "expected_tool": "list_tasks",
                "expected_args": {"due_before": ".*"}  # Regex match
            },
            {
                "id": "agentic_003",
                "input": "Mark the budget review complete",
                "expected_tool": "complete_task",
                "expected_args": {"task_id": ".*"}
            },
            # ... 50 tool-calling tests
        ],
        "scoring": {
            "method": "exact_match",
            "threshold": 0.90  # 90% tool accuracy
        }
    },

    "combined": {
        "description": "Evaluate persona + agentic working together",
        "tests": [
            {
                "id": "combined_001",
                "input": "I need help creating a task for my big presentation next week",
                "expected_tool": "create_task",
                "expected_traits": ["encouraging", "supportive"],
                "notes": "Should call tool AND respond with TaskMaster voice"
            },
            {
                "id": "combined_002",
                "input": "What should I focus on today?",
                "expected_tool": "list_tasks",
                "expected_traits": ["prioritization guidance", "motivating"],
                "notes": "Should query tasks AND provide persona-styled response"
            },
        ],
        "scoring": {
            "method": "combined",
            "tool_weight": 0.6,
            "persona_weight": 0.4,
            "threshold": 0.85
        }
    }
}
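
The trait_presence scoring method is only named in the suite above. One minimal way to implement it is keyword matching against each expected trait; a production version would more likely use an LLM judge. The keyword table here is illustrative, not taken from the adapters:

# Illustrative keyword proxies for TaskMaster traits (hypothetical, for this sketch only)
TRAIT_KEYWORDS = {
    "encouraging": ["you've got this", "great", "nice work", "keep going"],
    "empathy": ["that sounds", "understandable", "it's okay"],
    "task-focused": ["task", "next step", "priority"],
}

def trait_presence_score(response: str, expected_traits: list[str]) -> float:
    """Fraction of expected traits whose keyword proxies appear in the response."""
    text = response.lower()
    hits = sum(
        any(kw in text for kw in TRAIT_KEYWORDS.get(trait, []))
        for trait in expected_traits
    )
    return hits / max(len(expected_traits), 1)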

Run Evaluation

from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate_merged_model(model_path: str, suite: dict) -> dict:
    """Comprehensive evaluation of merged model."""

    # Load model
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    results = {}

    # Evaluate persona
    print("Evaluating persona preservation...")
    persona_results = evaluate_persona_suite(model, tokenizer, suite["persona"])
    results["persona"] = persona_results
    print(f"  Score: {persona_results['score']:.2%}")

    # Evaluate agentic
    print("Evaluating agentic preservation...")
    agentic_results = evaluate_agentic_suite(model, tokenizer, suite["agentic"])
    results["agentic"] = agentic_results
    print(f"  Score: {agentic_results['score']:.2%}")

    # Evaluate combined
    print("Evaluating combined capability...")
    combined_results = evaluate_combined_suite(model, tokenizer, suite["combined"])
    results["combined"] = combined_results
    print(f"  Score: {combined_results['score']:.2%}")

    # Overall assessment
    results["overall"] = {
        "persona_pass": persona_results["score"] >= suite["persona"]["scoring"]["threshold"],
        "agentic_pass": agentic_results["score"] >= suite["agentic"]["scoring"]["threshold"],
        "combined_pass": combined_results["score"] >= suite["combined"]["scoring"]["threshold"],
    }
    results["overall"]["all_pass"] = all(results["overall"].values())

    return results

# Evaluate baseline merge
results_v1 = evaluate_merged_model("./merged_v1", EVALUATION_SUITE)

Output:

Evaluating persona preservation...
Score: 82.00%
Evaluating agentic preservation...
Score: 94.50%
Evaluating combined capability...
Score: 83.20%
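
The helpers evaluate_persona_suite, evaluate_agentic_suite, and evaluate_combined_suite carry over from earlier chapters and aren't reproduced here. If you need to reconstruct one, here is a minimal sketch of the agentic variant; it assumes the model emits a single JSON object of the form {"tool": "...", "arguments": {...}} and treats expected argument values as regex patterns:

import json
import re
import torch

def evaluate_agentic_suite(model, tokenizer, suite: dict) -> dict:
    """Sketch: score tool-call tests by tool-name match and per-argument regex match."""
    failures = []
    for test in suite["tests"]:
        inputs = tokenizer(test["input"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=128)
        # Keep only the newly generated tokens
        response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

        try:
            call = json.loads(response.strip())  # assumed format: {"tool": ..., "arguments": {...}}
            tool_ok = call.get("tool") == test["expected_tool"]
            args_ok = all(
                re.fullmatch(str(pattern), str(call.get("arguments", {}).get(key, "")))
                for key, pattern in test["expected_args"].items()
            )
            ok = tool_ok and args_ok
        except (json.JSONDecodeError, re.error):
            ok = False

        if not ok:
            failures.append({"id": test["id"], "reason": response[:80]})

    score = 1 - len(failures) / max(len(suite["tests"]), 1)
    return {"score": score, "failures": failures}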

Analyze Results

def print_evaluation_report(results: dict, version: str):
    """Print formatted evaluation report."""

    print("=" * 60)
    print(f"MERGED MODEL EVALUATION - {version}")
    print("=" * 60)

    for category in ["persona", "agentic", "combined"]:
        cat_results = results[category]
        status = "PASS" if results["overall"][f"{category}_pass"] else "FAIL"
        print(f"{category.upper():15} {cat_results['score']:.1%} [{status}]")

        # Show failures
        if cat_results.get("failures"):
            print(f"  Failed cases ({len(cat_results['failures'])}):")
            for fail in cat_results["failures"][:3]:
                print(f"    - {fail['id']}: {fail['reason'][:50]}")

    print("=" * 60)
    overall_status = "PASS" if results["overall"]["all_pass"] else "NEEDS IMPROVEMENT"
    print(f"OVERALL: {overall_status}")

print_evaluation_report(results_v1, "v1 (TIES default)")

Output:

============================================================
MERGED MODEL EVALUATION - v1 (TIES default)
============================================================
PERSONA         82.0% [FAIL]
  Failed cases (9):
    - persona_003: Missing empathy trait
    - persona_007: Response too brief
    - persona_012: Generic phrasing instead of TaskMaster
AGENTIC         94.5% [PASS]
COMBINED        83.2% [FAIL]
  Failed cases (8):
    - combined_001: Tool called but response lacked persona
    - combined_005: Persona present but no tool call
============================================================
OVERALL: NEEDS IMPROVEMENT

Persona preservation is below threshold. Let's tune.
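
Before changing the merge, it helps to check whether the persona failures share a cause. A quick tally over the failures list returned by the evaluation (field names as used in the report above):

from collections import Counter

# Group persona failures by reported reason to see what's actually breaking
reasons = Counter(fail["reason"] for fail in results_v1["persona"]["failures"])
for reason, count in reasons.most_common():
    print(f"{count:2d}x {reason}")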

Phase 4: Parameter Optimization (15 minutes)

Hypothesis: Persona Needs Higher Weight

The agentic adapter might be dominating. Try adjusting weights:

# merge_config_v2.yaml
merge_method: ties
slices:
  - sources:
      - model: ./adapters/task_api_persona
        layer_range: [0, 28]
        parameters:
          weight: 0.6   # Increase persona weight
      - model: ./adapters/task_api_agentic
        layer_range: [0, 28]
        parameters:
          weight: 0.4   # Decrease agentic weight
parameters:
  density: 0.5
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16

Execute and Evaluate

mergekit-yaml merge_config_v2.yaml ./merged_v2 --lazy --low-cpu-mem

results_v2 = evaluate_merged_model("./merged_v2", EVALUATION_SUITE)
print_evaluation_report(results_v2, "v2 (persona=0.6, agentic=0.4)")

Output:

============================================================
MERGED MODEL EVALUATION - v2 (persona=0.6, agentic=0.4)
============================================================
PERSONA 87.5% [PASS]
AGENTIC 91.2% [PASS]
COMBINED 86.8% [PASS]
============================================================
OVERALL: PASS

Weight adjustment fixed persona preservation while maintaining agentic quality.
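
If one manual adjustment hadn't been enough, the same loop can be automated. A sketch of a small weight sweep, assuming the config is templated in Python and each merge is run through the mergekit CLI exactly as above (the helper name write_ties_config and the grid values are illustrative):

import subprocess
import yaml

def write_ties_config(path: str, persona_weight: float, density: float = 0.5) -> None:
    """Template the TIES config from Phase 2 with a given persona/agentic weight split."""
    cfg = {
        "merge_method": "ties",
        "slices": [{"sources": [
            {"model": "./adapters/task_api_persona", "layer_range": [0, 28],
             "parameters": {"weight": persona_weight}},
            {"model": "./adapters/task_api_agentic", "layer_range": [0, 28],
             "parameters": {"weight": round(1 - persona_weight, 2)}},
        ]}],
        "parameters": {"density": density},
        "base_model": "unsloth/Llama-3.2-3B-Instruct",
        "dtype": "float16",
    }
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f)

sweep_results = {}
for w in [0.5, 0.55, 0.6, 0.65]:
    config_path = f"merge_config_w{w}.yaml"
    output_dir = f"./merged_w{w}"
    write_ties_config(config_path, persona_weight=w)
    subprocess.run(["mergekit-yaml", config_path, output_dir, "--lazy", "--low-cpu-mem"], check=True)
    sweep_results[w] = evaluate_merged_model(output_dir, EVALUATION_SUITE)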

Try DARE-TIES for Compression

Can we achieve similar quality while keeping far fewer of the adapters' delta parameters?

# merge_config_v3.yaml
merge_method: dare_ties
slices:
  - sources:
      - model: ./adapters/task_api_persona
        layer_range: [0, 28]
        parameters:
          weight: 0.6
      - model: ./adapters/task_api_agentic
        layer_range: [0, 28]
        parameters:
          weight: 0.4
parameters:
  density: 0.3   # Keep only 30% of parameters
  rescale: true
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16

mergekit-yaml merge_config_v3.yaml ./merged_v3 --lazy --low-cpu-mem

results_v3 = evaluate_merged_model("./merged_v3", EVALUATION_SUITE)
print_evaluation_report(results_v3, "v3 (DARE-TIES density=0.3)")

Output:

============================================================
MERGED MODEL EVALUATION - v3 (DARE-TIES density=0.3)
============================================================
PERSONA 84.2% [FAIL]
AGENTIC 88.5% [FAIL]
COMBINED 82.1% [FAIL]
============================================================
OVERALL: NEEDS IMPROVEMENT

DARE-TIES with 70% drop is too aggressive. Try density=0.4:
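
The v4 configuration isn't shown above; assuming it mirrors merge_config_v3.yaml with only the density raised, it would look like this (run with mergekit-yaml merge_config_v4.yaml ./merged_v4 --lazy --low-cpu-mem as before):

# merge_config_v4.yaml  (assumed: identical to merge_config_v3.yaml except density)
merge_method: dare_ties
slices:
  - sources:
      - model: ./adapters/task_api_persona
        layer_range: [0, 28]
        parameters:
          weight: 0.6
      - model: ./adapters/task_api_agentic
        layer_range: [0, 28]
        parameters:
          weight: 0.4
parameters:
  density: 0.4   # Keep 40% of parameters
  rescale: true
base_model: unsloth/Llama-3.2-3B-Instruct
dtype: float16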

# Quick test with density=0.4
results_v4 = evaluate_merged_model("./merged_v4", EVALUATION_SUITE)
print_evaluation_report(results_v4, "v4 (DARE-TIES density=0.4)")

Output:

============================================================
MERGED MODEL EVALUATION - v4 (DARE-TIES density=0.4)
============================================================
PERSONA 86.1% [PASS]
AGENTIC 90.8% [PASS]
COMBINED 85.2% [PASS]
============================================================
OVERALL: PASS

DARE-TIES with density=0.4 drops 60% of each adapter's deltas and still passes every quality gate.

Phase 5: Final Selection and Packaging (10 minutes)

Compare All Versions

def compare_versions(results_dict: dict):
    """Compare all merge versions."""

    print("=" * 70)
    print(f"{'Version':<25} {'Persona':<12} {'Agentic':<12} {'Combined':<12} {'Status'}")
    print("=" * 70)

    for version, results in results_dict.items():
        persona = f"{results['persona']['score']:.1%}"
        agentic = f"{results['agentic']['score']:.1%}"
        combined = f"{results['combined']['score']:.1%}"
        status = "PASS" if results['overall']['all_pass'] else "FAIL"
        print(f"{version:<25} {persona:<12} {agentic:<12} {combined:<12} {status}")

    print("=" * 70)

compare_versions({
    "v1 (TIES default)": results_v1,
    "v2 (TIES weighted)": results_v2,
    "v3 (DARE density=0.3)": results_v3,
    "v4 (DARE density=0.4)": results_v4,
})

Output:

======================================================================
Version                   Persona      Agentic      Combined     Status
======================================================================
v1 (TIES default)         82.0%        94.5%        83.2%        FAIL
v2 (TIES weighted)        87.5%        91.2%        86.8%        PASS
v3 (DARE density=0.3)     84.2%        88.5%        82.1%        FAIL
v4 (DARE density=0.4)     86.1%        90.8%        85.2%        PASS
======================================================================

Select Production Model

Decision matrix:

| Version | Quality | Compression | Recommendation |
|---------|---------|-------------|----------------|
| v2 (TIES) | Highest | None | Best quality |
| v4 (DARE) | Acceptable | 60% of adapter deltas dropped | Best efficiency |

For production: Use v2 for maximum quality. For edge deployment: Use v4 for reduced memory.

Package for Deployment

import shutil
import json

def package_merged_model(
    source_dir: str,
    output_name: str,
    metadata: dict
) -> str:
    """Package merged model with metadata for deployment."""

    output_dir = f"./releases/{output_name}"
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Copy model files
    for f in Path(source_dir).glob("*"):
        if f.is_file():
            shutil.copy(f, output_dir)

    # Add metadata
    metadata_path = Path(output_dir) / "merge_metadata.json"
    with open(metadata_path, "w") as f:
        json.dump(metadata, f, indent=2)

    # Create README
    readme = f"""# {metadata['name']}

{metadata['description']}

## Capabilities
- TaskMaster persona (voice, encouragement, productivity focus)
- Agentic tool-calling (Task API integration)

## Metrics
- Persona preservation: {metadata['metrics']['persona']:.1%}
- Agentic preservation: {metadata['metrics']['agentic']:.1%}
- Combined capability: {metadata['metrics']['combined']:.1%}

## Usage
Load with transformers:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained("{output_dir}")
    tokenizer = AutoTokenizer.from_pretrained("{output_dir}")

## Source Adapters
- Persona: {metadata['sources']['persona']}
- Agentic: {metadata['sources']['agentic']}

## Merge Configuration
- Method: {metadata['merge_config']['method']}
- Weights: {metadata['merge_config']['weights']}
- Density: {metadata['merge_config']['density']}
"""

    with open(Path(output_dir) / "README.md", "w") as f:
        f.write(readme)

    print(f"Packaged model to {output_dir}")
    return output_dir


# Package production model
package_merged_model(
    source_dir="./merged_v2",
    output_name="task-api-unified-v1.0",
    metadata={
        "name": "Task API Unified Model",
        "version": "1.0.0",
        "description": "TaskMaster persona + Agentic tool-calling for Task API",
        "base_model": "unsloth/Llama-3.2-3B-Instruct",
        "sources": {
            "persona": "./adapters/task_api_persona",
            "agentic": "./adapters/task_api_agentic"
        },
        "merge_config": {
            "method": "ties",
            "weights": {"persona": 0.6, "agentic": 0.4},
            "density": 0.5
        },
        "metrics": {
            "persona": 0.875,
            "agentic": 0.912,
            "combined": 0.868
        }
    }
)

Output:

Packaged model to ./releases/task-api-unified-v1.0
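
Before shipping, a quick check that the packaged release is self-contained: the metadata parses, its metrics clear the thresholds from the spec and evaluation suite, and the directory loads on its own. A minimal sketch (paths match the packaging call above):

import json
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer

release_dir = Path("./releases/task-api-unified-v1.0")

# Metadata must parse and report metrics above the thresholds used in this capstone
metadata = json.loads((release_dir / "merge_metadata.json").read_text())
assert metadata["metrics"]["persona"] >= 0.85
assert metadata["metrics"]["agentic"] >= 0.90
assert metadata["metrics"]["combined"] >= 0.85

# The packaged directory must load without referencing the working tree
AutoTokenizer.from_pretrained(release_dir)
AutoModelForCausalLM.from_pretrained(release_dir)
print("Release verified:", metadata["name"], metadata["version"])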

Checkpoint: Production Readiness

| Criterion | Status | Evidence |
|-----------|--------|----------|
| Adapters analyzed | Done | Compatible structure, sign conflicts identified |
| Baseline merge complete | Done | TIES default |
| Evaluation suite designed | Done | Persona, agentic, combined tests |
| Parameter tuning | Done | Weight adjustment (0.6/0.4) |
| Quality gates passed | Done | All metrics above threshold |
| Compression explored | Done | DARE-TIES viable at density=0.4 |
| Model packaged | Done | Release with metadata and README |

Your Task API Digital FTE is production-ready.

Reflect on Your Skill

Your model-merging skill is now complete. Review your capability:

  1. Adapter analysis: Compatibility checking, conflict detection
  2. Strategy selection: TIES vs DARE-TIES decision framework
  3. Parameter tuning: Weight and density optimization
  4. Evaluation design: Multi-capability preservation testing
  5. Production packaging: Metadata, documentation, versioning

This skill is reusable for any model merging project.

Try With AI

Prompt 1: Debug Quality Regression

My merged model passed all tests in isolation but fails in production:
- Users report the persona "feels off" sometimes
- Tool calls occasionally miss when embedded in conversation

My evaluation suite tests each capability separately. What am I missing?

Help me design tests that catch:
1. Context-dependent capability switching
2. Long conversation degradation
3. Edge cases where capabilities interact poorly

What you're learning: Evaluation completeness—understanding gaps between synthetic tests and real usage.

Prompt 2: Optimize for Different Deployment

I have my Task API unified model (3B params, 6GB VRAM).
I need to deploy it in three scenarios:

1. Cloud API: Maximize quality, cost is secondary
2. Edge device: Must fit in 4GB RAM
3. Batch processing: Optimize for throughput, not latency

For each scenario, what would you change about my merge configuration
or post-merge optimization? Walk through the tradeoffs.

What you're learning: Deployment-aware optimization—tailoring models to operational requirements.

Prompt 3: Plan Next Iteration

My v1.0 merged model works but I want v2.0 with improvements:

Current metrics:
- Persona: 87.5%
- Agentic: 91.2%
- Combined: 86.8%

I have ideas for improvement:
1. Add distilled reasoning from GPT-4
2. Retrain persona adapter with more data
3. Add a third "safety" adapter

Help me prioritize and plan:
1. Which improvement has highest ROI?
2. How would merging change with 3 adapters?
3. What risks should I watch for?

What you're learning: Roadmap planning—prioritizing improvements for production systems.

Safety Note

Merged models combine capabilities—including any biases or failure modes from source adapters. Your unified model may exhibit unexpected behaviors when persona and agentic capabilities interact. For production deployment, implement monitoring for:

  • Unusual response patterns
  • Tool calls with persona-style arguments (contamination)
  • Persona responses that should have been tool calls
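
A minimal sketch of such a monitor, using simple heuristics over logged interactions (the tool-call format and the persona markers are illustrative assumptions, not part of the trained adapters):

import json

PERSONA_MARKERS = ["you've got this", "great job", "let's tackle"]  # illustrative

def flag_response(user_input: str, response: str, expected_tool_call: bool) -> list[str]:
    """Return heuristic warning flags for a single logged interaction."""
    flags = []
    looks_like_tool_call = response.lstrip().startswith("{")

    if looks_like_tool_call:
        try:
            call = json.loads(response)
            # Persona-style text leaking into tool arguments (contamination)
            if any(marker in json.dumps(call.get("arguments", {})).lower()
                   for marker in PERSONA_MARKERS):
                flags.append("persona_text_in_tool_args")
        except json.JSONDecodeError:
            flags.append("malformed_tool_call")
    elif expected_tool_call:
        # Persona response where a tool call was expected
        flags.append("missing_tool_call")

    return flags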

Never assume merged model behavior is the simple sum of its parts. Continuous monitoring is essential.