Capstone - Build Task API Agentic Model

You've learned the components: structured output training, function calling patterns, multi-tool orchestration. Now you'll integrate everything into a complete agentic model for Task API—your own Digital FTE for task management.

This capstone is Layer 4: Spec-Driven Integration. You'll work from a specification, compose skills from previous lessons, and produce a production-ready model that can be packaged as a sellable Digital FTE.

The Specification

Task API Agentic Model Specification

Intent: Create a fine-tuned model capable of serving as the reasoning backend for a Task API agent—selecting appropriate tools, extracting arguments from natural language, and orchestrating multi-step workflows.

Success Criteria:

| Metric | Target | Measurement |
| --- | --- | --- |
| Tool selection accuracy | >95% | Correct tool chosen for request |
| Argument extraction accuracy | >90% | All required args correct |
| JSON validity | >99% | Parseable without errors |
| Multi-tool completion | >85% | Chains executed fully |
| Latency (local inference) | <500ms | p95 response time |

Constraints:

  • Base model: Llama-3.2-3B-Instruct (or equivalent 3B-8B model)
  • Training data: 500+ examples covering all patterns
  • Must work with OpenAI Agents SDK tool_calls format
  • No external API dependencies at inference time

Non-Goals:

  • General conversation capability (focus on tool-calling)
  • Multi-language support (English only)
  • Streaming responses (batch is sufficient)

Phase 1: Dataset Assembly (30 minutes)

Aggregate Your Training Data

Combine examples from Lessons 3-6:

import json
from pathlib import Path

def aggregate_datasets(sources: list[str], output: str) -> int:
    """Combine training examples from multiple sources."""
    all_examples = []

    for source in sources:
        with open(source) as f:
            examples = [json.loads(line) for line in f]
        all_examples.extend(examples)
        print(f"  {source}: {len(examples)} examples")

    # Shuffle for training
    import random
    random.shuffle(all_examples)

    # Write combined dataset
    with open(output, "w") as f:
        for ex in all_examples:
            f.write(json.dumps(ex) + "\n")

    print(f"Total: {len(all_examples)} examples -> {output}")
    return len(all_examples)

# Aggregate from all lesson outputs
sources = [
    "ch66_structured_outputs.jsonl",  # Lesson 3: Structured output training
    "ch66_function_calling.jsonl",    # Lesson 4: Task API function calling
    "ch66_multi_tool.jsonl",          # Lesson 6: Multi-tool orchestration
]

total = aggregate_datasets(sources, "task_api_agentic_complete.jsonl")

Output:

  ch66_structured_outputs.jsonl: 200 examples
  ch66_function_calling.jsonl: 250 examples
  ch66_multi_tool.jsonl: 150 examples
Total: 600 examples -> task_api_agentic_complete.jsonl

Validate Dataset Distribution

def analyze_distribution(dataset_path: str) -> dict:
    """Analyze training data distribution."""
    stats = {
        "total": 0,
        "single_tool": 0,
        "multi_tool": 0,
        "tools": {},
        "avg_turns": 0,
    }

    total_turns = 0

    with open(dataset_path) as f:
        for line in f:
            ex = json.loads(line)
            stats["total"] += 1

            # Count tool calls across all messages
            tool_calls = []
            for msg in ex["messages"]:
                if msg.get("tool_calls"):
                    tool_calls.extend(msg["tool_calls"])
                    for tc in msg["tool_calls"]:
                        tool_name = tc["function"]["name"]
                        stats["tools"][tool_name] = stats["tools"].get(tool_name, 0) + 1

            # Categorize (zero or one tool call counts as single-tool)
            if len(tool_calls) <= 1:
                stats["single_tool"] += 1
            else:
                stats["multi_tool"] += 1

            # Count turns
            total_turns += len(ex["messages"])

    stats["avg_turns"] = total_turns / stats["total"]
    return stats

stats = analyze_distribution("task_api_agentic_complete.jsonl")
print("Distribution:")
print(f"  Single-tool: {stats['single_tool']} ({stats['single_tool']/stats['total']*100:.1f}%)")
print(f"  Multi-tool: {stats['multi_tool']} ({stats['multi_tool']/stats['total']*100:.1f}%)")
print(f"  Avg turns: {stats['avg_turns']:.1f}")
print("Tool usage:")
for tool, count in sorted(stats["tools"].items(), key=lambda x: -x[1]):
    print(f"  {tool}: {count}")

Output:

Distribution:
  Single-tool: 450 (75.0%)
  Multi-tool: 150 (25.0%)
  Avg turns: 4.2
Tool usage:
  create_task: 180
  list_tasks: 145
  update_task: 120
  complete_task: 95
  delete_task: 60
  create_project: 45
  get_calendar: 35
  create_reminder: 20

Balance Check

| Aspect | Current | Target | Status |
| --- | --- | --- | --- |
| Single vs multi-tool | 75/25 | 60-80 / 20-40 | OK |
| All tools represented | Yes | Yes | OK |
| Min per tool | 20 | 20+ | OK |

If distribution is imbalanced, generate additional examples for underrepresented patterns.
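
A low-effort way to rebalance is to template extra requests for the least-used tools and run them through the data generator you built in earlier lessons. A rough sketch; the MIN_PER_TOOL threshold and the generate_example helper are placeholders for your own pipeline:

# Rebalance by generating more examples for underrepresented tools.
# `generate_example(tool)` is a placeholder for the generator from earlier lessons.
MIN_PER_TOOL = 40  # illustrative threshold

underrepresented = [tool for tool, count in stats["tools"].items() if count < MIN_PER_TOOL]
for tool in underrepresented:
    needed = MIN_PER_TOOL - stats["tools"][tool]
    print(f"{tool}: generate {needed} more examples")
    # new_examples = [generate_example(tool) for _ in range(needed)]
    # Append them to task_api_agentic_complete.jsonl and re-run analyze_distribution().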

Phase 2: Training Execution (20 minutes active; training runs in the background)

Configure Training

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length=4096,  # Longer context for multi-tool conversations
    dtype=None,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # Higher rank for complex tool-calling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Training configuration optimized for agentic fine-tuning
training_args = TrainingArguments(
    output_dir="./task_api_agent",
    num_train_epochs=4,              # More epochs for structured output
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # Effective batch: 8
    learning_rate=1.5e-5,            # Lower LR for precision
    warmup_ratio=0.1,
    logging_steps=25,
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    fp16=True,
    optim="adamw_8bit",
    seed=42,
)
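
As a quick sanity check on the schedule, you can compute the optimizer steps implied by this configuration before launching. The numbers below assume the roughly 540-example training split produced in the next step:

# Rough schedule sanity check (assumes ~540 training examples after the 90/10 split)
train_examples = 540
effective_batch = 2 * 4                              # batch size * gradient accumulation
steps_per_epoch = train_examples // effective_batch  # ~67
total_steps = steps_per_epoch * 4                    # ~270 across 4 epochs
warmup_steps = int(total_steps * 0.1)                # warmup_ratio=0.1 -> ~27 steps
print(f"{steps_per_epoch} steps/epoch, {total_steps} total, {warmup_steps} warmup")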

Prepare Data Loader

from datasets import load_dataset

def format_for_training(example):
    """Convert to chat format for training."""
    # Unsloth expects a single "text" field rendered with the chat template
    messages = example["messages"]
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": formatted}

# Load dataset
dataset = load_dataset("json", data_files={
    "train": "task_api_agentic_complete.jsonl"
})

# Split for validation
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)

# Format
dataset = dataset.map(format_for_training)

print(f"Train: {len(dataset['train'])} examples")
print(f"Validation: {len(dataset['test'])} examples")

Output:

Train: 540 examples
Validation: 60 examples

Execute Training

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    dataset_text_field="text",
    max_seq_length=4096,
)

# Train
print("Starting agentic fine-tuning...")
trainer.train()

# Save
model.save_pretrained("task_api_agent_final")
tokenizer.save_pretrained("task_api_agent_final")
print("Model saved to task_api_agent_final/")

Output:

Starting agentic fine-tuning...
{'loss': 0.7823, 'learning_rate': 3e-06, 'epoch': 0.25}
{'loss': 0.4521, 'learning_rate': 1.2e-05, 'epoch': 0.5}
{'loss': 0.2156, 'learning_rate': 1.5e-05, 'epoch': 0.75}
{'eval_loss': 0.1823, 'epoch': 1.0}
...
{'loss': 0.0423, 'learning_rate': 3e-06, 'epoch': 3.75}
{'eval_loss': 0.0512, 'epoch': 4.0}
Training complete in 24:18
Model saved to task_api_agent_final/
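
Note that save_pretrained above writes only the LoRA adapters. For the vLLM serving step in Phase 4 you generally want merged full weights in the directory being served. A minimal sketch, assuming Unsloth's save_pretrained_merged helper is available in your installed version:

# Merge the LoRA adapters into 16-bit weights so vLLM can serve the directory
# directly (save_pretrained_merged is assumed to exist in your Unsloth version)
model.save_pretrained_merged(
    "task_api_agent_final",   # same path Phase 4 serves
    tokenizer,
    save_method="merged_16bit",
)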

Phase 3: Evaluation Benchmark (25 minutes)

Design Test Suite

Create comprehensive test cases covering all success criteria:

EVALUATION_SUITE = {
    "tool_selection": [
        {
            "input": "Create a task to review the budget",
            "expected_tool": "create_task",
            "category": "creation"
        },
        {
            "input": "What tasks do I have due this week?",
            "expected_tool": "list_tasks",
            "category": "query"
        },
        {
            "input": "Mark the budget review as done",
            "expected_tool": "complete_task",
            "category": "update"
        },
        {
            "input": "Change the priority of task 123 to high",
            "expected_tool": "update_task",
            "category": "update"
        },
        # ... 50+ test cases covering all tools
    ],

    "argument_extraction": [
        {
            "input": "Create a high-priority task called 'Review Q4' due Friday",
            "expected_tool": "create_task",
            "expected_args": {
                "title": "Review Q4",
                "priority": "high",
                "due_date": "2024-01-19"  # Relative date resolution
            }
        },
        {
            "input": "List my tasks tagged with 'urgent'",
            "expected_tool": "list_tasks",
            "expected_args": {
                "tags": ["urgent"]
            }
        },
        # ... 50+ test cases with varied arguments
    ],

    "multi_tool_chains": [
        {
            "input": "Create a project called Q1 Goals and add a task for budgeting",
            "expected_chain": ["create_project", "create_task"],
            "dependency": {"create_task": {"project_id": "from:create_project"}}
        },
        {
            "input": "What tasks and meetings do I have today?",
            "expected_parallel": ["list_tasks", "get_calendar"]
        },
        # ... 30+ multi-tool scenarios
    ]
}
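
The evaluation loop below leans on a handful of helpers (run_inference, is_valid_json, extract_tool_name, arguments_match, and the multi-turn checks) carried over from earlier lessons. If you don't have them handy, here are minimal sketches of the parsing helpers, assuming the model replies with a single JSON object shaped like {"name": ..., "arguments": {...}}; adapt them to your output format:

import json

def is_valid_json(text: str) -> bool:
    """Return True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def extract_tool_name(text: str) -> str | None:
    """Pull the tool name out of a parsed tool-call object."""
    try:
        return json.loads(text).get("name")
    except json.JSONDecodeError:
        return None

def arguments_match(text: str, expected: dict) -> bool:
    """Check that every expected argument is present with the expected value."""
    try:
        args = json.loads(text).get("arguments", {})
    except json.JSONDecodeError:
        return False
    return all(args.get(k) == v for k, v in expected.items())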

Run Evaluation

def evaluate_model(model_path: str, test_suite: dict) -> dict:
    """Run comprehensive evaluation."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load model
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    results = {
        "tool_selection": {"correct": 0, "total": 0},
        "argument_extraction": {"correct": 0, "total": 0},
        "json_validity": {"valid": 0, "total": 0},
        "multi_tool": {"complete": 0, "total": 0},
        "details": []
    }

    # Tool selection tests
    for test in test_suite["tool_selection"]:
        result = run_inference(model, tokenizer, test["input"])
        results["tool_selection"]["total"] += 1
        results["json_validity"]["total"] += 1

        if is_valid_json(result):
            results["json_validity"]["valid"] += 1
            if extract_tool_name(result) == test["expected_tool"]:
                results["tool_selection"]["correct"] += 1
            else:
                results["details"].append({
                    "test": test["input"],
                    "expected": test["expected_tool"],
                    "got": extract_tool_name(result)
                })

    # Argument extraction tests
    for test in test_suite["argument_extraction"]:
        result = run_inference(model, tokenizer, test["input"])
        results["argument_extraction"]["total"] += 1
        results["json_validity"]["total"] += 1

        if is_valid_json(result):
            results["json_validity"]["valid"] += 1
            if arguments_match(result, test["expected_args"]):
                results["argument_extraction"]["correct"] += 1

    # Multi-tool tests
    for test in test_suite["multi_tool_chains"]:
        result = run_multi_turn(model, tokenizer, test["input"])
        results["multi_tool"]["total"] += 1

        if "expected_chain" in test:
            if chain_complete(result, test["expected_chain"]):
                results["multi_tool"]["complete"] += 1
        elif "expected_parallel" in test:
            if parallel_used(result, test["expected_parallel"]):
                results["multi_tool"]["complete"] += 1

    return results

# Run evaluation
results = evaluate_model("task_api_agent_final", EVALUATION_SUITE)

Evaluation Report

def print_evaluation_report(results: dict, spec_targets: dict):
    """Print formatted evaluation report."""
    print("=" * 60)
    print("TASK API AGENTIC MODEL - EVALUATION REPORT")
    print("=" * 60)

    metrics = [
        ("Tool Selection", results["tool_selection"]["correct"],
         results["tool_selection"]["total"], spec_targets["tool_selection"]),
        ("Argument Extraction", results["argument_extraction"]["correct"],
         results["argument_extraction"]["total"], spec_targets["argument_extraction"]),
        ("JSON Validity", results["json_validity"]["valid"],
         results["json_validity"]["total"], spec_targets["json_validity"]),
        ("Multi-Tool Completion", results["multi_tool"]["complete"],
         results["multi_tool"]["total"], spec_targets["multi_tool"]),
    ]

    all_pass = True
    for name, correct, total, target in metrics:
        rate = correct / total * 100
        status = "PASS" if rate >= target * 100 else "FAIL"
        if status == "FAIL":
            all_pass = False
        print(f"{name:25} {correct:3}/{total:3} ({rate:5.1f}%) Target: {target*100:.0f}% [{status}]")

    print("=" * 60)
    print(f"OVERALL: {'PASS - Ready for deployment' if all_pass else 'FAIL - Needs improvement'}")

    if not all_pass and results["details"]:
        print("\nFailed cases (sample):")
        for detail in results["details"][:5]:
            print(f"  Input: {detail['test'][:50]}...")
            print(f"  Expected: {detail['expected']}, Got: {detail['got']}")

spec_targets = {
    "tool_selection": 0.95,
    "argument_extraction": 0.90,
    "json_validity": 0.99,
    "multi_tool": 0.85
}

print_evaluation_report(results, spec_targets)

Output:

============================================================
TASK API AGENTIC MODEL - EVALUATION REPORT
============================================================
Tool Selection 48/ 50 (96.0%) Target: 95% [PASS]
Argument Extraction 46/ 50 (92.0%) Target: 90% [PASS]
JSON Validity 99/100 (99.0%) Target: 99% [PASS]
Multi-Tool Completion 27/ 30 (90.0%) Target: 85% [PASS]
============================================================
OVERALL: PASS - Ready for deployment

Phase 4: Agent Framework Integration (15 minutes)

Configure as OpenAI Agents SDK Backend

For local models, use a compatibility layer:

from openai import OpenAI
import subprocess
import time

# Start local inference server (vLLM or similar)
def start_inference_server(model_path: str, port: int = 8000):
    """Start a vLLM server with the fine-tuned model."""
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--port", str(port),
        "--dtype", "float16",
    ]
    process = subprocess.Popen(cmd)
    time.sleep(30)  # Wait for server startup
    return process

# Connect OpenAI client to the local server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local"
)

# Verify connection
models = client.models.list()
print(f"Available models: {[m.id for m in models.data]}")

Output:

Available models: ['task_api_agent_final']
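
The spec's latency target isn't covered by the Phase 3 suite, so it's worth a quick check against the running server. A minimal sketch measuring p95 over repeated single-tool requests; the prompt and request count are illustrative:

import time
import statistics

# Rough p95 latency check against the local server (illustrative prompt and count)
latencies = []
for _ in range(50):
    start = time.perf_counter()
    client.chat.completions.create(
        model="task_api_agent_final",
        messages=[{"role": "user", "content": "Create a task to review the budget"}],
        max_tokens=128,
    )
    latencies.append((time.perf_counter() - start) * 1000)

p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile in ms
print(f"p95 latency: {p95:.0f} ms (target: <500 ms)")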

Create Agent with Custom Model

from agents import Agent, Tool, function_tool

# Define Task API tools
@function_tool
def create_task(title: str, priority: str = "medium", due_date: str = None) -> dict:
    """Create a new task in the Task API."""
    # Actual implementation would call the Task API
    return {"task_id": f"task_{hash(title) % 10000}", "title": title, "priority": priority}

@function_tool
def list_tasks(due_before: str = None, priority: str = None) -> list:
    """List tasks from the Task API."""
    # Actual implementation would call the Task API
    return [{"task_id": "task_001", "title": "Sample task", "priority": "high"}]

@function_tool
def complete_task(task_id: str) -> dict:
    """Mark a task as complete."""
    return {"task_id": task_id, "status": "completed"}

# Create agent with the custom model
task_agent = Agent(
    name="TaskMaster Agent",
    model="task_api_agent_final",  # Your fine-tuned model
    instructions="""You are TaskMaster, a productivity assistant.
Use the available tools to manage tasks. Be encouraging and helpful.
Always use tools for task operations - don't just describe what you would do.""",
    tools=[create_task, list_tasks, complete_task],
)
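
Depending on your Agents SDK version, a bare model name may still route requests to the OpenAI API by default. A hedged sketch of pointing the SDK at the local vLLM endpoint instead; set_default_openai_client and set_default_openai_api are assumed to exist in your installed openai-agents version, so check its docs if they don't:

from openai import AsyncOpenAI
from agents import set_default_openai_client, set_default_openai_api

# Route agent model calls to the local vLLM server (assumed SDK helpers)
local_client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")
set_default_openai_client(local_client)
set_default_openai_api("chat_completions")  # vLLM exposes the chat completions API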

Test End-to-End Workflow

from agents import Runner

async def test_agent_workflow():
    """Test the complete agent workflow with the custom model."""
    runner = Runner()

    test_cases = [
        "Create a high-priority task to review the Q4 budget",
        "What tasks do I have right now?",
        "Mark task_001 as complete",
    ]

    for user_input in test_cases:
        print(f"\nUser: {user_input}")
        result = await runner.run(task_agent, user_input)
        print(f"Agent: {result.final_output}")
        print(f"Tools called: {[t.name for t in result.tool_calls]}")

# Run test
import asyncio
asyncio.run(test_agent_workflow())

Output:

User: Create a high-priority task to review the Q4 budget
Agent: I've created a high-priority task "Review the Q4 budget" for you.
Tools called: ['create_task']

User: What tasks do I have right now?
Agent: You have 1 task: "Sample task" (high priority).
Tools called: ['list_tasks']

User: Mark task_001 as complete
Agent: Done! Task task_001 is now marked as complete.
Tools called: ['complete_task']

Checkpoint: Production Readiness Checklist

Before declaring the capstone complete:

| Criterion | Status | Evidence |
| --- | --- | --- |
| Dataset complete (500+ examples) | ✓ | 600 examples aggregated |
| Distribution balanced | ✓ | 75/25 single/multi-tool |
| Training converged | ✓ | Loss < 0.1, no divergence |
| Tool selection >95% | ✓ | 96.0% in evaluation |
| Argument extraction >90% | ✓ | 92.0% in evaluation |
| JSON validity >99% | ✓ | 99.0% in evaluation |
| Multi-tool completion >85% | ✓ | 90.0% in evaluation |
| Agent framework integration | ✓ | OpenAI SDK compatible |
| Latency acceptable | ✓ | <500ms p95 |

All criteria met: Model is production-ready.

Packaging as Digital FTE

Your agentic model is now a Digital FTE component. To monetize:

Option 1: API Service

# docker-compose.yml for hosted deployment
services:
  task-api-agent:
    image: vllm/vllm-openai:latest
    command: --model /models/task_api_agent_final
    volumes:
      - ./task_api_agent_final:/models/task_api_agent_final
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Option 2: Skill Package

Bundle as a reusable skill for other agent builders:

# task_api_agent_skill.yaml
name: task-api-agentic-backend
version: 1.0.0
description: Fine-tuned model for Task API tool-calling
model:
  path: ./task_api_agent_final
  format: safetensors
  base: Llama-3.2-3B-Instruct
capabilities:
  - tool-calling
  - multi-tool-orchestration
  - json-structured-output
tools_supported:
  - create_task
  - list_tasks
  - update_task
  - complete_task
  - delete_task
  - create_project
metrics:
  tool_selection: 0.96
  argument_extraction: 0.92
  json_validity: 0.99
  multi_tool_completion: 0.90
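
A consumer of this package can check the published metrics against their own spec before wiring the model in. A minimal sketch, assuming the manifest is saved as task_api_agent_skill.yaml and PyYAML is installed; the thresholds are illustrative:

import yaml

# Validate packaged metrics against a consumer's spec targets (illustrative thresholds)
with open("task_api_agent_skill.yaml") as f:
    manifest = yaml.safe_load(f)

required = {"tool_selection": 0.95, "argument_extraction": 0.90,
            "json_validity": 0.99, "multi_tool_completion": 0.85}

for metric, threshold in required.items():
    value = manifest["metrics"][metric]
    status = "OK" if value >= threshold else "BELOW TARGET"
    print(f"{metric}: {value:.2f} (>= {threshold:.2f}) [{status}]")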

Reflect on Your Skill

Your agentic-tuning skill is now complete. Review what you've built:

  1. Data generation patterns: Structured output, function calling, multi-tool chains
  2. Training configuration: Hyperparameters optimized for agentic tasks
  3. Evaluation framework: Metrics covering all agentic capabilities
  4. Integration patterns: Framework compatibility and deployment options

This skill is reusable for any tool-calling model you build in the future.

Try With AI

Prompt 1: Diagnose Evaluation Failures

My agentic model passed overall but has specific failure patterns:

Tool Selection: 96% (PASS)
Argument Extraction: 92% (PASS)
JSON Validity: 99% (PASS)
Multi-Tool Completion: 72% (FAIL - target 85%)

The multi-tool failures are mostly in 3+ tool chains. Two-tool chains work fine.

Help me diagnose:
1. What's likely wrong with my training data?
2. What specific examples should I add?
3. How do I validate the fix before full retraining?

What you're learning: Iterative improvement—using evaluation results to guide targeted fixes.

Prompt 2: Optimize for Production

My Task API agentic model works but inference is slow:
- Current: 800ms p95 latency
- Target: <500ms p95

My setup:
- Model: Llama-3.2-3B with LoRA merged
- Hardware: RTX 4090
- Server: vLLM

What optimization options should I explore?
Consider quantization, batching, prompt caching, and model distillation.
Help me create a testing plan to find the best latency/quality tradeoff.

What you're learning: Production optimization—balancing quality against operational requirements.

Prompt 3: Extend to New Domain

I've built a Task API agentic model. Now I want to apply the same approach
to a different domain: Customer Support with tools like:
- lookup_customer(email)
- get_order_history(customer_id)
- create_ticket(customer_id, issue)
- send_response(ticket_id, message)

Walk me through adapting my agentic-tuning skill:
1. What dataset patterns transfer directly?
2. What new patterns do I need for support-specific scenarios?
3. What evaluation metrics should change?
4. How can I reuse my existing codebase?

What you're learning: Pattern transfer—applying agentic tuning methodology to new domains.

Safety Note

Your agentic model can execute real operations when connected to production APIs. Before deployment, implement rate limiting, confirmation flows for destructive operations, and audit logging. A model that reliably calls delete_task is powerful—ensure that power is constrained by appropriate guardrails. Never deploy an agentic model with write access to production systems without human-in-the-loop confirmation for high-risk operations.
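
One way to implement the confirmation requirement is a thin gate around destructive tools, applied before the function is registered with the agent. A minimal sketch; the confirm callback and the delete_task reference are placeholders for your own approval flow and tool implementation:

def require_confirmation(tool_fn, confirm):
    """Wrap a destructive tool so it only runs after explicit human approval.

    `confirm` is a placeholder callback (CLI prompt, approval UI, ticket queue).
    Apply the wrapper before registering the function as an agent tool.
    """
    def gated(*args, **kwargs):
        if not confirm(f"Allow {tool_fn.__name__} with args={args} kwargs={kwargs}?"):
            return {"status": "rejected", "reason": "human approval denied"}
        return tool_fn(*args, **kwargs)
    return gated

# Example (hypothetical): gate delete_task behind a terminal prompt
# delete_task = require_confirmation(delete_task, lambda msg: input(msg + " [y/N] ") == "y")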