Capstone: Build a Task API Agentic Model
You've learned the components: structured output training, function calling patterns, multi-tool orchestration. Now you'll integrate everything into a complete agentic model for Task API—your own Digital FTE for task management.
This capstone is Layer 4: Spec-Driven Integration. You'll work from a specification, compose skills from previous lessons, and produce a production-ready model that can be packaged as a sellable Digital FTE.
The Specification
Task API Agentic Model Specification
Intent: Create a fine-tuned model capable of serving as the reasoning backend for a Task API agent—selecting appropriate tools, extracting arguments from natural language, and orchestrating multi-step workflows.
Success Criteria:
| Metric | Target | Measurement |
|---|---|---|
| Tool selection accuracy | >95% | Correct tool chosen for request |
| Argument extraction accuracy | >90% | All required args correct |
| JSON validity | >99% | Parseable without errors |
| Multi-tool completion | >85% | Chains executed fully |
| Latency (local inference) | <500ms | p95 response time |
Constraints:
- Base model: Llama-3.2-3B-Instruct (or equivalent 3B-8B model)
- Training data: 500+ examples covering all patterns
- Must work with OpenAI Agents SDK tool_calls format
- No external API dependencies at inference time
Non-Goals:
- General conversation capability (focus on tool-calling)
- Multi-language support (English only)
- Streaming responses (batch is sufficient)
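For reference, every training example uses the OpenAI chat format, with tool calls attached to assistant messages and tool results returned in tool-role messages. A minimal single-tool record, written here as a Python dict (the system prompt wording, call id, and argument values are illustrative):
# Illustrative single-tool training record in the tool_calls format
example = {
    "messages": [
        {"role": "system", "content": "You are TaskMaster, a Task API assistant."},
        {"role": "user", "content": "Add a task to send the invoice by Friday"},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_001", "type": "function", "function": {
                "name": "create_task",
                "arguments": "{\"title\": \"Send the invoice\", \"due_date\": \"2024-01-19\"}"
            }}
        ]},
        {"role": "tool", "tool_call_id": "call_001",
         "content": "{\"task_id\": \"task_42\", \"status\": \"created\"}"},
        {"role": "assistant", "content": "Done! I created \"Send the invoice\", due Friday."}
    ]
}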
Phase 1: Dataset Assembly (30 minutes)
Aggregate Your Training Data
Combine the examples you generated in Lessons 3, 4, and 6:
import json
from pathlib import Path
def aggregate_datasets(sources: list[str], output: str) -> int:
"""Combine training examples from multiple sources."""
all_examples = []
for source in sources:
with open(source) as f:
examples = [json.loads(line) for line in f]
all_examples.extend(examples)
print(f" {source}: {len(examples)} examples")
# Shuffle for training
import random
random.shuffle(all_examples)
# Write combined dataset
with open(output, "w") as f:
for ex in all_examples:
f.write(json.dumps(ex) + "\n")
print(f"Total: {len(all_examples)} examples -> {output}")
return len(all_examples)
# Aggregate from all lesson outputs
sources = [
"ch66_structured_outputs.jsonl", # Lesson 3: Structured output training
"ch66_function_calling.jsonl", # Lesson 4: Task API function calling
"ch66_multi_tool.jsonl", # Lesson 6: Multi-tool orchestration
]
total = aggregate_datasets(sources, "task_api_agentic_complete.jsonl")
Output:
ch66_structured_outputs.jsonl: 200 examples
ch66_function_calling.jsonl: 250 examples
ch66_multi_tool.jsonl: 150 examples
Total: 600 examples -> task_api_agentic_complete.jsonl
Validate Dataset Distribution
def analyze_distribution(dataset_path: str) -> dict:
"""Analyze training data distribution."""
stats = {
"total": 0,
"single_tool": 0,
"multi_tool": 0,
"tools": {},
"avg_turns": 0
}
total_turns = 0
with open(dataset_path) as f:
for line in f:
ex = json.loads(line)
stats["total"] += 1
# Count tool calls
tool_calls = []
for msg in ex["messages"]:
if msg.get("tool_calls"):
tool_calls.extend(msg["tool_calls"])
for tc in msg["tool_calls"]:
tool_name = tc["function"]["name"]
stats["tools"][tool_name] = stats["tools"].get(tool_name, 0) + 1
            # Categorize: one tool call = single-tool, more than one = multi-tool
            if len(tool_calls) == 1:
                stats["single_tool"] += 1
            elif len(tool_calls) > 1:
                stats["multi_tool"] += 1
# Count turns
total_turns += len(ex["messages"])
stats["avg_turns"] = total_turns / stats["total"]
return stats
stats = analyze_distribution("task_api_agentic_complete.jsonl")
print(f"Distribution:")
print(f" Single-tool: {stats['single_tool']} ({stats['single_tool']/stats['total']*100:.1f}%)")
print(f" Multi-tool: {stats['multi_tool']} ({stats['multi_tool']/stats['total']*100:.1f}%)")
print(f" Avg turns: {stats['avg_turns']:.1f}")
print(f"Tool usage:")
for tool, count in sorted(stats["tools"].items(), key=lambda x: -x[1]):
print(f" {tool}: {count}")
Output:
Distribution:
Single-tool: 450 (75.0%)
Multi-tool: 150 (25.0%)
Avg turns: 4.2
Tool usage:
create_task: 180
list_tasks: 145
update_task: 120
complete_task: 95
delete_task: 60
create_project: 45
get_calendar: 35
create_reminder: 20
Balance Check
| Aspect | Current | Target | Status |
|---|---|---|---|
| Single vs multi-tool split | 75% / 25% | 60-80% / 20-40% | OK |
| All tools represented | Yes | Yes | OK |
| Min per tool | 20 | 20+ | OK |
If the distribution is imbalanced, generate additional examples for the underrepresented patterns; a quick way to find the gaps is sketched below.
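A minimal sketch, reusing the stats dict from analyze_distribution (the 20-example threshold mirrors the table above; the helper name is ours):
def find_underrepresented(stats: dict, min_per_tool: int = 20) -> list[str]:
    """Return tool names whose training coverage falls below the minimum."""
    return [tool for tool, count in stats["tools"].items() if count < min_per_tool]

gaps = find_underrepresented(stats, min_per_tool=20)
if gaps:
    print(f"Generate more examples for: {', '.join(gaps)}")
else:
    print("All tools meet the minimum coverage target.")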
Phase 2: Training Execution (20 minutes active, training runs in background)
Configure Training
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
max_seq_length=4096, # Longer for multi-tool conversations
dtype=None,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=32, # Higher rank for complex tool-calling
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
)
# Training configuration optimized for agentic fine-tuning
training_args = TrainingArguments(
output_dir="./task_api_agent",
num_train_epochs=4, # More epochs for structured output
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch: 8
learning_rate=1.5e-5, # Lower LR for precision
warmup_ratio=0.1,
logging_steps=25,
save_steps=100,
eval_strategy="steps",
eval_steps=100,
fp16=True,
optim="adamw_8bit",
seed=42,
)
Prepare Data Loader
from datasets import load_dataset
def format_for_training(example):
"""Convert to chat format for training."""
    # Render the conversation with the model's chat template so tool-call turns are serialized consistently
messages = example["messages"]
formatted = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False
)
return {"text": formatted}
# Load dataset
dataset = load_dataset("json", data_files={
"train": "task_api_agentic_complete.jsonl"
})
# Split for validation
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
# Format
dataset = dataset.map(format_for_training)
print(f"Train: {len(dataset['train'])} examples")
print(f"Validation: {len(dataset['test'])} examples")
Output:
Train: 540 examples
Validation: 60 examples
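Before launching training, it helps to spot-check one rendered example and confirm the chat template serialized the tool-call turns rather than dropping them. A quick sanity check (purely illustrative):
# Spot-check: the rendered text should contain the tool-call turns, not just prose
sample = dataset["train"][0]["text"]
print(sample[:400])

# Fail fast if any example rendered to an empty string
empty = sum(1 for ex in dataset["train"] if not ex["text"].strip())
print(f"Empty renders: {empty}")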
Execute Training
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
args=training_args,
dataset_text_field="text",
max_seq_length=4096,
)
# Train
print("Starting agentic fine-tuning...")
trainer.train()
# Save
model.save_pretrained("task_api_agent_final")
tokenizer.save_pretrained("task_api_agent_final")
print("Model saved to task_api_agent_final/")
Output:
Starting agentic fine-tuning...
{'loss': 0.7823, 'learning_rate': 3e-06, 'epoch': 0.25}
{'loss': 0.4521, 'learning_rate': 1.2e-05, 'epoch': 0.5}
{'loss': 0.2156, 'learning_rate': 1.5e-05, 'epoch': 0.75}
{'eval_loss': 0.1823, 'epoch': 1.0}
...
{'loss': 0.0423, 'learning_rate': 3e-06, 'epoch': 3.75}
{'eval_loss': 0.0512, 'epoch': 4.0}
Training complete in 24:18
Model saved to task_api_agent_final/
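Note: on a PEFT model, save_pretrained typically writes only the LoRA adapter weights. The standalone evaluation in Phase 3 (AutoModelForCausalLM) and the vLLM server in Phase 4 both expect full model weights, so merge the adapters first. A minimal sketch using Unsloth's merged-save helper; check the method name and arguments against your installed Unsloth version:
# Merge the LoRA adapters into the base weights so the directory can be loaded
# directly by AutoModelForCausalLM and served by vLLM
model.save_pretrained_merged(
    "task_api_agent_final",
    tokenizer,
    save_method="merged_16bit",
)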
Phase 3: Evaluation Benchmark (25 minutes)
Design Test Suite
Create comprehensive test cases covering all success criteria:
EVALUATION_SUITE = {
"tool_selection": [
{
"input": "Create a task to review the budget",
"expected_tool": "create_task",
"category": "creation"
},
{
"input": "What tasks do I have due this week?",
"expected_tool": "list_tasks",
"category": "query"
},
{
"input": "Mark the budget review as done",
"expected_tool": "complete_task",
"category": "update"
},
{
"input": "Change the priority of task 123 to high",
"expected_tool": "update_task",
"category": "update"
},
# ... 50+ test cases covering all tools
],
"argument_extraction": [
{
"input": "Create a high-priority task called 'Review Q4' due Friday",
"expected_tool": "create_task",
"expected_args": {
"title": "Review Q4",
"priority": "high",
"due_date": "2024-01-19" # Relative date resolution
}
},
{
"input": "List my tasks tagged with 'urgent'",
"expected_tool": "list_tasks",
"expected_args": {
"tags": ["urgent"]
}
},
# ... 50+ test cases with varied arguments
],
"multi_tool_chains": [
{
"input": "Create a project called Q1 Goals and add a task for budgeting",
"expected_chain": ["create_project", "create_task"],
"dependency": {"create_task": {"project_id": "from:create_project"}}
},
{
"input": "What tasks and meetings do I have today?",
"expected_parallel": ["list_tasks", "get_calendar"]
},
# ... 30+ multi-tool scenarios
]
}
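Relative dates like "due Friday" only have a deterministic expected value if the evaluation fixes what "today" is, for example by stating the current date in the system prompt. A small helper for authoring expected arguments, assuming a pinned reference date:
from datetime import date, timedelta

def next_weekday(today: date, weekday: int) -> str:
    """Resolve a weekday reference ('Friday' = 4) to an ISO date from a fixed 'today'."""
    days_ahead = (weekday - today.weekday()) % 7 or 7  # same-day references roll to next week
    return (today + timedelta(days=days_ahead)).isoformat()

# With "today" pinned to Monday 2024-01-15, "Friday" resolves to 2024-01-19
print(next_weekday(date(2024, 1, 15), weekday=4))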
Run Evaluation
def evaluate_model(model_path: str, test_suite: dict) -> dict:
"""Run comprehensive evaluation."""
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
results = {
"tool_selection": {"correct": 0, "total": 0},
"argument_extraction": {"correct": 0, "total": 0},
"json_validity": {"valid": 0, "total": 0},
"multi_tool": {"complete": 0, "total": 0},
"details": []
}
# Tool selection tests
for test in test_suite["tool_selection"]:
result = run_inference(model, tokenizer, test["input"])
results["tool_selection"]["total"] += 1
results["json_validity"]["total"] += 1
if is_valid_json(result):
results["json_validity"]["valid"] += 1
if extract_tool_name(result) == test["expected_tool"]:
results["tool_selection"]["correct"] += 1
else:
results["details"].append({
"test": test["input"],
"expected": test["expected_tool"],
"got": extract_tool_name(result)
})
# Argument extraction tests
for test in test_suite["argument_extraction"]:
result = run_inference(model, tokenizer, test["input"])
results["argument_extraction"]["total"] += 1
results["json_validity"]["total"] += 1
if is_valid_json(result):
results["json_validity"]["valid"] += 1
if arguments_match(result, test["expected_args"]):
results["argument_extraction"]["correct"] += 1
# Multi-tool tests
for test in test_suite["multi_tool_chains"]:
result = run_multi_turn(model, tokenizer, test["input"])
results["multi_tool"]["total"] += 1
if "expected_chain" in test:
if chain_complete(result, test["expected_chain"]):
results["multi_tool"]["complete"] += 1
elif "expected_parallel" in test:
if parallel_used(result, test["expected_parallel"]):
results["multi_tool"]["complete"] += 1
return results
# Run evaluation
results = evaluate_model("task_api_agent_final", EVALUATION_SUITE)
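evaluate_model leans on helpers that are not shown: run_inference, run_multi_turn, and the parsing checks. The inference helpers depend on your serving setup, but the parsing checks are straightforward. A minimal sketch, assuming the model emits a single JSON tool call of the form {"name": ..., "arguments": {...}}:
import json

def is_valid_json(output: str) -> bool:
    """True if the raw model output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def extract_tool_name(output: str) -> str | None:
    """Return the tool name from a parsed tool call, or None if unparseable."""
    if not is_valid_json(output):
        return None
    return json.loads(output).get("name")

def arguments_match(output: str, expected_args: dict) -> bool:
    """Every expected argument must be present with exactly the expected value."""
    if not is_valid_json(output):
        return False
    args = json.loads(output).get("arguments", {})
    return all(args.get(key) == value for key, value in expected_args.items())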
Evaluation Report
def print_evaluation_report(results: dict, spec_targets: dict):
"""Print formatted evaluation report."""
print("=" * 60)
print("TASK API AGENTIC MODEL - EVALUATION REPORT")
print("=" * 60)
metrics = [
("Tool Selection", results["tool_selection"]["correct"],
results["tool_selection"]["total"], spec_targets["tool_selection"]),
("Argument Extraction", results["argument_extraction"]["correct"],
results["argument_extraction"]["total"], spec_targets["argument_extraction"]),
("JSON Validity", results["json_validity"]["valid"],
results["json_validity"]["total"], spec_targets["json_validity"]),
("Multi-Tool Completion", results["multi_tool"]["complete"],
results["multi_tool"]["total"], spec_targets["multi_tool"]),
]
all_pass = True
for name, correct, total, target in metrics:
rate = correct / total * 100
status = "PASS" if rate >= target * 100 else "FAIL"
if status == "FAIL":
all_pass = False
print(f"{name:25} {correct:3}/{total:3} ({rate:5.1f}%) Target: {target*100:.0f}% [{status}]")
print("=" * 60)
print(f"OVERALL: {'PASS - Ready for deployment' if all_pass else 'FAIL - Needs improvement'}")
if not all_pass and results["details"]:
print("\nFailed cases (sample):")
for detail in results["details"][:5]:
print(f" Input: {detail['test'][:50]}...")
print(f" Expected: {detail['expected']}, Got: {detail['got']}")
spec_targets = {
"tool_selection": 0.95,
"argument_extraction": 0.90,
"json_validity": 0.99,
"multi_tool": 0.85
}
print_evaluation_report(results, spec_targets)
Output:
============================================================
TASK API AGENTIC MODEL - EVALUATION REPORT
============================================================
Tool Selection 48/ 50 (96.0%) Target: 95% [PASS]
Argument Extraction 46/ 50 (92.0%) Target: 90% [PASS]
JSON Validity 99/100 (99.0%) Target: 99% [PASS]
Multi-Tool Completion 27/ 30 (90.0%) Target: 85% [PASS]
============================================================
OVERALL: PASS - Ready for deployment
Phase 4: Agent Framework Integration (15 minutes)
Configure as OpenAI Agents SDK Backend
For local models, use a compatibility layer:
from openai import OpenAI
import subprocess
import time
# Start local inference server (vLLM or similar)
def start_inference_server(model_path: str, port: int = 8000):
"""Start vLLM server with the fine-tuned model."""
cmd = [
"python", "-m", "vllm.entrypoints.openai.api_server",
"--model", model_path,
"--port", str(port),
"--dtype", "float16",
]
process = subprocess.Popen(cmd)
time.sleep(30) # Wait for server startup
return process
# Launch the server with the fine-tuned model, then connect an OpenAI-compatible client to it
server_process = start_inference_server("task_api_agent_final")
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed-for-local"
)
# Verify connection
models = client.models.list()
print(f"Available models: {[m.id for m in models.data]}")
Output:
Available models: ['task_api_agent_final']
Create Agent with Custom Model
from agents import Agent, Tool, function_tool
# Define Task API tools
@function_tool
def create_task(title: str, priority: str = "medium", due_date: str | None = None) -> dict:
"""Create a new task in the Task API."""
# Actual implementation would call Task API
return {"task_id": f"task_{hash(title) % 10000}", "title": title, "priority": priority}
@function_tool
def list_tasks(due_before: str | None = None, priority: str | None = None) -> list:
"""List tasks from the Task API."""
# Actual implementation would call Task API
return [{"task_id": "task_001", "title": "Sample task", "priority": "high"}]
@function_tool
def complete_task(task_id: str) -> dict:
"""Mark a task as complete."""
return {"task_id": task_id, "status": "completed"}
# Create agent with custom model
task_agent = Agent(
name="TaskMaster Agent",
model="task_api_agent_final", # Your fine-tuned model
instructions="""You are TaskMaster, a productivity assistant.
Use the available tools to manage tasks. Be encouraging and helpful.
Always use tools for task operations - don't just describe what you would do.""",
tools=[create_task, list_tasks, complete_task],
)
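Depending on your Agents SDK version, a bare model string may still route requests to the hosted OpenAI API rather than your local server. One way to bind the agent to the vLLM endpoint is to pass the client explicitly; the class and helper names below come from the openai-agents package, so verify them against your installed version:
from agents import OpenAIChatCompletionsModel, set_default_openai_client, set_tracing_disabled

# Route all model calls through the local vLLM server and skip hosted tracing
set_default_openai_client(client)
set_tracing_disabled(True)

local_model = OpenAIChatCompletionsModel(
    model="task_api_agent_final",
    openai_client=client,
)

# Pass local_model as the `model` argument when constructing the Agent above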
Test End-to-End Workflow
from agents import Runner
async def test_agent_workflow():
    """Test the complete agent workflow with the custom model."""
test_cases = [
"Create a high-priority task to review the Q4 budget",
"What tasks do I have right now?",
"Mark task_001 as complete",
]
for user_input in test_cases:
print(f"\nUser: {user_input}")
        result = await Runner.run(task_agent, user_input)
print(f"Agent: {result.final_output}")
print(f"Tools called: {[t.name for t in result.tool_calls]}")
# Run test
import asyncio
asyncio.run(test_agent_workflow())
Output:
User: Create a high-priority task to review the Q4 budget
Agent: I've created a high-priority task "Review the Q4 budget" for you.
Tools called: ['create_task']
User: What tasks do I have right now?
Agent: You have 1 task: "Sample task" (high priority).
Tools called: ['list_tasks']
User: Mark task_001 as complete
Agent: Done! Task task_001 is now marked as complete.
Tools called: ['complete_task']
Checkpoint: Production Readiness Checklist
Before declaring the capstone complete:
| Criterion | Status | Evidence |
|---|---|---|
| Dataset complete (500+ examples) | Check | 600 examples aggregated |
| Distribution balanced | Check | 75/25 single/multi-tool |
| Training converged | Check | Loss < 0.1, no divergence |
| Tool selection >95% | Check | 96.0% in evaluation |
| Argument extraction >90% | Check | 92.0% in evaluation |
| JSON validity >99% | Check | 99.0% in evaluation |
| Multi-tool completion >85% | Check | 90.0% in evaluation |
| Agent framework integration | Check | OpenAI SDK compatible |
| Latency acceptable | Check | <500ms p95 |
All criteria met: Model is production-ready.
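The latency row should be backed by a measurement rather than assumed. A quick p95 check against the local server from Phase 4 (the prompt set, request count, and max_tokens are arbitrary choices):
import time

def measure_p95_latency(client, prompts: list[str], model: str) -> float:
    """Return p95 end-to-end latency in milliseconds across the given prompts."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return latencies[int(len(latencies) * 0.95) - 1]

p95 = measure_p95_latency(client, ["Create a task to review the budget"] * 20,
                          model="task_api_agent_final")
print(f"p95 latency: {p95:.0f} ms")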
Packaging as Digital FTE
Your agentic model is now a Digital FTE component. To monetize:
Option 1: API Service
# docker-compose.yml for hosted deployment
services:
task-api-agent:
image: vllm/vllm-openai:latest
command: --model /models/task_api_agent_final
volumes:
- ./task_api_agent_final:/models/task_api_agent_final
ports:
- "8000:8000"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
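From a buyer's perspective, the hosted service is just an OpenAI-compatible endpoint. A minimal client call (the hostname, TLS termination, and API-key handling are assumptions about your deployment):
from openai import OpenAI

# Point any OpenAI-compatible client at the hosted Digital FTE endpoint
client = OpenAI(base_url="https://tasks.example.com/v1", api_key="customer-api-key")

response = client.chat.completions.create(
    model="task_api_agent_final",
    messages=[{"role": "user", "content": "Create a task to send the Q1 invoice"}],
)
print(response.choices[0].message)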
Option 2: Skill Package
Bundle as a reusable skill for other agent builders:
# task_api_agent_skill.yaml
name: task-api-agentic-backend
version: 1.0.0
description: Fine-tuned model for Task API tool-calling
model:
path: ./task_api_agent_final
format: safetensors
base: Llama-3.2-3B-Instruct
capabilities:
- tool-calling
- multi-tool-orchestration
- json-structured-output
tools_supported:
- create_task
- list_tasks
- update_task
- complete_task
- delete_task
- create_project
metrics:
tool_selection: 0.96
argument_extraction: 0.92
json_validity: 0.99
multi_tool_completion: 0.90
Reflect on Your Skill
Your agentic-tuning skill is now complete. Review what you've built:
- Data generation patterns: Structured output, function calling, multi-tool chains
- Training configuration: Hyperparameters optimized for agentic tasks
- Evaluation framework: Metrics covering all agentic capabilities
- Integration patterns: Framework compatibility and deployment options
This skill is reusable for any tool-calling model you build in the future.
Try With AI
Prompt 1: Diagnose Evaluation Failures
My agentic model passes most metrics, but one shows a clear failure pattern:
Tool Selection: 96% (PASS)
Argument Extraction: 92% (PASS)
JSON Validity: 99% (PASS)
Multi-Tool Completion: 72% (FAIL - target 85%)
The multi-tool failures are mostly in 3+ tool chains. Two-tool chains work fine.
Help me diagnose:
1. What's likely wrong with my training data?
2. What specific examples should I add?
3. How do I validate the fix before full retraining?
What you're learning: Iterative improvement—using evaluation results to guide targeted fixes.
Prompt 2: Optimize for Production
My Task API agentic model works but inference is slow:
- Current: 800ms p95 latency
- Target: <500ms p95
My setup:
- Model: Llama-3.2-3B with LoRA merged
- Hardware: RTX 4090
- Server: vLLM
What optimization options should I explore?
Consider quantization, batching, prompt caching, and model distillation.
Help me create a testing plan to find the best latency/quality tradeoff.
What you're learning: Production optimization—balancing quality against operational requirements.
Prompt 3: Extend to New Domain
I've built a Task API agentic model. Now I want to apply the same approach
to a different domain: Customer Support with tools like:
- lookup_customer(email)
- get_order_history(customer_id)
- create_ticket(customer_id, issue)
- send_response(ticket_id, message)
Walk me through adapting my agentic-tuning skill:
1. What dataset patterns transfer directly?
2. What new patterns do I need for support-specific scenarios?
3. What evaluation metrics should change?
4. How can I reuse my existing codebase?
What you're learning: Pattern transfer—applying agentic tuning methodology to new domains.
Safety Note
Your agentic model can execute real operations when connected to production APIs. Before deployment, implement rate limiting, confirmation flows for destructive operations, and audit logging. A model that reliably calls delete_task is powerful—ensure that power is constrained by appropriate guardrails. Never deploy an agentic model with write access to production systems without human-in-the-loop confirmation for high-risk operations.