Capstone: Build a Task API Agentic Model
You've learned the components: structured output training, function calling patterns, multi-tool orchestration. Now you'll integrate everything into a complete agentic model for Task API—your own Digital FTE for task management.
This capstone is Layer 4: Spec-Driven Integration. You'll work from a specification, compose skills from previous lessons, and produce a production-ready model that can be packaged as a sellable Digital FTE.
The Specification
Task API Agentic Model Specification
Intent: Create a fine-tuned model capable of serving as the reasoning backend for a Task API agent—selecting appropriate tools, extracting arguments from natural language, and orchestrating multi-step workflows.
Success Criteria:
| Metric | Target | Measurement |
|---|---|---|
| Tool selection accuracy | >95% | Correct tool chosen for request |
| Argument extraction accuracy | >90% | All required args correct |
| JSON validity | >99% | Parseable without errors |
| Multi-tool completion | >85% | Chains executed fully |
| Latency (local inference) | <500ms | p95 response time |
Constraints:
- Base model: Llama-3.2-3B-Instruct (or equivalent 3B-8B model)
- Training data: 500+ examples covering all patterns
- Must work with OpenAI Agents SDK tool_calls format
- No external API dependencies at inference time
Non-Goals:
- General conversation capability (focus on tool-calling)
- Multi-language support (English only)
- Streaming responses (batch is sufficient)
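For reference, every training example uses the OpenAI chat format, with tool calls attached to assistant messages and tool results returned in tool-role messages. A minimal single-tool record, written here as a Python dict (the system prompt wording, call id, and argument values are illustrative):
# Illustrative single-tool training record in the tool_calls format
example = {
    "messages": [
        {"role": "system", "content": "You are TaskMaster, a Task API assistant."},
        {"role": "user", "content": "Add a task to send the invoice by Friday"},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_001", "type": "function", "function": {
                "name": "create_task",
                "arguments": "{\"title\": \"Send the invoice\", \"due_date\": \"2024-01-19\"}"
            }}
        ]},
        {"role": "tool", "tool_call_id": "call_001",
         "content": "{\"task_id\": \"task_42\", \"status\": \"created\"}"},
        {"role": "assistant", "content": "Done! I created \"Send the invoice\", due Friday."}
    ]
}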
Phase 1: Dataset Assembly (30 minutes)
Aggregate Your Training Data
Combine the examples you generated in Lessons 3, 4, and 6:
import json
from pathlib import Path
def aggregate_datasets(sources: list[str], output: str) -> int:
"""Combine training examples from multiple sources."""
all_examples = []
for source in sources:
with open(source) as f:
examples = [json.loads(line) for line in f]
all_examples.extend(examples)
print(f" {source}: {len(examples)} examples")
# Shuffle for training
import random
random.shuffle(all_examples)
# Write combined dataset
with open(output, "w") as f:
for ex in all_examples:
f.write(json.dumps(ex) + "\n")
print(f"Total: {len(all_examples)} examples -> {output}")
return len(all_examples)
# Aggregate from all lesson outputs
sources = [
"ch66_structured_outputs.jsonl", # Lesson 3: Structured output training
"ch66_function_calling.jsonl", # Lesson 4: Task API function calling
"ch66_multi_tool.jsonl", # Lesson 6: Multi-tool orchestration
]
total = aggregate_datasets(sources, "task_api_agentic_complete.jsonl")
Output:
ch66_structured_outputs.jsonl: 200 examples
ch66_function_calling.jsonl: 250 examples
ch66_multi_tool.jsonl: 150 examples
Total: 600 examples -> task_api_agentic_complete.jsonl
Validate Dataset Distribution
def analyze_distribution(dataset_path: str) -> dict:
"""Analyze training data distribution."""
stats = {
"total": 0,
"single_tool": 0,
"multi_tool": 0,
"tools": {},
"avg_turns": 0
}
total_turns = 0
with open(dataset_path) as f:
for line in f:
ex = json.loads(line)
stats["total"] += 1
# Count tool calls
tool_calls = []
for msg in ex["messages"]:
if msg.get("tool_calls"):
tool_calls.extend(msg["tool_calls"])
for tc in msg["tool_calls"]:
tool_name = tc["function"]["name"]
stats["tools"][tool_name] = stats["tools"].get(tool_name, 0) + 1
            # Categorize: one tool call = single-tool, more than one = multi-tool
            if len(tool_calls) == 1:
                stats["single_tool"] += 1
            elif len(tool_calls) > 1:
                stats["multi_tool"] += 1
# Count turns
total_turns += len(ex["messages"])
stats["avg_turns"] = total_turns / stats["total"]
return stats
stats = analyze_distribution("task_api_agentic_complete.jsonl")
print(f"Distribution:")
print(f" Single-tool: {stats['single_tool']} ({stats['single_tool']/stats['total']*100:.1f}%)")
print(f" Multi-tool: {stats['multi_tool']} ({stats['multi_tool']/stats['total']*100:.1f}%)")
print(f" Avg turns: {stats['avg_turns']:.1f}")
print(f"Tool usage:")
for tool, count in sorted(stats["tools"].items(), key=lambda x: -x[1]):
print(f" {tool}: {count}")
Output:
Distribution:
Single-tool: 450 (75.0%)
Multi-tool: 150 (25.0%)
Avg turns: 4.2
Tool usage:
create_task: 180
list_tasks: 145
update_task: 120
complete_task: 95
delete_task: 60
create_project: 45
get_calendar: 35
create_reminder: 20
Balance Check
| Aspect | Current | Target | Status |
|---|---|---|---|
| Single vs multi-tool split | 75% / 25% | 60-80% / 20-40% | OK |
| All tools represented | Yes | Yes | OK |
| Min per tool | 20 | 20+ | OK |
If the distribution is imbalanced, generate additional examples for the underrepresented patterns; a quick way to find the gaps is sketched below.
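A minimal sketch, reusing the stats dict from analyze_distribution (the 20-example threshold mirrors the table above; the helper name is ours):
def find_underrepresented(stats: dict, min_per_tool: int = 20) -> list[str]:
    """Return tool names whose training coverage falls below the minimum."""
    return [tool for tool, count in stats["tools"].items() if count < min_per_tool]

gaps = find_underrepresented(stats, min_per_tool=20)
if gaps:
    print(f"Generate more examples for: {', '.join(gaps)}")
else:
    print("All tools meet the minimum coverage target.")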
Phase 2: Training Execution (20 minutes active, training runs in background)
Configure Training
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
max_seq_length=4096, # Longer for multi-tool conversations
dtype=None,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=32, # Higher rank for complex tool-calling
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
)
# Training configuration optimized for agentic fine-tuning
training_args = TrainingArguments(
output_dir="./task_api_agent",
num_train_epochs=4, # More epochs for structured output
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch: 8
learning_rate=1.5e-5, # Lower LR for precision
warmup_ratio=0.1,
logging_steps=25,
save_steps=100,
eval_strategy="steps",
eval_steps=100,
fp16=True,
optim="adamw_8bit",
seed=42,
)
Prepare Data Loader
from datasets import load_dataset
def format_for_training(example):
"""Convert to chat format for training."""
    # Render the conversation with the model's chat template so tool-call turns are serialized consistently
messages = example["messages"]
formatted = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=False
)
return {"text": formatted}
# Load dataset
dataset = load_dataset("json", data_files={
"train": "task_api_agentic_complete.jsonl"
})
# Split for validation
dataset = dataset["train"].train_test_split(test_size=0.1, seed=42)
# Format
dataset = dataset.map(format_for_training)
print(f"Train: {len(dataset['train'])} examples")
print(f"Validation: {len(dataset['test'])} examples")
Output:
Train: 540 examples
Validation: 60 examples
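Before launching training, it helps to spot-check one rendered example and confirm the chat template serialized the tool-call turns rather than dropping them. A quick sanity check (purely illustrative):
# Spot-check: the rendered text should contain the tool-call turns, not just prose
sample = dataset["train"][0]["text"]
print(sample[:400])

# Fail fast if any example rendered to an empty string
empty = sum(1 for ex in dataset["train"] if not ex["text"].strip())
print(f"Empty renders: {empty}")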
Execute Training
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
args=training_args,
dataset_text_field="text",
max_seq_length=4096,
)
# Train
print("Starting agentic fine-tuning...")
trainer.train()
# Save
model.save_pretrained("task_api_agent_final")
tokenizer.save_pretrained("task_api_agent_final")
print("Model saved to task_api_agent_final/")
Output:
Starting agentic fine-tuning...
{'loss': 0.7823, 'learning_rate': 3e-06, 'epoch': 0.25}
{'loss': 0.4521, 'learning_rate': 1.2e-05, 'epoch': 0.5}
{'loss': 0.2156, 'learning_rate': 1.5e-05, 'epoch': 0.75}
{'eval_loss': 0.1823, 'epoch': 1.0}
...
{'loss': 0.0423, 'learning_rate': 3e-06, 'epoch': 3.75}
{'eval_loss': 0.0512, 'epoch': 4.0}
Training complete in 24:18
Model saved to task_api_agent_final/
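Note: on a PEFT model, save_pretrained typically writes only the LoRA adapter weights. The standalone evaluation in Phase 3 (AutoModelForCausalLM) and the vLLM server in Phase 4 both expect full model weights, so merge the adapters first. A minimal sketch using Unsloth's merged-save helper; check the method name and arguments against your installed Unsloth version:
# Merge the LoRA adapters into the base weights so the directory can be loaded
# directly by AutoModelForCausalLM and served by vLLM
model.save_pretrained_merged(
    "task_api_agent_final",
    tokenizer,
    save_method="merged_16bit",
)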
Phase 3: Evaluation Benchmark (25 minutes)
Design Test Suite
Create comprehensive test cases covering all success criteria:
EVALUATION_SUITE = {
"tool_selection": [
{
"input": "Create a task to review the budget",
"expected_tool": "create_task",
"category": "creation"
},
{
"input": "What tasks do I have due this week?",
"expected_tool": "list_tasks",
"category": "query"
},
{
"input": "Mark the budget review as done",
"expected_tool": "complete_task",
"category": "update"
},
{
"input": "Change the priority of task 123 to high",
"expected_tool": "update_task",
"category": "update"
},
# ... 50+ test cases covering all tools
],
"argument_extraction": [
{
"input": "Create a high-priority task called 'Review Q4' due Friday",
"expected_tool": "create_task",
"expected_args": {
"title": "Review Q4",
"priority": "high",
"due_date": "2024-01-19" # Relative date resolution
}
},
{
"input": "List my tasks tagged with 'urgent'",
"expected_tool": "list_tasks",
"expected_args": {
"tags": ["urgent"]
}
},
# ... 50+ test cases with varied arguments
],
"multi_tool_chains": [
{
"input": "Create a project called Q1 Goals and add a task for budgeting",
"expected_chain": ["create_project", "create_task"],
"dependency": {"create_task": {"project_id": "from:create_project"}}
},
{
"input": "What tasks and meetings do I have today?",
"expected_parallel": ["list_tasks", "get_calendar"]
},
# ... 30+ multi-tool scenarios
]
}
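Relative dates like "due Friday" only have a deterministic expected value if the evaluation fixes what "today" is, for example by stating the current date in the system prompt. A small helper for authoring expected arguments, assuming a pinned reference date:
from datetime import date, timedelta

def next_weekday(today: date, weekday: int) -> str:
    """Resolve a weekday reference ('Friday' = 4) to an ISO date from a fixed 'today'."""
    days_ahead = (weekday - today.weekday()) % 7 or 7  # same-day references roll to next week
    return (today + timedelta(days=days_ahead)).isoformat()

# With "today" pinned to Monday 2024-01-15, "Friday" resolves to 2024-01-19
print(next_weekday(date(2024, 1, 15), weekday=4))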
Run Evaluation
def evaluate_model(model_path: str, test_suite: dict) -> dict:
"""Run comprehensive evaluation."""
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
results = {
"tool_selection": {"correct": 0, "total": 0},
"argument_extraction": {"correct": 0, "total": 0},
"json_validity": {"valid": 0, "total": 0},
"multi_tool": {"complete": 0, "total": 0},
"details": []
}
# Tool selection tests
for test in test_suite["tool_selection"]:
result = run_inference(model, tokenizer, test["input"])
results["tool_selection"]["total"] += 1
results["json_validity"]["total"] += 1
if is_valid_json(result):
results["json_validity"]["valid"] += 1
if extract_tool_name(result) == test["expected_tool"]:
results["tool_selection"]["correct"] += 1
else:
results["details"].append({
"test": test["input"],
"expected": test["expected_tool"],
"got": extract_tool_name(result)
})
# Argument extraction tests
for test in test_suite["argument_extraction"]:
result = run_inference(model, tokenizer, test["input"])
results["argument_extraction"]["total"] += 1
results["json_validity"]["total"] += 1
if is_valid_json(result):
results["json_validity"]["valid"] += 1
if arguments_match(result, test["expected_args"]):
results["argument_extraction"]["correct"] += 1
# Multi-tool tests
for test in test_suite["multi_tool_chains"]:
result = run_multi_turn(model, tokenizer, test["input"])
results["multi_tool"]["total"] += 1
if "expected_chain" in test:
if chain_complete(result, test["expected_chain"]):
results["multi_tool"]["complete"] += 1
elif "expected_parallel" in test:
if parallel_used(result, test["expected_parallel"]):
results["multi_tool"]["complete"] += 1
return results
# Run evaluation
results = evaluate_model("task_api_agent_final", EVALUATION_SUITE)
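evaluate_model leans on helpers that are not shown: run_inference, run_multi_turn, and the parsing checks. The inference helpers depend on your serving setup, but the parsing checks are straightforward. A minimal sketch, assuming the model emits a single JSON tool call of the form {"name": ..., "arguments": {...}}:
import json

def is_valid_json(output: str) -> bool:
    """True if the raw model output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def extract_tool_name(output: str) -> str | None:
    """Return the tool name from a parsed tool call, or None if unparseable."""
    if not is_valid_json(output):
        return None
    return json.loads(output).get("name")

def arguments_match(output: str, expected_args: dict) -> bool:
    """Every expected argument must be present with exactly the expected value."""
    if not is_valid_json(output):
        return False
    args = json.loads(output).get("arguments", {})
    return all(args.get(key) == value for key, value in expected_args.items())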
Evaluation Report
def print_evaluation_report(results: dict, spec_targets: dict):
"""Print formatted evaluation report."""
print("=" * 60)
print("TASK API AGENTIC MODEL - EVALUATION REPORT")
print("=" * 60)
metrics = [
("Tool Selection", results["tool_selection"]["correct"],
results["tool_selection"]["total"], spec_targets["tool_selection"]),
("Argument Extraction", results["argument_extraction"]["correct"],
results["argument_extraction"]["total"], spec_targets["argument_extraction"]),
("JSON Validity", results["json_validity"]["valid"],
results["json_validity"]["total"], spec_targets["json_validity"]),
("Multi-Tool Completion", results["multi_tool"]["complete"],
results["multi_tool"]["total"], spec_targets["multi_tool"]),
]
all_pass = True
for name, correct, total, target in metrics:
rate = correct / total * 100
status = "PASS" if rate >= target * 100 else "FAIL"
if status == "FAIL":
all_pass = False
print(f"{name:25} {correct:3}/{total:3} ({rate:5.1f}%) Target: {target*100:.0f}% [{status}]")
print("=" * 60)
print(f"OVERALL: {'PASS - Ready for deployment' if all_pass else 'FAIL - Needs improvement'}")
if not all_pass and results["details"]:
print("\nFailed cases (sample):")
for detail in results["details"][:5]:
print(f" Input: {detail['test'][:50]}...")
print(f" Expected: {detail['expected']}, Got: {detail['got']}")
spec_targets = {
"tool_selection": 0.95,
"argument_extraction": 0.90,
"json_validity": 0.99,
"multi_tool": 0.85
}
print_evaluation_report(results, spec_targets)
Output:
============================================================
TASK API AGENTIC MODEL - EVALUATION REPORT
============================================================
Tool Selection 48/ 50 (96.0%) Target: 95% [PASS]
Argument Extraction 46/ 50 (92.0%) Target: 90% [PASS]
JSON Validity 99/100 (99.0%) Target: 99% [PASS]
Multi-Tool Completion 27/ 30 (90.0%) Target: 85% [PASS]
============================================================
OVERALL: PASS - Ready for deployment
Phase 4: Agent Framework Integration (15 minutes)
Configure as OpenAI Agents SDK Backend
For local models, use a compatibility layer:
from openai import OpenAI
import subprocess
import time
# Start local inference server (vLLM or similar)
def start_inference_server(model_path: str, port: int = 8000):
"""Start vLLM server with the fine-tuned model."""
cmd = [
"python", "-m", "vllm.entrypoints.openai.api_server",
"--model", model_path,
"--port", str(port),
"--dtype", "float16",
]
process = subprocess.Popen(cmd)
time.sleep(30) # Wait for server startup
return process
# Launch the server with the fine-tuned model, then connect an OpenAI-compatible client to it
server_process = start_inference_server("task_api_agent_final")
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed-for-local"
)
# Verify connection
models = client.models.list()
print(f"Available models: {[m.id for m in models.data]}")
Output:
Available models: ['task_api_agent_final']
Create Agent with Custom Model
from agents import Agent, Tool, function_tool
# Define Task API tools
@function_tool
def create_task(title: str, priority: str = "medium", due_date: str | None = None) -> dict:
"""Create a new task in the Task API."""
# Actual implementation would call Task API
return {"task_id": f"task_{hash(title) % 10000}", "title": title, "priority": priority}
@function_tool
def list_tasks(due_before: str | None = None, priority: str | None = None) -> list:
"""List tasks from the Task API."""
# Actual implementation would call Task API
return [{"task_id": "task_001", "title": "Sample task", "priority": "high"}]
@function_tool
def complete_task(task_id: str) -> dict:
"""Mark a task as complete."""
return {"task_id": task_id, "status": "completed"}
# Create agent with custom model
task_agent = Agent(
name="TaskMaster Agent",
model="task_api_agent_final", # Your fine-tuned model
instructions="""You are TaskMaster, a productivity assistant.
Use the available tools to manage tasks. Be encouraging and helpful.
Always use tools for task operations - don't just describe what you would do.""",
tools=[create_task, list_tasks, complete_task],
)
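Depending on your Agents SDK version, a bare model string may still route requests to the hosted OpenAI API rather than your local server. One way to bind the agent to the vLLM endpoint is to pass the client explicitly; the class and helper names below come from the openai-agents package, so verify them against your installed version:
from agents import OpenAIChatCompletionsModel, set_default_openai_client, set_tracing_disabled

# Route all model calls through the local vLLM server and skip hosted tracing
set_default_openai_client(client)
set_tracing_disabled(True)

local_model = OpenAIChatCompletionsModel(
    model="task_api_agent_final",
    openai_client=client,
)

# Pass local_model as the `model` argument when constructing the Agent above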
Test End-to-End Workflow
from agents import Runner
async def test_agent_workflow():
    """Test the complete agent workflow with the custom model."""
test_cases = [
"Create a high-priority task to review the Q4 budget",
"What tasks do I have right now?",
"Mark task_001 as complete",
]
for user_input in test_cases:
print(f"\nUser: {user_input}")
        result = await Runner.run(task_agent, user_input)
print(f"Agent: {result.final_output}")
print(f"Tools called: {[t.name for t in result.tool_calls]}")
# Run test
import asyncio
asyncio.run(test_agent_workflow())
Output:
User: Create a high-priority task to review the Q4 budget
Agent: I've created a high-priority task "Review the Q4 budget" for you.
Tools called: ['create_task']
User: What tasks do I have right now?
Agent: You have 1 task: "Sample task" (high priority).
Tools called: ['list_tasks']
User: Mark task_001 as complete
Agent: Done! Task task_001 is now marked as complete.
Tools called: ['complete_task']
Checkpoint: Production Readiness Checklist
Before declaring the capstone complete:
| Criterion | Status | Evidence |
|---|---|---|
| Dataset complete (500+ examples) | Check | 600 examples aggregated |
| Distribution balanced | Check | 75/25 single/multi-tool |
| Training converged | Check | Loss < 0.1, no divergence |
| Tool selection >95% | Check | 96.0% in evaluation |
| Argument extraction >90% | Check | 92.0% in evaluation |
| JSON validity >99% | Check | 99.0% in evaluation |
| Multi-tool completion >85% | Check | 90.0% in evaluation |
| Agent framework integration | Check | OpenAI SDK compatible |
| Latency acceptable | Check | <500ms p95 |
All criteria met: Model is production-ready.
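The latency row should be backed by a measurement rather than assumed. A quick p95 check against the local server from Phase 4 (the prompt set, request count, and max_tokens are arbitrary choices):
import time

def measure_p95_latency(client, prompts: list[str], model: str) -> float:
    """Return p95 end-to-end latency in milliseconds across the given prompts."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return latencies[int(len(latencies) * 0.95) - 1]

p95 = measure_p95_latency(client, ["Create a task to review the budget"] * 20,
                          model="task_api_agent_final")
print(f"p95 latency: {p95:.0f} ms")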
Packaging as Digital FTE
Your agentic model is now a Digital FTE component. To monetize:
Option 1: API Service
# docker-compose.yml for hosted deployment
services:
task-api-agent:
image: vllm/vllm-openai:latest
command: --model /models/task_api_agent_final
volumes:
- ./task_api_agent_final:/models/task_api_agent_final
ports:
- "8000:8000"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
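From a buyer's perspective, the hosted service is just an OpenAI-compatible endpoint. A minimal client call (the hostname, TLS termination, and API-key handling are assumptions about your deployment):
from openai import OpenAI

# Point any OpenAI-compatible client at the hosted Digital FTE endpoint
client = OpenAI(base_url="https://tasks.example.com/v1", api_key="customer-api-key")

response = client.chat.completions.create(
    model="task_api_agent_final",
    messages=[{"role": "user", "content": "Create a task to send the Q1 invoice"}],
)
print(response.choices[0].message)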
Option 2: Skill Package
Bundle as a reusable skill for other agent builders:
# task_api_agent_skill.yaml
name: task-api-agentic-backend
version: 1.0.0
description: Fine-tuned model for Task API tool-calling
model:
path: ./task_api_agent_final
format: safetensors
base: Llama-3.2-3B-Instruct
capabilities:
- tool-calling
- multi-tool-orchestration
- json-structured-output
tools_supported:
- create_task
- list_tasks
- update_task
- complete_task
- delete_task
- create_project
metrics:
tool_selection: 0.96
argument_extraction: 0.92
json_validity: 0.99
multi_tool_completion: 0.90
Reflect on Your Skill
Your agentic-tuning skill is now complete. Review what you've built:
- Data generation patterns: Structured output, function calling, multi-tool chains
- Training configuration: Hyperparameters optimized for agentic tasks
- Evaluation framework: Metrics covering all agentic capabilities
- Integration patterns: Framework compatibility and deployment options
This skill is reusable for any tool-calling model you build in the future.
Try With AI
Prompt 1: Diagnose Evaluation Failures
My agentic model passes most metrics, but one shows a clear failure pattern:
Tool Selection: 96% (PASS)
Argument Extraction: 92% (PASS)
JSON Validity: 99% (PASS)
Multi-Tool Completion: 72% (FAIL - target 85%)
The multi-tool failures are mostly in 3+ tool chains. Two-tool chains work fine.
Help me diagnose:
1. What's likely wrong with my training data?
2. What specific examples should I add?
3. How do I validate the fix before full retraining?
What you're learning: Iterative improvement—using evaluation results to guide targeted fixes.
Prompt 2: Optimize for Production
My Task API agentic model works but inference is slow:
- Current: 800ms p95 latency
- Target: <500ms p95
My setup:
- Model: Llama-3.2-3B with LoRA merged
- Hardware: RTX 4090
- Server: vLLM
What optimization options should I explore?
Consider quantization, batching, prompt caching, and model distillation.
Help me create a testing plan to find the best latency/quality tradeoff.
What you're learning: Production optimization—balancing quality against operational requirements.
Prompt 3: Extend to New Domain
I've built a Task API agentic model. Now I want to apply the same approach
to a different domain: Customer Support with tools like:
- lookup_customer(email)
- get_order_history(customer_id)
- create_ticket(customer_id, issue)
- send_response(ticket_id, message)
Walk me through adapting my agentic-tuning skill:
1. What dataset patterns transfer directly?
2. What new patterns do I need for support-specific scenarios?
3. What evaluation metrics should change?
4. How can I reuse my existing codebase?
What you're learning: Pattern transfer—applying agentic tuning methodology to new domains.
Safety Note
Your agentic model can execute real operations when connected to production APIs. Before deployment, implement rate limiting, confirmation flows for destructive operations, and audit logging. A model that reliably calls delete_task is powerful—ensure that power is constrained by appropriate guardrails. Never deploy an agentic model with write access to production systems without human-in-the-loop confirmation for high-risk operations.