Custom Models as Agent Backends

Your Task API model runs via Ollama. But how does it become the brain of an agent system? This lesson explains how custom models integrate into agent architectures as reasoning engines.

Understanding this architecture is essential before you configure proxies, SDKs, and tool calling in the following lessons.

The Agent Architecture

An agent system has three core components:

┌─────────────────────────────────────────────────────────┐
│                      AGENT SYSTEM                       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────┐     ┌──────────────┐     ┌─────────┐  │
│  │    TOOLS     │────▶│  REASONING   │────▶│ OUTPUT  │  │
│  │              │     │    ENGINE    │     │         │  │
│  │  - APIs      │◀────│              │◀────│ Actions │  │
│  │  - Functions │     │ (LLM Model)  │     │ Results │  │
│  │  - MCP       │     └──────┬───────┘     └─────────┘  │
│  └──────────────┘            │                          │
│                              │                          │
│                 ┌────────────▼────────────┐             │
│                 │      MEMORY/STATE       │             │
│                 │  Conversation History   │             │
│                 └─────────────────────────┘             │
│                                                         │
└─────────────────────────────────────────────────────────┘
| Component        | Role                                        | Your Custom Model's Job                |
|------------------|---------------------------------------------|----------------------------------------|
| Reasoning Engine | Decides what to do, interprets results      | This is where your Task API model runs |
| Tools            | Execute actions (API calls, functions, MCP) | Model tells agent which tool to call   |
| Memory           | Maintains context across turns              | Model receives history as context      |
| Output           | Final response or action result             | Model generates structured output      |

Your custom model replaces the default reasoning engine. Instead of GPT-4 or Claude deciding what actions to take, your fine-tuned Task API model makes those decisions.

Why Replace the Default Model?

Foundation models like GPT-4 and Claude are excellent general-purpose reasoners. Why would you swap them for a custom model?

Cost Reduction

| Model                  | Cost per 1M Tokens          | Monthly Cost (1M requests) |
|------------------------|-----------------------------|----------------------------|
| GPT-4o                 | $5.00 input / $15.00 output | ~$10,000                   |
| GPT-4o-mini            | $0.15 input / $0.60 output  | ~$375                      |
| Custom (Ollama Local)  | $0                          | Hardware only              |
| Custom (Cloud)         | ~$0.10 - $0.50              | ~$100 - $500               |

For high-volume applications, custom models reduce costs by 10-100x.
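
To sanity-check these numbers, here's a quick back-of-envelope calculation. The per-request token counts are assumptions; substitute your own traffic profile:

# Rough monthly cost estimate for hosted models.
# Prices are USD per 1M tokens (input, output); token counts are assumptions.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES[model]
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return requests * per_request

# 1M requests/month, assuming ~1,000 input and ~500 output tokens per request
print(monthly_cost("gpt-4o", 1_000_000, 1_000, 500))       # 12500.0
print(monthly_cost("gpt-4o-mini", 1_000_000, 1_000, 500))  # 450.0

The exact figures depend on your token mix, but they land in the same ballpark as the table above.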

Latency Control

Foundation model APIs depend on:

  • Network round-trip time (50-200ms)
  • API queue wait times (variable)
  • Rate limiting during high traffic

Local models provide:

  • No network latency (local inference)
  • Predictable response times
  • No rate limits

For real-time applications (voice assistants, live coding), local latency is critical.

Domain Specialization

Your Task API model was fine-tuned on task management conversations. It understands:

  • Task creation, updates, and completion workflows
  • Priority classification specific to your domain
  • Your organization's task vocabulary

A foundation model needs extensive prompting to match this. Your model does it by default.

Data Privacy

Some use cases require:

  • Data never leaves your infrastructure
  • No third-party API calls
  • Complete audit trails

Local models satisfy these requirements by design.

Integration Patterns

Custom models connect to agent frameworks through three main patterns:

Pattern 1: Direct Ollama Integration

┌─────────────────┐         ┌───────────────────┐
│   Agent Code    │◀───────▶│   Ollama Server   │
│  (Python SDK)   │  HTTP   │  localhost:11434  │
└─────────────────┘         └───────────────────┘

How it works:

  • Agent code calls Ollama's REST API directly
  • No intermediate proxy
  • Simplest setup for local development

Code example:

import requests

def generate(prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the model's reply."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "task-api-model",
            "prompt": prompt,
            "stream": False,
        },
    )
    response.raise_for_status()  # surface HTTP errors instead of failing silently
    return response.json()["response"]

Output:

>>> generate("Create a task for reviewing Q4 budget")
"I'll create a high-priority task for reviewing the Q4 budget..."

Best for: Simple agents, prototyping, learning

Pattern 2: LiteLLM Proxy

┌─────────────────┐         ┌───────────────────┐         ┌───────────────────┐
│   Agent Code    │◀───────▶│   LiteLLM Proxy   │◀───────▶│   Ollama Server   │
│  (OpenAI SDK)   │  HTTP   │  localhost:4000   │  HTTP   │  localhost:11434  │
└─────────────────┘         └───────────────────┘         └───────────────────┘

How it works:

  • LiteLLM provides OpenAI-compatible API
  • Agent code uses standard OpenAI SDK
  • Proxy translates requests to Ollama format

Why use a proxy?

| Benefit           | Explanation                             |
|-------------------|-----------------------------------------|
| SDK Compatibility | Use OpenAI SDK without code changes     |
| Model Switching   | Change backend without changing code    |
| Fallback Support  | Auto-fallback to GPT-4 if local fails   |
| Unified Interface | Same API for Ollama, vLLM, cloud models |
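
From the agent's side, Pattern 2 looks like ordinary OpenAI SDK code. A minimal sketch, assuming a LiteLLM proxy is already running on localhost:4000 and exposes your model under the name task-api-model:

from openai import OpenAI

# Point the standard OpenAI client at the local LiteLLM proxy.
# The api_key is a placeholder; the proxy decides whether keys are enforced.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="task-api-model",  # the name the proxy maps to your Ollama model
    messages=[{"role": "user", "content": "Create a task for reviewing Q4 budget"}],
)
print(response.choices[0].message.content)

Swapping the backend later means changing the proxy config, not this code.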

Best for: Production agents, SDK compatibility, multi-model setups

Pattern 3: Cloud Deployment

┌─────────────────┐         ┌───────────────────┐
│   Agent Code    │◀───────▶│  Cloud Endpoint   │
│  (OpenAI SDK)   │  HTTPS  │   your-api.com    │
└─────────────────┘         └───────────────────┘

How it works:

  • Custom model deployed on cloud infrastructure
  • Exposes OpenAI-compatible endpoint
  • Agent code connects via HTTPS

When to use:

  • Multiple clients need access
  • Scaling beyond single machine
  • Geographic distribution

Best for: Production services, team access, high availability

Choosing Your Integration Pattern

Use this decision framework:

Are you developing locally?
├── Yes ──▶ Pattern 1 (Direct Ollama) for prototyping;
│           Pattern 2 (LiteLLM) when adding SDK features
│
└── No ───▶ Do you need OpenAI SDK compatibility?
            ├── Yes ──▶ Pattern 2 (LiteLLM Proxy) or Pattern 3 (Cloud)
            └── No ───▶ Pattern 1 with custom client

For this chapter, we use Pattern 2 (LiteLLM Proxy) because:

  1. It provides OpenAI SDK compatibility
  2. It enables the OpenAI Agents SDK integration
  3. It supports fallback to foundation models
  4. It's a common choice for production agent systems
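
As a preview, the proxy side of that setup can be described in a small LiteLLM config file. This is an illustrative sketch; you'll build the real configuration in the following lessons:

# config.yaml -- illustrative LiteLLM proxy config
model_list:
  - model_name: task-api-model          # the name agent code requests
    litellm_params:
      model: ollama/task-api-model      # route requests to the Ollama backend
      api_base: http://localhost:11434

Starting the proxy (litellm --config config.yaml --port 4000) then gives any OpenAI-compatible client access to your model.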

The Agent Loop with Custom Backend

Here's how your custom model fits into the agent execution loop:

1. User input arrives.

2. Agent receives the input plus conversation history.

3. Agent calls YOUR MODEL via the LiteLLM proxy.

4. Your model reasons about the task:
   - What does the user want?
   - Which tool should I call?
   - What parameters should I pass?

5. Your model returns structured output:

   {
     "tool": "create_task",
     "arguments": {
       "title": "Review Q4 budget",
       "priority": "high"
     }
   }

6. Agent executes the tool with the provided arguments.

7. Tool returns a result.

8. Agent calls YOUR MODEL again with the tool result.

9. Your model generates the final response.

10. User receives the response.

Your model participates in steps 3-4 and 8-9: it decides which tools to use and how to respond. The sketch below ties the whole loop together.
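
Here's a minimal sketch of the loop in Python. For simplicity it calls Ollama directly, uses a stand-in create_task tool, and assumes the model replies with bare JSON in step 5; all of those details are illustrative, not a fixed contract:

import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def call_model(prompt: str) -> str:
    # Steps 3 and 8: ask the custom model to reason
    response = requests.post(
        OLLAMA_URL,
        json={"model": "task-api-model", "prompt": prompt, "stream": False},
    )
    response.raise_for_status()
    return response.json()["response"]

def create_task(title: str, priority: str) -> dict:
    # Step 6: stand-in tool; a real agent would call the Task API here
    return {"status": "created", "title": title, "priority": priority}

TOOLS = {"create_task": create_task}

def run_agent(user_input: str) -> str:
    # Steps 1-5: the model decides which tool to call, and with what arguments
    decision = json.loads(call_model(
        f"User request: {user_input}\n"
        "Reply with JSON only: {\"tool\": ..., \"arguments\": {...}}"
    ))
    # Steps 6-7: execute the chosen tool
    result = TOOLS[decision["tool"]](**decision["arguments"])
    # Steps 8-10: the model turns the tool result into a final response
    return call_model(
        f"Tool {decision['tool']} returned {json.dumps(result)}. "
        "Summarize the outcome for the user."
    )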

What Makes Custom Backends Different

When you use GPT-4 or Claude as your reasoning engine, OpenAI or Anthropic handles:

  • Model hosting and scaling
  • API availability
  • Response quality
  • Tool calling format

When you use a custom backend, you handle:

  • Model deployment (Ollama, vLLM)
  • Response quality (your fine-tuning)
  • Tool calling accuracy (structured output training)
  • Fallback strategies (error handling; see the sketch below)

This chapter teaches you to handle all four.
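
For the fallback piece in particular, here is a minimal sketch: try the local model first, and fall back to a hosted model on any request error. The model names are illustrative, and the OpenAI client assumes an OPENAI_API_KEY in your environment:

import requests
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_with_fallback(prompt: str) -> str:
    try:
        # Primary path: the local custom model via Ollama
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "task-api-model", "prompt": prompt, "stream": False},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()["response"]
    except requests.RequestException:
        # Fallback path: a hosted foundation model
        completion = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content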

Update Your Skill

After completing this lesson, update your agent-integration skill with:

Add a section on "Integration Pattern Selection" with:
- Decision framework for choosing patterns
- Comparison table of direct vs proxy vs cloud
- When to use each pattern

Try With AI

Prompt 1: Analyze Your Setup

I have my Task API model running on Ollama at localhost:11434. I'm building
a task management agent that needs:
- Tool calling for CRUD operations
- Sub-500ms latency for real-time UX
- Fallback to GPT-4o-mini when my model fails

Which integration pattern should I use and why? Walk me through the trade-offs.

What you're learning: Applying decision frameworks to your specific requirements.

Prompt 2: Map the Agent Loop

Trace through the 10-step agent loop for this scenario:

User says: "Create a task to review the marketing proposal by Friday"

My Task API model is the reasoning engine. Show me:
1. What the model receives at each step
2. What it outputs
3. How the tool execution works

Be specific about the JSON structures involved.

What you're learning: Understanding agent execution flow with concrete examples.

Prompt 3: Compare Cost Models

Help me calculate the monthly cost comparison for my task management agent:
- Expected volume: 50,000 requests/month
- Average request: 500 input tokens, 200 output tokens
- Currently using GPT-4o-mini

Compare costs for:
1. Staying with GPT-4o-mini
2. Switching to local Ollama (include hardware costs)
3. Hybrid: Custom model with GPT-4 fallback (10% fallback rate)

Which makes financial sense for my use case?

What you're learning: Quantifying the value proposition of custom backends.