OpenAI SDK with Custom Ollama Backend
Your LiteLLM proxy is running. Your Task API model is serving requests. But writing raw HTTP calls for every interaction creates fragile, hard-to-maintain code.
The OpenAI Python SDK provides a battle-tested interface used by millions of developers. By pointing it at your LiteLLM proxy, you get professional-grade tooling for free—type hints, streaming, retries, and error handling—while running your own model.
This lesson shows you how to use the same SDK that powers GPT-4 applications with your custom fine-tuned model. Your code becomes portable: swap backends by changing two parameters, not rewriting your application.
The SDK Advantage
Compare the approaches:
Raw HTTP (Fragile)
```python
import requests

def create_task(prompt: str) -> str:
    response = requests.post(
        "http://localhost:4000/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-local-dev-key"
        },
        json={
            "model": "task-api-model",
            "messages": [{"role": "user", "content": prompt}]
        },
        timeout=30
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```
Output:

```
>>> create_task("Create a task for reviewing budget")
"I'll create a high-priority task for reviewing the budget..."
```
Problems:
- No type hints
- Manual JSON parsing
- Basic error handling
- No streaming support built-in
- Retry logic must be implemented manually
OpenAI SDK (Robust)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

response = client.chat.completions.create(
    model="task-api-model",
    messages=[{"role": "user", "content": "Create a task for reviewing budget"}]
)

print(response.choices[0].message.content)
```
Output:

```
I'll create a high-priority task for reviewing the budget...
```
Benefits:
- Full type hints and IDE autocomplete
- Automatic response parsing into typed objects
- Built-in retry logic with exponential backoff
- Streaming support
- Consistent interface across all backends
Setting Up the SDK Client
Installation
```bash
pip install openai
```

Output:

```
Collecting openai
  Downloading openai-1.58.0-py3-none-any.whl (456 kB)
Successfully installed openai-1.58.0
```
Client Configuration
The key insight: `base_url` redirects all SDK calls to your proxy.
```python
from openai import OpenAI

# Point to the LiteLLM proxy instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:4000/v1",  # Your LiteLLM proxy
    api_key="sk-local-dev-key"            # Your proxy's master key
)
```
What happens under the hood:
| SDK Call | Without `base_url` | With `base_url` |
|---|---|---|
| `chat.completions.create()` | `api.openai.com/v1/chat/completions` | `localhost:4000/v1/chat/completions` |
| `models.list()` | `api.openai.com/v1/models` | `localhost:4000/v1/models` |
Your code looks identical to OpenAI API code. Only the endpoint changes.
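A quick sanity check that the redirect works: list the models your proxy exposes. This is a minimal sketch assuming the proxy and master key from the earlier lessons:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

# Served by localhost:4000/v1/models, not api.openai.com
for model in client.models.list():
    print(model.id)  # Expect "task-api-model" among the results
```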
Synchronous Completions
Basic Request
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def create_task(description: str) -> str:
    """Create a task using the Task API model."""
    response = client.chat.completions.create(
        model="task-api-model",
        messages=[
            {
                "role": "system",
                "content": "You are TaskMaster, a helpful task management assistant."
            },
            {
                "role": "user",
                "content": description
            }
        ],
        temperature=0.7,
        max_tokens=500
    )
    return response.choices[0].message.content

# Usage
result = create_task("Create a high-priority task for Q4 budget review")
print(result)
```
Output:

```
I'll create that task for you:

**Task Created:**
- Title: Q4 Budget Review
- Priority: High
- Status: Pending
- Due: End of quarter

Would you like me to add any subtasks or set a specific deadline?
```
Multi-Turn Conversations
The chat API is stateless, so your application maintains the conversation by appending each turn to the messages list:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

class TaskAssistant:
    def __init__(self):
        self.messages = [
            {"role": "system", "content": "You are TaskMaster, a task management assistant."}
        ]

    def chat(self, user_message: str) -> str:
        """Send message and get response, maintaining conversation history."""
        self.messages.append({"role": "user", "content": user_message})
        response = client.chat.completions.create(
            model="task-api-model",
            messages=self.messages
        )
        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})
        return assistant_message

# Multi-turn conversation
assistant = TaskAssistant()
print(assistant.chat("Create a task for project review"))
print("---")
print(assistant.chat("Set the priority to urgent"))
print("---")
print(assistant.chat("Add a due date of next Friday"))
```
Output:

```
I'll create a task for your project review:

**Task Created:**
- Title: Project Review
- Priority: Normal
- Status: Pending
---
I've updated the task priority:

**Task Updated:**
- Title: Project Review
- Priority: Urgent ⚠️
- Status: Pending
---
I've added the due date:

**Task Updated:**
- Title: Project Review
- Priority: Urgent ⚠️
- Due: Friday, January 10, 2026
- Status: Pending
```
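One caveat the output hides: `self.messages` grows with every turn, and the full history is re-sent on each request. A minimal trimming sketch (`trim_history` is a hypothetical helper, not part of the SDK; production code would count tokens rather than messages):

```python
def trim_history(messages: list[dict], max_turns: int = 20) -> list[dict]:
    """Keep the system prompt plus the most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-max_turns:]
    return system + recent
```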
Streaming Responses
For real-time UX, stream tokens as they're generated:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def stream_task_response(prompt: str):
    """Stream response tokens in real-time."""
    stream = client.chat.completions.create(
        model="task-api-model",
        messages=[
            {"role": "system", "content": "You are TaskMaster."},
            {"role": "user", "content": prompt}
        ],
        stream=True  # Enable streaming
    )
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content
    print()  # Newline after stream completes
    return full_response

# Usage
response = stream_task_response("List my top 3 pending tasks")
```
Output (appears progressively):

```
Here are your top 3 pending tasks:

1. **Q4 Budget Review** - Priority: Urgent, Due: Jan 10
2. **Project Documentation** - Priority: High, Due: Jan 15
3. **Team Sync Meeting Prep** - Priority: Normal, Due: Jan 8

Would you like to update any of these?
```
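If you also want token usage at the end of a stream, the chat completions API accepts a `stream_options` parameter. Support for it through a LiteLLM proxy may vary by version, so treat this as a hedged sketch (reusing `client` from above):

```python
stream = client.chat.completions.create(
    model="task-api-model",
    messages=[{"role": "user", "content": "List my tasks"}],
    stream=True,
    stream_options={"include_usage": True}  # Final chunk carries usage, with empty choices
)

for chunk in stream:
    # Guard on chunk.choices: the usage-only chunk has none
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\n[total tokens: {chunk.usage.total_tokens}]")
```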
Streaming with Callback
For UI integration:
```python
from typing import Callable

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def stream_with_callback(
    prompt: str,
    on_token: Callable[[str], None],
    on_complete: Callable[[str], None]
):
    """Stream with callbacks for UI integration."""
    stream = client.chat.completions.create(
        model="task-api-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            on_token(token)
            full_response += token
    on_complete(full_response)

# Example usage with callbacks
def handle_token(token: str):
    # Update UI progressively
    print(f"Token: {repr(token)}")

def handle_complete(response: str):
    print(f"\nComplete response length: {len(response)} chars")

stream_with_callback(
    "Create a quick task",
    on_token=handle_token,
    on_complete=handle_complete
)
```
Output:

```
Token: 'I'
Token: "'ll"
Token: ' create'
Token: ' a'
Token: ' quick'
Token: ' task'
Token: '...'

Complete response length: 145 chars
```
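Callbacks suit synchronous UIs; for async frameworks, the SDK's `AsyncOpenAI` client follows the same pattern. A minimal sketch against the same proxy:

```python
import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

async def stream_task_async(prompt: str) -> str:
    """Async equivalent of the streaming helpers above."""
    stream = await async_client.chat.completions.create(
        model="task-api-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    full_response = ""
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
    return full_response

# asyncio.run(stream_task_async("Create a quick task"))
```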
Error Handling
The SDK provides typed exceptions:
```python
import time

from openai import OpenAI, APIError, APIConnectionError, RateLimitError

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def robust_completion(prompt: str, max_retries: int = 3) -> str:
    """Make completion request with retry logic."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="task-api-model",
                messages=[{"role": "user", "content": prompt}],
                timeout=30.0
            )
            return response.choices[0].message.content
        except APIConnectionError as e:
            # Proxy or Ollama not reachable
            print(f"Connection error (attempt {attempt + 1}): {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
        except RateLimitError as e:
            # Too many requests
            print(f"Rate limited (attempt {attempt + 1}): {e}")
            time.sleep(5)
        except APIError as e:
            # Other API errors
            print(f"API error: {e}")
            raise
    # Only reachable if every attempt was rate limited
    raise RuntimeError(f"Exhausted {max_retries} retries")

# Usage
try:
    result = robust_completion("Create a task")
    print(result)
except Exception as e:
    print(f"Failed after retries: {e}")
```
Output (when Ollama is down):

```
Connection error (attempt 1): Connection refused
Connection error (attempt 2): Connection refused
Connection error (attempt 3): Connection refused
Failed after retries: Connection refused
```
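Note that the SDK already retries connection errors and certain rate-limit and server errors on its own with exponential backoff; a manual loop like the one above adds logging and finer control on top. The built-in behavior is configured on the client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key",
    max_retries=5,  # SDK default is 2
    timeout=30.0    # Applies to every request unless overridden per call
)
```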
Response Object Structure
Understanding the response structure helps with debugging:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

response = client.chat.completions.create(
    model="task-api-model",
    messages=[{"role": "user", "content": "Create a task"}]
)

# Explore the response object
print(f"ID: {response.id}")
print(f"Model: {response.model}")
print(f"Created: {response.created}")
print(f"Choices: {len(response.choices)}")
print(f"Content: {response.choices[0].message.content[:50]}...")
print(f"Finish Reason: {response.choices[0].finish_reason}")
print(f"Usage - Prompt Tokens: {response.usage.prompt_tokens}")
print(f"Usage - Completion Tokens: {response.usage.completion_tokens}")
print(f"Usage - Total Tokens: {response.usage.total_tokens}")
```
Output:

```
ID: chatcmpl-abc123xyz
Model: task-api-model
Created: 1735800000
Choices: 1
Content: I'll create that task for you. Here's what I've...
Finish Reason: stop
Usage - Prompt Tokens: 12
Usage - Completion Tokens: 85
Usage - Total Tokens: 97
```
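Responses are Pydantic models, so when you need the complete structure (for logging, or for diffing responses across backends), you can serialize the whole object instead of printing fields one by one:

```python
# The response is a Pydantic model; dump it to JSON for logging or debugging
print(response.model_dump_json(indent=2))
```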
Backend Portability
The power of this approach: swap backends without code changes.
```python
import os

from openai import OpenAI

def get_client(backend: str = "local") -> OpenAI:
    """Get configured client for different backends."""
    configs = {
        "local": {
            "base_url": "http://localhost:4000/v1",
            "api_key": "sk-local-dev-key"
        },
        "openai": {
            "base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY")
        },
        "staging": {
            "base_url": "https://staging-llm.yourcompany.com/v1",
            "api_key": os.environ.get("STAGING_API_KEY")
        }
    }
    config = configs.get(backend, configs["local"])
    return OpenAI(**config)

# Same code works with any backend
def create_task(client: OpenAI, description: str, model: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": description}]
    )
    return response.choices[0].message.content

# Development: use local model
local_client = get_client("local")
result = create_task(local_client, "Create a task", "task-api-model")

# Production: switch to OpenAI for comparison
# openai_client = get_client("openai")
# result = create_task(openai_client, "Create a task", "gpt-4o-mini")
```
Output:

```
>>> create_task(local_client, "Create a task", "task-api-model")
"I'll create that task for you..."
```
Update Your Skill
Add to your agent-integration skill:
Add a section on "OpenAI SDK Integration" with:
- Client configuration pattern (base_url override)
- Streaming implementation template
- Error handling with retry logic
- Backend portability factory pattern
Try With AI
Prompt 1: Build a Task Management CLI
Create a command-line task management application using the OpenAI SDK
with my Task API model. The CLI should support:
- Creating tasks with natural language
- Listing tasks
- Updating task priority
- Multi-turn conversation mode
Use the client configuration:
- base_url: http://localhost:4000/v1
- model: task-api-model
Include error handling for when the proxy is unavailable.
What you're learning: Building complete applications using SDK patterns with your custom model.
Prompt 2: Implement Conversation Memory
I want to build a task assistant that remembers context across sessions.
Currently I'm using the OpenAI SDK with LiteLLM proxy.
Help me design a conversation memory system that:
1. Saves conversation history to a JSON file
2. Loads history on startup
3. Trims history when it exceeds token limits
4. Maintains system prompt consistency
Show me the implementation with my task-api-model.
What you're learning: Extending SDK patterns with persistent state management.
Prompt 3: Compare Response Quality
I want to compare responses between my custom task-api-model and GPT-4o-mini
for the same prompts. Help me create a comparison script that:
1. Sends the same 5 task-related prompts to both models
2. Measures response time for each
3. Displays responses side-by-side
4. Calculates average latency difference
Use the backend portability pattern to switch between:
- Local: http://localhost:4000/v1, model: task-api-model
- OpenAI: https://api.openai.com/v1, model: gpt-4o-mini
What you're learning: Systematic comparison methodology for validating custom model quality against baselines.