
OpenAI SDK with Custom Ollama Backend

Your LiteLLM proxy is running. Your Task API model is serving requests. But writing raw HTTP calls for every interaction creates fragile, hard-to-maintain code.

The OpenAI Python SDK provides a battle-tested interface used by millions of developers. By pointing it at your LiteLLM proxy, you get professional-grade tooling for free—type hints, streaming, retries, and error handling—while running your own model.

This lesson shows you how to use the same SDK that powers GPT-4 applications with your custom fine-tuned model. Your code becomes portable: swap backends by changing two parameters, not rewriting your application.

The SDK Advantage

Compare the approaches:

Raw HTTP (Fragile)

import requests
import json

def create_task(prompt: str) -> str:
    response = requests.post(
        "http://localhost:4000/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-local-dev-key"
        },
        json={
            "model": "task-api-model",
            "messages": [{"role": "user", "content": prompt}]
        },
        timeout=30
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

Output:

>>> create_task("Create a task for reviewing budget")
"I'll create a high-priority task for reviewing the budget..."

Problems:

  • No type hints
  • Manual JSON parsing
  • Basic error handling
  • No streaming support built-in
  • Retry logic must be implemented manually

OpenAI SDK (Robust)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

response = client.chat.completions.create(
    model="task-api-model",
    messages=[{"role": "user", "content": "Create a task for reviewing budget"}]
)

print(response.choices[0].message.content)

Output:

I'll create a high-priority task for reviewing the budget...

Benefits:

  • Full type hints and IDE autocomplete
  • Automatic response parsing into typed objects
  • Built-in retry logic with exponential backoff
  • Streaming support
  • Consistent interface across all backends

Setting Up the SDK Client

Installation

pip install openai

Output:

Collecting openai
Downloading openai-1.58.0-py3-none-any.whl (456 kB)
Successfully installed openai-1.58.0

Client Configuration

The key insight: base_url redirects all SDK calls to your proxy.

from openai import OpenAI

# Point to LiteLLM proxy instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:4000/v1",  # Your LiteLLM proxy
    api_key="sk-local-dev-key"            # Your proxy's master key
)

What happens under the hood:

SDK Call                   | Without base_url                    | With base_url
chat.completions.create()  | api.openai.com/v1/chat/completions  | localhost:4000/v1/chat/completions
models.list()              | api.openai.com/v1/models            | localhost:4000/v1/models

Your code looks identical to OpenAI API code. Only the endpoint changes.
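
A quick way to confirm the redirect is to ask the proxy which models it exposes. This is a minimal sketch, assuming the client configuration above and that your LiteLLM config registers task-api-model:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

# models.list() now hits the proxy, not api.openai.com
for model in client.models.list():
    print(model.id)  # e.g. task-api-model, depending on your proxy config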

Synchronous Completions

Basic Request

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def create_task(description: str) -> str:
    """Create a task using the Task API model."""
    response = client.chat.completions.create(
        model="task-api-model",
        messages=[
            {
                "role": "system",
                "content": "You are TaskMaster, a helpful task management assistant."
            },
            {
                "role": "user",
                "content": description
            }
        ],
        temperature=0.7,
        max_tokens=500
    )
    return response.choices[0].message.content

# Usage
result = create_task("Create a high-priority task for Q4 budget review")
print(result)

Output:

I'll create that task for you:

**Task Created:**
- Title: Q4 Budget Review
- Priority: High
- Status: Pending
- Due: End of quarter

Would you like me to add any subtasks or set a specific deadline?

Multi-Turn Conversations

The chat completions API is stateless, so your code keeps the conversation history and sends it with every request:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

class TaskAssistant:
    def __init__(self):
        self.messages = [
            {"role": "system", "content": "You are TaskMaster, a task management assistant."}
        ]

    def chat(self, user_message: str) -> str:
        """Send message and get response, maintaining conversation history."""
        self.messages.append({"role": "user", "content": user_message})

        response = client.chat.completions.create(
            model="task-api-model",
            messages=self.messages
        )

        assistant_message = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

# Multi-turn conversation
assistant = TaskAssistant()

print(assistant.chat("Create a task for project review"))
print("---")
print(assistant.chat("Set the priority to urgent"))
print("---")
print(assistant.chat("Add a due date of next Friday"))

Output:

I'll create a task for your project review:

**Task Created:**
- Title: Project Review
- Priority: Normal
- Status: Pending
---
I've updated the task priority:

**Task Updated:**
- Title: Project Review
- Priority: Urgent ⚠️
- Status: Pending
---
I've added the due date:

**Task Updated:**
- Title: Project Review
- Priority: Urgent ⚠️
- Due: Friday, January 10, 2026
- Status: Pending
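
The SDK also ships an async client with the same interface. If your application already runs an event loop, a minimal async sketch (assuming the same proxy settings as above) looks like this:

import asyncio
from openai import AsyncOpenAI

# Same configuration as the sync client, only the class changes
client = AsyncOpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

async def create_task(description: str) -> str:
    """Async version of the basic request."""
    response = await client.chat.completions.create(
        model="task-api-model",
        messages=[{"role": "user", "content": description}]
    )
    return response.choices[0].message.content

print(asyncio.run(create_task("Create a task for project review")))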

Streaming Responses

For real-time UX, stream tokens as they're generated:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def stream_task_response(prompt: str):
    """Stream response tokens in real-time."""
    stream = client.chat.completions.create(
        model="task-api-model",
        messages=[
            {"role": "system", "content": "You are TaskMaster."},
            {"role": "user", "content": prompt}
        ],
        stream=True  # Enable streaming
    )

    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            print(content, end="", flush=True)
            full_response += content

    print()  # Newline after stream completes
    return full_response

# Usage
response = stream_task_response("List my top 3 pending tasks")

Output (appears progressively):

Here are your top 3 pending tasks:

1. **Q4 Budget Review** - Priority: Urgent, Due: Jan 10
2. **Project Documentation** - Priority: High, Due: Jan 15
3. **Team Sync Meeting Prep** - Priority: Normal, Due: Jan 8

Would you like to update any of these?

Streaming with Callback

For UI integration:

from openai import OpenAI
from typing import Callable

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def stream_with_callback(
    prompt: str,
    on_token: Callable[[str], None],
    on_complete: Callable[[str], None]
):
    """Stream with callbacks for UI integration."""
    stream = client.chat.completions.create(
        model="task-api-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            on_token(token)
            full_response += token

    on_complete(full_response)

# Example usage with callbacks
def handle_token(token: str):
    # Update UI progressively
    print(f"Token: {repr(token)}")

def handle_complete(response: str):
    print(f"\nComplete response length: {len(response)} chars")

stream_with_callback(
    "Create a quick task",
    on_token=handle_token,
    on_complete=handle_complete
)

Output:

Token: 'I'
Token: "'ll"
Token: ' create'
Token: ' a'
Token: ' quick'
Token: ' task'
Token: '...'

Complete response length: 145 chars

Error Handling

The SDK provides typed exceptions:

from openai import OpenAI, APIError, APIConnectionError, RateLimitError
import time

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

def robust_completion(prompt: str, max_retries: int = 3) -> str:
    """Make completion request with retry logic."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="task-api-model",
                messages=[{"role": "user", "content": prompt}],
                timeout=30.0
            )
            return response.choices[0].message.content

        except APIConnectionError as e:
            # Proxy or Ollama not reachable
            print(f"Connection error (attempt {attempt + 1}): {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

        except RateLimitError as e:
            # Too many requests
            print(f"Rate limited (attempt {attempt + 1}): {e}")
            time.sleep(5)

        except APIError as e:
            # Other API errors
            print(f"API error: {e}")
            raise

    # All retries exhausted (e.g. repeated rate limits)
    raise RuntimeError(f"Request failed after {max_retries} attempts")

# Usage
try:
    result = robust_completion("Create a task")
    print(result)
except Exception as e:
    print(f"Failed after retries: {e}")

Output (when Ollama is down):

Connection error (attempt 1): Connection refused
Connection error (attempt 2): Connection refused
Connection error (attempt 3): Connection refused
Failed after retries: Connection refused
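
The manual loop above gives you full control over logging and backoff, but the SDK also retries transient failures on its own. A minimal sketch, assuming you prefer to lean on the client's built-in max_retries and timeout settings:

from openai import OpenAI

# The SDK automatically retries connection errors, rate limits, and
# server errors; configure how aggressively on the client itself.
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key",
    max_retries=3,   # SDK default is 2
    timeout=30.0     # seconds per request
)

response = client.chat.completions.create(
    model="task-api-model",
    messages=[{"role": "user", "content": "Create a task"}]
)
print(response.choices[0].message.content)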

Response Object Structure

Understanding the response structure helps with debugging:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-local-dev-key"
)

response = client.chat.completions.create(
    model="task-api-model",
    messages=[{"role": "user", "content": "Create a task"}]
)

# Explore the response object
print(f"ID: {response.id}")
print(f"Model: {response.model}")
print(f"Created: {response.created}")
print(f"Choices: {len(response.choices)}")
print(f"Content: {response.choices[0].message.content[:50]}...")
print(f"Finish Reason: {response.choices[0].finish_reason}")
print(f"Usage - Prompt Tokens: {response.usage.prompt_tokens}")
print(f"Usage - Completion Tokens: {response.usage.completion_tokens}")
print(f"Usage - Total Tokens: {response.usage.total_tokens}")

Output:

ID: chatcmpl-abc123xyz
Model: task-api-model
Created: 1735800000
Choices: 1
Content: I'll create that task for you. Here's what I've...
Finish Reason: stop
Usage - Prompt Tokens: 12
Usage - Completion Tokens: 85
Usage - Total Tokens: 97

Backend Portability

The power of this approach: swap backends without code changes.

from openai import OpenAI
import os

def get_client(backend: str = "local") -> OpenAI:
    """Get configured client for different backends."""
    configs = {
        "local": {
            "base_url": "http://localhost:4000/v1",
            "api_key": "sk-local-dev-key"
        },
        "openai": {
            "base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY")
        },
        "staging": {
            "base_url": "https://staging-llm.yourcompany.com/v1",
            "api_key": os.environ.get("STAGING_API_KEY")
        }
    }

    config = configs.get(backend, configs["local"])
    return OpenAI(**config)

# Same code works with any backend
def create_task(client: OpenAI, description: str, model: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": description}]
    )
    return response.choices[0].message.content

# Development: use local model
local_client = get_client("local")
result = create_task(local_client, "Create a task", "task-api-model")

# Production: switch to OpenAI for comparison
# openai_client = get_client("openai")
# result = create_task(openai_client, "Create a task", "gpt-4o-mini")

Output:

>>> create_task(local_client, "Create a task", "task-api-model")
"I'll create that task for you..."

Update Your Skill

Add to your agent-integration skill:

Add a section on "OpenAI SDK Integration" with:
- Client configuration pattern (base_url override)
- Streaming implementation template
- Error handling with retry logic
- Backend portability factory pattern

Try With AI

Prompt 1: Build a Task Management CLI

Create a command-line task management application using the OpenAI SDK
with my Task API model. The CLI should support:
- Creating tasks with natural language
- Listing tasks
- Updating task priority
- Multi-turn conversation mode

Use the client configuration:
- base_url: http://localhost:4000/v1
- model: task-api-model

Include error handling for when the proxy is unavailable.

What you're learning: Building complete applications using SDK patterns with your custom model.

Prompt 2: Implement Conversation Memory

I want to build a task assistant that remembers context across sessions.
Currently I'm using the OpenAI SDK with LiteLLM proxy.

Help me design a conversation memory system that:
1. Saves conversation history to a JSON file
2. Loads history on startup
3. Trims history when it exceeds token limits
4. Maintains system prompt consistency

Show me the implementation with my task-api-model.

What you're learning: Extending SDK patterns with persistent state management.

Prompt 3: Compare Response Quality

I want to compare responses between my custom task-api-model and GPT-4o-mini
for the same prompts. Help me create a comparison script that:

1. Sends the same 5 task-related prompts to both models
2. Measures response time for each
3. Displays responses side-by-side
4. Calculates average latency difference

Use the backend portability pattern to switch between:
- Local: http://localhost:4000/v1, model: task-api-model
- OpenAI: https://api.openai.com/v1, model: gpt-4o-mini

What you're learning: Systematic comparison methodology for validating custom model quality against baselines.