
Multi-Agent Handoff & Production

Your voice agent works. Users speak, your agent listens, thinks, and responds. But what happens when a billing question comes in and your general-purpose agent stumbles? Or when a technical issue requires deep domain expertise your triage agent doesn't have?

This is where multi-agent architectures become essential. The same pattern that powers enterprise support systems—where you're transferred from a general representative to a billing specialist or technical expert—applies to voice AI. But here's the difference: in a well-designed system, the specialist already knows your entire conversation history. No "Can you repeat your account number?" No "What was the original issue again?"

This lesson teaches you to build that experience. You'll implement triage-to-specialist handoffs with full context preservation, then deploy the entire system to Kubernetes for production scale. By the end, your livekit-agents skill will be production-ready—a genuine Digital FTE component you can deploy for customers.

The Multi-Agent Pattern

Why Single Agents Hit Limits

A single voice agent can handle general conversations effectively. But consider these scenarios:

| Scenario | Single-Agent Challenge | Multi-Agent Solution |
| --- | --- | --- |
| "I need a refund for my March invoice" | General agent may hallucinate billing policies | Billing specialist with access to invoicing APIs |
| "My API keeps timing out under load" | Generic troubleshooting wastes time | Technical specialist who can run diagnostics |
| "I want to upgrade to enterprise" | Support agent lacks pricing authority | Sales specialist with quote generation tools |
| "I'm going to cancel unless..." | Retention requires nuanced handling | Escalation agent with retention authority |

Multi-agent architectures solve this by routing conversations to specialists who have:

  • Domain-specific knowledge encoded in their prompts
  • Specialized tools (billing APIs, diagnostic tools, CRM access)
  • Authority boundaries (what they can promise, what requires escalation)

The Triage Pattern

The triage agent is your front door. It greets users, understands their intent, and routes to the appropriate specialist:

User: "Hi, I have a question about my bill"
         │
         ▼
┌─────────────────┐
│  Triage Agent   │
│                 │
│ Intent: billing │
└────────┬────────┘
         │
         │ handoff_to(BillingAgent, context)
         ▼
┌─────────────────┐
│  Billing Agent  │
│                 │
│ "I see you're   │
│  calling about  │
│  your March     │
│  invoice..."    │
└─────────────────┘

The critical insight: handoff includes context. The billing agent doesn't start fresh—it receives the full conversation history and any extracted information (account ID, issue summary, sentiment).

Implementation: The Triage Agent

Here's how to implement the triage pattern with LiveKit Agents:

from livekit.agents import Agent, AgentSession, RunContext
from livekit.agents.llm import ChatMessage
from dataclasses import dataclass
from typing import Optional
import json

@dataclass
class HandoffContext:
    """Context preserved across agent handoffs."""
    conversation_history: list[ChatMessage]
    user_intent: str
    extracted_info: dict
    sentiment: str
    session_id: str

class TriageAgent(Agent):
    """Routes users to appropriate specialist agents."""

    INTENTS = {
        "billing": ["invoice", "payment", "refund", "charge", "bill", "pricing"],
        "technical": ["error", "bug", "timeout", "crash", "not working", "broken"],
        "sales": ["upgrade", "enterprise", "pricing", "demo", "features"],
        "escalation": ["cancel", "frustrated", "manager", "complaint", "unacceptable"]
    }

    async def on_enter(self, ctx: RunContext):
        """Greet user and begin intent detection."""
        await ctx.say(
            "Hello! I'm here to help you today. "
            "What can I assist you with?"
        )

    async def on_user_turn(self, ctx: RunContext, message: str):
        """Detect intent and route to specialist."""
        # Classify intent and remember it for the handoff context
        intent = await self.classify_intent(ctx, message)
        self.last_intent = intent

        if intent == "billing":
            await self.handoff_to_specialist(
                ctx, BillingAgent, "billing specialist"
            )
        elif intent == "technical":
            await self.handoff_to_specialist(
                ctx, TechnicalAgent, "technical support specialist"
            )
        elif intent == "sales":
            await self.handoff_to_specialist(
                ctx, SalesAgent, "sales representative"
            )
        elif intent == "escalation":
            await self.handoff_to_specialist(
                ctx, EscalationAgent, "customer success manager"
            )
        else:
            # Handle general queries directly
            response = await ctx.llm.generate(message)
            await ctx.say(response)

    async def classify_intent(
        self, ctx: RunContext, message: str
    ) -> Optional[str]:
        """Use LLM to classify user intent."""
        classification_prompt = f"""
        Classify the following user message into one category:
        - billing: Questions about invoices, payments, refunds
        - technical: Issues with the product, errors, bugs
        - sales: Interest in purchasing, upgrading, demos
        - escalation: Frustrated customers, complaints, cancellation threats
        - general: Everything else

        User message: "{message}"

        Return only the category name, nothing else.
        """

        result = await ctx.llm.generate(classification_prompt)
        intent = result.strip().lower()

        return intent if intent in self.INTENTS else None

    async def handoff_to_specialist(
        self,
        ctx: RunContext,
        agent_class: type,
        agent_name: str
    ):
        """Transfer to specialist with full context."""
        # Build handoff context (extract_key_info and detect_sentiment are
        # helper methods defined on this agent, not shown here)
        context = HandoffContext(
            conversation_history=ctx.chat_history,
            user_intent=self.last_intent,
            extracted_info=await self.extract_key_info(ctx),
            sentiment=await self.detect_sentiment(ctx),
            session_id=ctx.session.id
        )

        # Announce the transfer
        await ctx.say(
            f"I'm connecting you with our {agent_name} who can "
            "help you with this. One moment please."
        )

        # Perform handoff
        await ctx.handoff(agent_class, context=context)

Output:

User: "Hi, I have a question about my March invoice"
Agent: "Hello! I'm here to help you today. What can I assist you with?"
[Intent detected: billing]
Agent: "I'm connecting you with our billing specialist who can
help you with this. One moment please."
[Handoff to BillingAgent with context]
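LLM classifiers occasionally return labels outside the expected set. One defensive option, sketched below, is to reuse the `INTENTS` keyword map as a deterministic fallback when `classify_intent` returns `None`. This is an illustrative addition, not part of the LiveKit API; dictionary order breaks ties (for example, "pricing" appears in both billing and sales, so billing wins).

```python
from typing import Optional

# Keyword map copied from TriageAgent above; doubles as a fallback classifier.
INTENTS = {
    "billing": ["invoice", "payment", "refund", "charge", "bill", "pricing"],
    "technical": ["error", "bug", "timeout", "crash", "not working", "broken"],
    "sales": ["upgrade", "enterprise", "pricing", "demo", "features"],
    "escalation": ["cancel", "frustrated", "manager", "complaint", "unacceptable"],
}

def keyword_intent(message: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the message."""
    lowered = message.lower()
    for intent, keywords in INTENTS.items():
        if any(kw in lowered for kw in keywords):
            return intent
    return None
```

In `on_user_turn`, you could call this when the LLM's answer is not a known category, falling back to the general path only when both classifiers come up empty.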

Implementation: The Specialist Agent

The specialist agent receives context and acknowledges the prior conversation:

class BillingAgent(Agent):
    """Handles billing inquiries with invoice API access."""

    async def on_enter(self, ctx: RunContext):
        """Acknowledge context and demonstrate awareness."""
        if ctx.handoff_context:
            # We have context from triage
            context = ctx.handoff_context

            # Summarize what we know
            await ctx.say(
                "Hi there! I'm your billing specialist. "
                "I see you're calling about your invoice. "
                "Let me pull up your account details."
            )

            # If we extracted account info, use it
            if "account_id" in context.extracted_info:
                account = await self.fetch_account(
                    context.extracted_info["account_id"]
                )
                await ctx.say(
                    f"I have your account open, {account.name}. "
                    f"How can I help with your billing question?"
                )
        else:
            # Direct entry without handoff
            await ctx.say(
                "Hello! I'm your billing specialist. "
                "How can I help you today?"
            )

    async def on_user_turn(self, ctx: RunContext, message: str):
        """Handle billing-specific queries with tool access."""
        # Use MCP tools for billing operations
        if "refund" in message.lower():
            result = await ctx.mcp.call_tool(
                "billing_server",
                "check_refund_eligibility",
                {"session_id": ctx.session.id}
            )
            await ctx.say(self.format_refund_response(result))
        elif "invoice" in message.lower():
            invoices = await ctx.mcp.call_tool(
                "billing_server",
                "get_recent_invoices",
                {"session_id": ctx.session.id}
            )
            await ctx.say(self.format_invoice_response(invoices))
        else:
            response = await ctx.llm.generate(
                message,
                system=self.BILLING_SYSTEM_PROMPT
            )
            await ctx.say(response)

Output:

[After handoff from TriageAgent]
BillingAgent: "Hi there! I'm your billing specialist. I see you're
calling about your invoice. Let me pull up your
account details."
[Fetches account via API]
BillingAgent: "I have your account open, Sarah. How can I help
with your billing question?"
User: "I was charged twice for March"
BillingAgent: [Calls check_refund_eligibility tool]
"I can see the duplicate charge on your account.
You're eligible for a full refund of $49.99. Would
you like me to process that now?"

Context Preservation Deep Dive

What to Preserve

Not all context is equal. Preserve information that prevents user repetition:

| Context Type | Example | Why Preserve |
| --- | --- | --- |
| Intent summary | "User asking about March invoice refund" | Specialist knows the topic immediately |
| Extracted entities | account_id, invoice_number, product_name | No "Can you give me your account number again?" |
| Conversation history | Full transcript of triage conversation | Specialist can reference prior statements |
| Sentiment | frustrated, neutral, happy | Specialist adjusts tone appropriately |
| Prior attempts | "User already tried clearing cache" | No redundant troubleshooting |

What NOT to Preserve

Large or stale context degrades performance:

| Context Type | Why Exclude |
| --- | --- |
| Raw audio buffers | Too large, already transcribed |
| Internal system logs | Not relevant to conversation |
| Old session data | Stale context confuses the LLM |
| PII beyond what's needed | Privacy and security risk |
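These exclusion rules can be enforced mechanically before any handoff rather than trusting each agent to remember them. The sketch below assumes illustrative field names (`audio_buffer`, `system_logs`, `ssn`, and so on); adapt the allow and deny sets to the entities your extraction prompt actually produces.

```python
# Illustrative field names; tune these sets to your own extraction schema.
EXCLUDED_KEYS = {"audio_buffer", "system_logs", "raw_session_dump"}
ALLOWED_PII = {"account_id", "invoice_number"}  # the minimum specialists need
KNOWN_PII = {"account_id", "invoice_number", "ssn", "card_number", "email"}

def sanitize_extracted_info(extracted: dict) -> dict:
    """Keep only fields a specialist needs; strip excluded data and excess PII."""
    cleaned = {}
    for key, value in extracted.items():
        if key in EXCLUDED_KEYS:
            continue  # raw buffers, logs, stale dumps never cross a handoff
        if key in KNOWN_PII and key not in ALLOWED_PII:
            continue  # PII beyond what's needed is a privacy risk
        cleaned[key] = value
    return cleaned
```

Running every `HandoffContext.extracted_info` through a filter like this also gives you a single place to audit when privacy requirements change.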

Implementation: Context Builder

import json

from livekit.agents import RunContext
# HandoffContext is the dataclass defined alongside TriageAgent above.

class ContextBuilder:
    """Builds minimal, relevant handoff context."""

    @staticmethod
    async def build_handoff_context(
        ctx: RunContext,
        intent: str
    ) -> HandoffContext:
        """Extract only what the specialist needs."""

        # Summarize conversation (not full history for long sessions)
        summary = await ctx.llm.generate(
            f"Summarize this conversation in 2-3 sentences, "
            f"focusing on the user's {intent} issue:\n"
            f"{ctx.chat_history[-10:]}"  # Last 10 turns only
        )

        # Extract structured information
        extracted = await ContextBuilder.extract_entities(ctx, intent)
        extracted["summary"] = summary  # intent summary travels with entities

        # Detect sentiment for tone matching
        sentiment = await ContextBuilder.detect_sentiment(ctx)

        return HandoffContext(
            conversation_history=ctx.chat_history[-10:],
            user_intent=intent,
            extracted_info=extracted,
            sentiment=sentiment,
            session_id=ctx.session.id
        )

    @staticmethod
    async def extract_entities(
        ctx: RunContext,
        intent: str
    ) -> dict:
        """Extract intent-specific entities."""
        extraction_prompt = f"""
        Extract key information from this conversation for a
        {intent} specialist. Return JSON with relevant fields.

        For billing: account_id, invoice_number, amount, date
        For technical: error_message, product, steps_tried
        For sales: current_plan, interest, company_size

        Conversation: {ctx.chat_history[-5:]}

        Return only valid JSON.
        """

        result = await ctx.llm.generate(extraction_prompt)
        try:
            return json.loads(result)
        except json.JSONDecodeError:
            return {}
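`ContextBuilder.detect_sentiment` is referenced above but not shown. One possible shape is an LLM call mirroring `extract_entities`; a cheaper lexicon heuristic, sketched below, can also serve as a fallback when you want sentiment without an extra model round-trip. The word lists here are illustrative assumptions, not a vetted sentiment lexicon.

```python
# Illustrative word lists; a real deployment would use a richer lexicon
# or an LLM call, with this as a zero-cost fallback.
NEGATIVE = {"frustrated", "angry", "unacceptable", "terrible", "cancel", "worst"}
POSITIVE = {"thanks", "great", "love", "perfect", "awesome"}

def heuristic_sentiment(recent_messages: list[str]) -> str:
    """Crude lexicon fallback returning 'frustrated', 'happy', or 'neutral'."""
    text = " ".join(recent_messages).lower()
    neg = sum(word in text for word in NEGATIVE)
    pos = sum(word in text for word in POSITIVE)
    if neg > pos:
        return "frustrated"
    if pos > neg:
        return "happy"
    return "neutral"
```

Because the triage table routes "frustrated" straight to escalation, even a rough signal like this is useful as a safety net when the LLM classifier is slow or unavailable.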

Kubernetes Deployment

Your multi-agent voice system works locally. Now let's deploy it to Kubernetes for production scale.

Architecture for Scale

┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
└─────────────────────────────────────────────────────────────┘
        │                     │                     │
┌───────▼───────┐     ┌───────▼───────┐     ┌───────▼───────┐
│ Worker Pod 1  │     │ Worker Pod 2  │     │ Worker Pod N  │
│               │     │               │     │               │
│ TriageAgent   │     │ BillingAgent  │     │ TechnicalAgent│
│ BillingAgent  │     │ SalesAgent    │     │ EscalationAgt │
│ TechnicalAgt  │     │ TechnicalAgt  │     │ SalesAgent    │
└───────┬───────┘     └───────────────┘     └───────────────┘
        │
        │ Session state
        ▼
┌───────────────┐
│     Redis     │ ◄── Session persistence, handoff context
│    Cluster    │
└───────────────┘

Worker Deployment

# livekit-workers-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent-workers
  labels:
    app: voice-agent
    component: worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: voice-agent
      component: worker
  template:
    metadata:
      labels:
        app: voice-agent
        component: worker
    spec:
      containers:
        - name: worker
          image: myregistry/voice-agent:v1.2.0
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          env:
            - name: LIVEKIT_URL
              valueFrom:
                secretKeyRef:
                  name: livekit-credentials
                  key: url
            - name: LIVEKIT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: livekit-credentials
                  key: api-key
            - name: LIVEKIT_API_SECRET
              valueFrom:
                secretKeyRef:
                  name: livekit-credentials
                  key: api-secret
            - name: REDIS_URL
              value: "redis://redis-cluster:6379"
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: openai-credentials
                  key: api-key
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3

Horizontal Pod Autoscaler

Scale based on concurrent sessions, not just CPU:

# voice-agent-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent-workers
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_sessions
        target:
          type: AverageValue
          averageValue: "50"  # 50 concurrent sessions per pod
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
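The `active_sessions` Pods metric only works if each worker actually exposes it and a metrics pipeline (typically Prometheus scraping the `metrics` port, plus an adapter such as prometheus-adapter) surfaces it to the HPA. The stdlib-only sketch below shows the worker-side half: a thread-safe gauge rendered in Prometheus text exposition format. In practice you would likely use the `prometheus_client` library and serve this at `/metrics` on port 9090; the wiring into session start/end callbacks is an assumption about your worker code.

```python
import threading

class SessionGauge:
    """Thread-safe counter for concurrent voice sessions on this worker."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def inc(self) -> None:
        with self._lock:
            self._value += 1

    def dec(self) -> None:
        with self._lock:
            self._value -= 1

    @property
    def value(self) -> int:
        return self._value

ACTIVE_SESSIONS = SessionGauge()  # call inc()/dec() from session callbacks

def render_metrics() -> str:
    """Prometheus text exposition format for a /metrics endpoint."""
    return (
        "# HELP active_sessions Concurrent voice sessions on this worker\n"
        "# TYPE active_sessions gauge\n"
        f"active_sessions {ACTIVE_SESSIONS.value}\n"
    )
```

Without this export path the HPA silently falls back to the CPU metric alone, which under-scales voice workloads.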

Session Persistence with Redis

Sessions must survive pod restarts:

# session_persistence.py
from redis import asyncio as aioredis
from livekit.agents import AgentSession
import pickle
from typing import Optional

class RedisSessionStore:
    """Persist agent sessions to Redis for fault tolerance."""

    def __init__(self, redis_url: str):
        self.redis = aioredis.from_url(redis_url)
        self.ttl = 3600  # 1 hour session TTL

    async def save_session(self, session: AgentSession):
        """Persist session state."""
        key = f"session:{session.id}"
        data = {
            "id": session.id,
            "agent_type": session.agent.__class__.__name__,
            "chat_history": [
                {"role": m.role, "content": m.content}
                for m in session.chat_history
            ],
            "context": session.context,
            "created_at": session.created_at.isoformat(),
        }
        await self.redis.setex(
            key,
            self.ttl,
            pickle.dumps(data)
        )

    async def load_session(
        self, session_id: str
    ) -> Optional[dict]:
        """Restore session from Redis."""
        key = f"session:{session_id}"
        data = await self.redis.get(key)
        if data:
            return pickle.loads(data)
        return None

    async def save_handoff_context(
        self,
        session_id: str,
        context: HandoffContext
    ):
        """Persist handoff context for cross-pod handoffs."""
        key = f"handoff:{session_id}"
        await self.redis.setex(
            key,
            300,  # 5 minute TTL for handoffs
            pickle.dumps(context)
        )

Output:

# Verify deployment
$ kubectl get pods -l app=voice-agent
NAME                                   READY   STATUS    RESTARTS   AGE
voice-agent-workers-7d9f8b6c44-abc12   1/1     Running   0          5m
voice-agent-workers-7d9f8b6c44-def34   1/1     Running   0          5m
voice-agent-workers-7d9f8b6c44-ghi56   1/1     Running   0          5m

# Check HPA status
$ kubectl get hpa voice-agent-hpa
NAME              REFERENCE                        TARGETS          MINPODS   MAXPODS   REPLICAS
voice-agent-hpa   Deployment/voice-agent-workers   23/50, 45%/70%   3         20        3

# Verify Redis sessions
$ kubectl exec -it redis-0 -- redis-cli keys "session:*" | wc -l
47
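One caveat on the store above: `pickle.loads` executes arbitrary code when deserializing, so unpickling data from a shared Redis is a risk if anything else can write those keys. A JSON-based variant, sketched below for the session dict shape used by `save_session`, avoids that and also lets you reject corrupt entries explicitly. This is an alternative under the stated assumption, not the lesson's canonical implementation.

```python
import json
from typing import Optional

# Keys that save_session always writes; used to reject corrupt entries.
REQUIRED_KEYS = {"id", "agent_type", "chat_history", "created_at"}

def encode_session(data: dict) -> bytes:
    """Serialize the session dict for SETEX, replacing pickle.dumps."""
    return json.dumps(data).encode("utf-8")

def decode_session(raw: bytes) -> Optional[dict]:
    """Deserialize and validate; return None for corrupt or partial entries."""
    try:
        data = json.loads(raw.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data
```

The trade-off is that every value must be JSON-serializable, so `session.context` would need its own `to_dict`-style conversion before saving.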

Health Checks Implementation

# health.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from livekit.agents import Worker
from typing import Optional

app = FastAPI()
worker: Optional[Worker] = None

@app.get("/health/live")
async def liveness():
    """Pod is alive and should not be killed."""
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    """Pod is ready to accept new sessions."""
    # FastAPI ignores Flask-style (body, status) tuples, so use
    # JSONResponse to actually return 503 to the kubelet.
    if worker is None:
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "reason": "worker_not_initialized"},
        )

    if worker.active_sessions >= worker.max_sessions:
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "reason": "at_capacity"},
        )

    if not await worker.check_livekit_connection():
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "reason": "livekit_disconnected"},
        )

    return {
        "status": "ready",
        "active_sessions": worker.active_sessions,
        "capacity": worker.max_sessions,
    }
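The readiness probe covers pods coming up; pods going down need the mirror image. On SIGTERM, Kubernetes waits `terminationGracePeriodSeconds` before killing the pod, which gives in-flight calls a window to finish if the worker stops accepting new sessions immediately. The sketch below is a minimal drain flag, assuming your readiness handler checks it and returns 503 while draining; the Worker wiring is left as an assumption.

```python
import signal
import threading

class Drainer:
    """Flip to draining on SIGTERM so the readiness probe fails first."""

    def __init__(self) -> None:
        self.draining = threading.Event()

    def install(self) -> None:
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        # Readiness should now return 503; no new sessions are routed here,
        # while existing calls continue until they complete or the grace
        # period expires.
        self.draining.set()

    @property
    def is_draining(self) -> bool:
        return self.draining.is_set()
```

Pair this with a `terminationGracePeriodSeconds` longer than your longest expected call, and the slow scale-down policy in the HPA above stops costing you dropped conversations.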

Finalize Your Skill: Production Ready

You've learned multi-agent patterns and Kubernetes deployment. Now update your livekit-agents skill to include this production knowledge.

Skill Review Checklist

Your livekit-agents skill should now guide:

| Capability | Status | Notes |
| --- | --- | --- |
| Architecture explanation | From Lesson 1 | Agents, AgentSessions, Workers |
| Voice pipeline setup | From Lesson 1 | STT, LLM, TTS configuration |
| Turn detection config | From Lesson 2 | Semantic detection, barge-in |
| MCP integration | From Lesson 2 | One-line tool connection |
| Multi-agent handoffs | NEW | Triage patterns, context preservation |
| K8s deployment | NEW | Workers, HPA, session persistence |

Update Your Skill

Add these sections to .claude/skills/livekit-agents/SKILL.md:

## Multi-Agent Architecture

### When to Use Multi-Agent
- User intents span multiple domains (billing, technical, sales)
- Specialists need different tools and authority levels
- Complex issues require escalation paths

### The Triage Pattern
1. Triage agent greets user and detects intent
2. On intent match, build HandoffContext with:
   - Conversation history (last 10 turns)
   - Extracted entities (account_id, issue summary)
   - Sentiment (for tone matching)
3. Announce transfer, then handoff to specialist
4. Specialist acknowledges context immediately

### Context Preservation Rules
- PRESERVE: Intent, entities, sentiment, recent history
- EXCLUDE: Raw audio, system logs, stale data, excessive PII

## Kubernetes Deployment

### Deployment Checklist
- [ ] Workers deployed with resource limits (500m-2000m CPU, 512Mi-2Gi RAM)
- [ ] HPA configured for active_sessions metric (50/pod target)
- [ ] Redis for session persistence (1 hour TTL)
- [ ] Health probes: /health/live (liveness), /health/ready (readiness)
- [ ] Secrets for LIVEKIT_*, OPENAI_API_KEY

### Scaling Considerations
- Scale on sessions, not just CPU (voice agents are memory-bound)
- Use slow scale-down (300s stabilization) to avoid session disruption
- Enable Redis cluster for high availability

Test Your Finalized Skill

Prompt: "I need to deploy a multi-agent voice system to Kubernetes.
We have triage, billing, and technical specialists. Walk me through
the deployment architecture and what I need to configure."

Your skill should now generate:

  1. Multi-agent architecture with handoff patterns
  2. Kubernetes manifests with proper resource configuration
  3. Redis session persistence setup
  4. Health check endpoints
  5. HPA configuration for voice workloads

Try With AI

Use your livekit-agents skill with these prompts to solidify production patterns.

Prompt 1: Design Your Agent Topology

I'm building a voice-enabled customer support system with these requirements:

- Users call about billing (40%), technical issues (35%), sales (15%), other (10%)
- Frustrated customers should be escalated immediately
- Each specialist needs different MCP tools:
  - Billing: invoice API, refund processor
  - Technical: diagnostic tools, ticket system
  - Sales: CRM, pricing calculator

Use my livekit-agents skill to design:
1. The triage agent's intent detection logic
2. What context passes in each handoff type
3. Fallback behavior when specialists are overloaded
4. How to handle mid-conversation topic switches

I'll implement this and test with real scenarios.

What you're learning: Architecture decisions for production multi-agent systems - balancing user experience with operational complexity.

Prompt 2: Optimize for Scale

My voice agent is deployed but I'm seeing issues at scale:

- 500+ concurrent sessions
- Some handoffs fail when the receiving pod is at capacity
- Sessions are lost when pods restart during traffic spikes
- Turn detection latency increases under load

Current setup:
- 5 worker pods, 100 sessions each
- Redis single instance
- No GPU for STT (using API)

Use my livekit-agents skill to diagnose and recommend:
1. What's likely causing handoff failures?
2. How do I make session persistence more robust?
3. Should I add GPU nodes for STT? What's the cost/benefit?
4. What metrics should I monitor for early warning?

Walk me through each recommendation step by step.

What you're learning: Performance optimization - diagnosing bottlenecks and making data-driven infrastructure decisions.

Prompt 3: Production Checklist

I'm ready to launch my voice agent to production customers. Before
I do, I want to validate my deployment is production-grade.

My current state:
- Multi-agent system: triage + 3 specialists
- Kubernetes: 10 pods across 3 nodes
- Redis cluster: 3 nodes
- Monitoring: Prometheus + Grafana
- No GPU (API-based STT/TTS)

Use my livekit-agents skill to generate a production readiness checklist:
1. What could fail that I haven't tested?
2. What monitoring alerts should exist before launch?
3. What's my rollback plan if something goes wrong?
4. What documentation do I need for on-call engineers?
5. What load testing should I run first?

Be thorough - I'd rather delay launch than have an outage.

What you're learning: Production readiness evaluation - thinking about failure modes before they happen.

Safety Note

Multi-agent voice systems handle real customer conversations. Before production deployment:

  • Test handoff flows with realistic scenarios
  • Verify context preservation doesn't leak PII between unrelated sessions
  • Ensure fallback behavior is graceful (never drop a call silently)
  • Monitor sentiment trends to catch systemic issues