Chapter 85: Production Voice Agent (Capstone) - Lesson Plan
Generated by: chapter-planner v2.0.0 (Reasoning-Activated)
Source: Part 11 README, Chapters 79-84 context, Deep Search Report
Created: 2026-01-02
Constitution: v6.0.0 (Reasoning Mode)
I. Chapter Analysis
Chapter Type
TECHNICAL (LAYER 4 CAPSTONE) - This is the integrative capstone for Part 11. Students orchestrate ALL accumulated knowledge from Chapters 79-84 into a production-grade voice-enabled Task Manager.
Recognition signals:
- Learning objectives use "design/architect/deploy/orchestrate"
- No new framework concepts - composition of existing skills
- Layer 4: Spec-Driven Integration (spec FIRST, then implementation)
- Composition of skills: `livekit-agents`, `pipecat`, `voice-telephony`, `web-audio-capture`
- Production deployment with Kubernetes, monitoring, cost optimization
- Business considerations: cost analysis, compliance, SLAs
This is NOT a skill-first (L00) chapter - Capstones COMPOSE existing skills, they don't create new ones.
Concept Density Analysis
Core Concepts (from Capstone requirements): 8 integration concepts
- Multi-channel architecture (browser WebRTC + phone SIP)
- Provider selection strategy (when Native S2S vs Cascaded Pipeline)
- Multimodal integration (voice + screen sharing via Gemini)
- Conversation design (turn-taking, interruption, barge-in patterns)
- Kubernetes voice deployment (session persistence, scaling, GPU nodes)
- Observability for voice (latency metrics, transcription quality, cost tracking)
- Cost optimization (economy stack: Deepgram + GPT-4o-mini + Cartesia)
- Production operations (compliance, failover, SLAs)
Complexity Assessment: Complex (integrative capstone requiring synthesis of 6 chapters)
Proficiency Tier: B2-C1 (capstone requires advanced synthesis)
Justified Lesson Count: 3 lessons (capstone scope)
- Lesson 1: System Architecture & Specification Design (Layer 4: Spec-First)
- Lesson 2: Implementation & Integration (Layer 4: AI Orchestrates Using Skills)
- Lesson 3: Production Deployment & Operations (Layer 4: Validation & Operations)
Reasoning:
- Capstones are integrative, not additive - students already learned the concepts
- 8 integration concepts across 3 lessons = 2-3 concepts per lesson (synthesis, not learning)
- Layer 4 requires spec-first approach: write specification BEFORE implementation
- 3 lessons follow the spec->implement->deploy pattern
- Total ~215 minutes for the complete production voice agent
II. Success Evals (from Part 11 README)
Success Criteria (what students must demonstrate):
- Architecture Design: Students architect a multi-channel voice system combining browser (WebRTC) and phone (SIP/Twilio) with justified technology choices
- Provider Selection: Students select appropriate providers for their use case (Native S2S for premium UX vs Economy Stack for high volume) with cost analysis
- Multimodal Integration: Students integrate voice + screen sharing for context-aware task management
- Natural Conversation: Students implement proper turn-taking, semantic interruption detection, and barge-in handling
- Production Deployment: Students deploy to Kubernetes with session persistence, HPA, and GPU-aware scheduling
- Observability: Students implement voice-specific metrics (latency percentiles, transcription accuracy, cost per call)
- Cost Target: Students achieve $0.03-0.07 per minute target with economy stack
- End-to-End Latency: Students achieve sub-800ms end-to-end latency
All lessons below map to these evals.
III. Accumulated Skills from Part 11
Skills students bring to this capstone (created in Chapters 80-84):
| Chapter | Skill | Purpose in Capstone |
|---|---|---|
| 80 | `livekit-agents` | Browser WebRTC, multi-agent handoff, K8s patterns |
| 81 | `pipecat` | Provider flexibility, custom processors |
| 84 | `voice-telephony` | Phone integration, SIP, Twilio |
| 84 | `web-audio-capture` | Browser audio, Silero VAD |
Direct API knowledge (Chapters 82-83):
- OpenAI Realtime API for native S2S when premium UX required
- Gemini Live API for voice + vision multimodal
Foundational knowledge (Chapter 79):
- Architecture decision matrix (Native S2S vs Cascaded)
- Latency budgets and optimization
- Technology stack tradeoffs
IV. Lesson Sequence
Lesson 1: System Architecture & Specification Design
Title: System Architecture & Specification Design
Learning Objectives:
- Write a production specification for a multi-channel voice agent (spec FIRST)
- Design system architecture combining LiveKit (browser) and Twilio (phone)
- Select providers based on latency, cost, and quality requirements
- Justify technology choices with documented trade-offs
- Define success metrics and acceptance criteria upfront
Stage: Layer 4 (Spec-Driven Integration) - Specification is the PRIMARY artifact
CEFR Proficiency: B2
Integration Concepts (count: 3):
- Multi-channel architecture design
- Provider selection strategy
- Specification-first design
Cognitive Load Validation: 3 integration concepts (synthesis of known material) <= 10 limit (B2) -> WITHIN LIMIT
Maps to Evals: #1 (Architecture Design), #2 (Provider Selection), #7 (Cost Target)
Key Sections:
- The Capstone Project (~5 min)
- What you're building: Voice-enabled Task Manager
- Channels: Browser (WebRTC), Phone (Twilio)
- Capabilities: Voice commands, screen sharing, natural conversation
- Business context: 24/7 voice assistant for task management
- Why spec-first: Define success BEFORE implementation
- Write Your Production Specification (~20 min)
- Intent: What problem does this voice agent solve?
- Channels: Browser (LiveKit WebRTC) + Phone (Twilio SIP)
- User Stories:
- "As a user, I can speak to my Task Manager via browser"
- "As a user, I can call a dedicated phone number to manage tasks"
- "As a user, I can share my screen and say 'add this to my tasks'"
- Functional Requirements:
- Sub-800ms end-to-end latency
- Natural turn-taking with semantic detection
- Barge-in support (user can interrupt)
- Task CRUD via voice (list, create, complete, delete)
- Non-Functional Requirements:
- Cost target: $0.03-0.07 per minute
- 99.5% availability
- GDPR compliance for call recording (if enabled)
- Success Metrics:
- P95 latency < 800ms
- Task creation success rate > 95%
- User satisfaction > 4.0/5.0
- Architecture Decision: Native S2S vs Cascaded (~10 min)
- Review Chapter 79 decision matrix
- For THIS project: Economy stack (cost-sensitive, high volume potential)
- Native S2S reserved for: Premium tiers, demo environments
- Document the decision in your spec:
## Architecture Decision: Cascaded Pipeline
**Decision**: Use cascaded pipeline (STT -> LLM -> TTS) for primary flow
**Rationale**:
- Cost target $0.03-0.07/min rules out Native S2S (~$0.11/min)
- Economy stack achieves $0.033/min
- Latency target (800ms) achievable with cascaded
- Provider flexibility for future optimization
**Trade-off**: Higher latency (500-800ms vs 200-300ms)
**Mitigation**: Semantic turn detection, optimized providers
- Provider Selection (~10 min)
- STT: Deepgram Nova-3 ($0.0077/min, 90ms latency)
- LLM: GPT-4o-mini ($0.0015/min, 200-400ms latency)
- TTS: Cartesia Sonic-3 ($0.024/min, 40-90ms latency)
- VAD: Silero VAD (free, <1ms)
- Total: ~$0.033/min (within target)
- Document provider selection with alternatives (cost sketch below)
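As a sanity check on the totals above, a minimal sketch; the rates are the planning figures from this lesson, not live provider pricing:

```python
# Per-minute cost model for the economy stack; rates are planning figures,
# not live pricing.
DEEPGRAM_STT_PER_MIN = 0.0077   # Deepgram Nova-3
GPT4O_MINI_PER_MIN = 0.0015     # GPT-4o-mini (rough token-based estimate)
CARTESIA_TTS_PER_MIN = 0.024    # Cartesia Sonic-3 (Silero VAD adds no cost)

def economy_stack_cost_per_minute() -> float:
    return DEEPGRAM_STT_PER_MIN + GPT4O_MINI_PER_MIN + CARTESIA_TTS_PER_MIN

print(f"${economy_stack_cost_per_minute():.4f}/min")  # ~$0.0332, within the $0.03-0.07 target
```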
- Multi-Channel Architecture Diagram (~10 min)
- Draw the system architecture:
┌─────────────────────────────────────────────────────────────┐
│ Voice Task Manager │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌────────▼────────┐ ┌───────▼───────┐ ┌────────▼────────┐
│ Browser Client │ │ Phone Client │ │ Screen Share │
│ (LiveKit WebRTC)│ │ (Twilio SIP) │ │ (Gemini Live) │
└────────┬────────┘ └───────┬───────┘ └────────┬────────┘
│ │ │
└─────────────────────┼─────────────────────┘
│
┌──────────▼──────────┐
│ Voice Agent Core │
│ (LiveKit Agents) │
└──────────┬──────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌────────▼────────┐ ┌───────▼───────┐ ┌────────▼────────┐
│ STT Pipeline │ │ LLM Router │ │ TTS Pipeline │
│ (Deepgram Nova-3)│ │ (GPT-4o-mini) │ │ (Cartesia Sonic) │
└─────────────────┘ └───────┬───────┘ └─────────────────┘
│
┌──────────▼──────────┐
│ Task Manager API │
│ (from Part 6/7) │
└─────────────────────┘
- Explain component responsibilities
- Identify integration points
Duration Estimate: 55 minutes
Three Roles Integration (Layer 4 Spec-Driven):
You as Architect:
- Write the specification (AI does not write specs for you)
- Make architecture decisions with documented trade-offs
- Define success metrics
AI as Implementation Partner:
- Review specification for completeness
- Suggest missing requirements based on production experience
- Validate cost calculations
Convergence:
- Iterate on specification until both agree it is complete
- AI validates that spec covers all capstone requirements
Try With AI Prompts:
- Review Your Specification
I wrote a production specification for my voice-enabled Task Manager:
[paste your spec.md]
Review this specification against these criteria:
1. Are all capstone requirements covered? (browser, phone, screen share)
2. Are the success metrics measurable?
3. Is the cost analysis realistic?
4. What am I missing that would cause production issues?
Be critical - I want to find gaps NOW, not during implementation.
What you're learning: Spec validation - catching gaps before implementation.
- Validate Architecture Decisions
I chose a cascaded pipeline over Native S2S for cost reasons:
- Economy stack: $0.033/min (Deepgram + GPT-4o-mini + Cartesia)
- Native S2S: $0.11/min (OpenAI Realtime)
My latency target is sub-800ms. Challenge my decision:
1. Can cascaded pipeline actually achieve 800ms?
2. What scenarios would force me to reconsider Native S2S?
3. What's my fallback if economy stack quality is insufficient?
Help me stress-test this decision.
What you're learning: Decision validation - stress-testing architectural choices.
- Design the Integration Points
My voice agent needs to integrate with:
- LiveKit for browser WebRTC
- Twilio for phone SIP
- Gemini Live for screen sharing
- Task Manager API (from Part 6/7)
Help me design the integration interfaces:
1. How do different channels converge to the same agent logic?
2. How does the agent know which channel a request came from?
3. How do I handle channel-specific features (screen share only on browser)?
Draw the interface contracts between components.
What you're learning: Integration design - defining clean boundaries between systems.
Lesson 2: Implementation & Integration
Title: Implementation & Integration
Learning Objectives:
- Implement voice agent core using accumulated skills (`livekit-agents`, `pipecat`)
- Integrate browser channel with LiveKit WebRTC
- Integrate phone channel with Twilio SIP via `voice-telephony` skill
- Add multimodal screen sharing with Gemini Live API
- Implement natural conversation patterns (turn-taking, barge-in)
- Connect voice agent to Task Manager API via MCP
Stage: Layer 4 (AI Orchestrates Using Skills) - Implementation follows spec
CEFR Proficiency: B2
Integration Concepts (count: 3):
- Multi-channel implementation
- Multimodal integration (voice + vision)
- Conversation design patterns
Cognitive Load Validation: 3 integration concepts <= 10 limit (B2) -> WITHIN LIMIT
Maps to Evals: #3 (Multimodal Integration), #4 (Natural Conversation), #8 (End-to-End Latency)
Key Sections:
- Implementation Strategy (~5 min)
- Spec guides implementation (reference your spec.md)
- Use skills to scaffold, not write from scratch
- Implementation order: Core -> Browser -> Phone -> Screen Share
- Test each channel before integration
- Voice Agent Core (~15 min)
- Use `livekit-agents` skill to scaffold agent
- Configure economy stack providers:

```python
# Scaffold sketch: VoiceAgent and add_mcp_server come from the skill's
# scaffolding and are placeholders, not guaranteed library names.
from livekit.agents import AgentContext
from livekit.plugins import cartesia, deepgram, openai

async def entrypoint(ctx: AgentContext):
    agent = VoiceAgent(
        stt=deepgram.STT(model="nova-3"),      # economy-stack STT
        llm=openai.LLM(model="gpt-4o-mini"),   # economy-stack LLM
        tts=cartesia.TTS(model="sonic-3"),     # economy-stack TTS
    )
    # Connect the Task Manager MCP server so the agent can call task tools
    agent.add_mcp_server("./task-manager-mcp")
    await agent.start(ctx)
```

- Configure semantic turn detection (settings sketch below)
- Add MCP server connection for Task Manager
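Turn-taking behavior deserves explicit configuration. A sketch of plausible settings; the parameter names are illustrative, not a guaranteed `livekit-agents` API, so map them to whatever your skill scaffolds:

```python
# Assumed turn-detection settings; names are illustrative, not guaranteed API.
TURN_DETECTION_SETTINGS = {
    "turn_detection": "semantic",    # model-based end-of-utterance detection
    "min_endpointing_delay": 0.4,    # ~400ms of silence before responding
    "allow_interruptions": True,     # enable barge-in mid-response
}

# Applied to the scaffold above, e.g.:
# agent = VoiceAgent(..., **TURN_DETECTION_SETTINGS)
```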
- Browser Channel (LiveKit WebRTC) (~15 min)
- Use `web-audio-capture` skill for browser client
- WebRTC room connection from browser (token-minting sketch below)
- Audio capture with Silero VAD
- UI considerations: mute button, speaking indicator
- Test: Voice command -> Task creation -> Confirmation
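Browser clients join the LiveKit room with a server-minted access token. A sketch using the LiveKit server SDK for Python (the env var names are assumptions):

```python
# Token-minting sketch using the livekit-api package; env var names
# are assumptions.
import os
from livekit import api

def mint_browser_token(identity: str, room: str) -> str:
    token = (
        api.AccessToken(
            os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"]
        )
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room))
    )
    return token.to_jwt()
```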
- Phone Channel (Twilio SIP) (~15 min)
- Use `voice-telephony` skill for Twilio integration
- SIP trunk configuration
- Phone number provisioning
- Inbound call routing to LiveKit agent
- Test: Call number -> Voice command -> Task creation
- Multimodal Screen Sharing (~15 min)
- Use Gemini Live API knowledge from Chapter 83
- Screen sharing consent and capture
- Visual context sent to Gemini alongside voice
- Example: "Add what's on my screen to my tasks"
- Integration with Task Manager API (Live session sketch below)
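A hedged sketch of the voice + vision session with the `google-genai` SDK; the model name, config keys, and the `frame_bytes`/`play_audio` helpers are assumptions to check against Chapter 83 and the current SDK docs:

```python
# Hedged Gemini Live sketch; model name, config fields, and the
# frame_bytes/play_audio helpers are assumptions, not verified API.
from google import genai

client = genai.Client()

async def screen_aware_turn(frame_bytes: bytes) -> None:
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        # Send a captured screen frame as visual context for the next turn
        await session.send_realtime_input(
            media={"data": frame_bytes, "mime_type": "image/jpeg"}
        )
        async for message in session.receive():
            if message.data:              # audio bytes from the model
                play_audio(message.data)  # hypothetical playback helper
```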
- Natural Conversation Patterns (~10 min)
- Semantic turn detection configuration (from Chapter 80)
- Barge-in handling (from Chapter 82; sketch below)
- Filler speech during async operations
- Confirmation flows for destructive actions
- Test conversational quality
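An illustrative barge-in loop; `vad_events` and `tts_player` are hypothetical stand-ins for your pipeline's VAD stream and TTS playback handle:

```python
# Barge-in sketch; vad_events and tts_player are hypothetical stand-ins.
async def handle_barge_in(vad_events, tts_player) -> None:
    async for event in vad_events:
        if event.type == "speech_start" and tts_player.is_playing():
            await tts_player.stop()  # cut agent speech the moment the user talks
            # The interrupted utterance then flows through STT as a fresh turn
```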
- Integration Testing (~5 min)
- End-to-end test: Browser voice command
- End-to-end test: Phone call flow
- End-to-end test: Screen share task creation
- Measure latency against spec target (800ms; probe sketch below)
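A first-pass wall-clock probe for the latency test; `agent.run_turn` is a hypothetical helper that resolves when reply audio starts:

```python
# Latency probe sketch; agent.run_turn is a hypothetical helper.
import time

async def measure_turn_latency(agent, utterance_audio: bytes) -> float:
    start = time.perf_counter()
    await agent.run_turn(utterance_audio)
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"end-to-end: {latency_ms:.0f}ms (spec target < 800ms)")
    return latency_ms
```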
Duration Estimate: 80 minutes
Three Roles Integration (Layer 4 AI Orchestrates):
AI as Implementer:
- Use your skills to generate implementation code
- AI follows your specification, not its own ideas
You as Validator:
- Test each component against spec requirements
- Validate latency meets targets
- Ensure conversation quality
Convergence:
- Iterate on implementation until all spec requirements pass
- Document any spec changes discovered during implementation
Try With AI Prompts:
- Scaffold the Voice Agent Core
Using my livekit-agents skill, scaffold the voice agent core for my
Task Manager:
From my spec:
- STT: Deepgram Nova-3
- LLM: GPT-4o-mini
- TTS: Cartesia Sonic-3
- MCP: Task Manager API (list_tasks, create_task, complete_task)
Include:
1. Semantic turn detection configuration
2. MCP server connection
3. System prompt for task management persona
4. Graceful error handling
Follow my spec exactly. I'll test and validate.
What you're learning: Spec-driven implementation - AI implements YOUR specification.
- Integrate Phone Channel
Using my voice-telephony skill, integrate Twilio phone support:
Requirements from my spec:
- Inbound calls routed to LiveKit agent
- Same conversation logic as browser channel
- Phone-specific greeting: "Task Manager speaking, how can I help?"
I have:
- Twilio account with SIP trunk configured
- LiveKit server running
Walk me through the Twilio -> LiveKit routing configuration.
What you're learning: Channel integration - connecting telephony to voice agent.
- Add Screen Sharing
I need to add screen sharing capability using Gemini Live API.
Use case from my spec:
- User shares screen while talking
- Says: "Add what I'm looking at to my tasks"
- Agent sees the screen, extracts context, creates task
Using my knowledge from Chapter 83, help me:
1. Configure Gemini Live for voice + vision
2. Integrate with my LiveKit-based voice agent
3. Handle the screen share permission flow
4. Extract visual context for task creation
This is the multimodal piece of my capstone.
What you're learning: Multimodal integration - combining voice and vision modalities.
Lesson 3: Production Deployment & Operations
Title: Production Deployment & Operations
Learning Objectives:
- Deploy voice agent to Kubernetes with production configurations
- Implement session persistence across pod restarts
- Configure horizontal pod autoscaling for voice workloads
- Set up voice-specific observability (latency, quality, cost metrics)
- Implement cost monitoring against $0.03-0.07/min target
- Document compliance considerations (recording, consent)
- Design failover strategies for voice infrastructure
Stage: Layer 4 (Validation & Operations) - Production readiness
CEFR Proficiency: B2-C1
Integration Concepts (count: 2):
- Kubernetes voice deployment
- Observability and operations
Cognitive Load Validation: 2 integration concepts <= 10 limit (B2-C1) -> WITHIN LIMIT
Maps to Evals: #5 (Production Deployment), #6 (Observability), #7 (Cost Target)
Key Sections:
- Kubernetes Deployment Strategy (~10 min)
- Review Part 7 Kubernetes patterns
- Voice-specific considerations:
- Session affinity for conversation continuity
- Redis for session state persistence
- GPU nodes for VAD/turn detection models (if used)
- Deployment architecture:
```yaml
# voice-agent deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: voice-agent
  template:
    metadata:
      labels:
        app: voice-agent   # must match the selector above
    spec:
      containers:
        - name: voice-agent
          image: task-manager-voice:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          env:
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: voice-secrets
                  key: redis-url
```
- Session Persistence (~10 min)
- Why session persistence matters: Mid-call pod restart
- Redis configuration for session state
- Session reconnection logic
- Test: Pod restart during active call
- Code: session-persistence implementation (sketch below)
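A minimal persistence sketch with `redis-py`; the key scheme, TTL, and shape of the state dict are assumptions to adapt:

```python
# Redis session-persistence sketch; key naming, TTL, and state fields
# are assumptions for illustration.
import json
import redis

r = redis.Redis.from_url("redis://voice-redis:6379")
SESSION_TTL_SECONDS = 3600  # drop abandoned sessions after an hour

def save_session(session_id: str, state: dict) -> None:
    """Persist conversation state so a replacement pod can resume the call."""
    r.set(f"voice:session:{session_id}", json.dumps(state), ex=SESSION_TTL_SECONDS)

def load_session(session_id: str) -> dict | None:
    """Restore state after a pod restart; None means a fresh session."""
    raw = r.get(f"voice:session:{session_id}")
    return json.loads(raw) if raw else None
```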
- Horizontal Pod Autoscaling (~10 min)
- Scaling voice workloads (CPU-bound, not memory-bound)
- HPA configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

- Scaling based on concurrent sessions vs CPU
- Test: Load test with scaling
- Voice Observability Stack (~15 min)
- Key metrics for voice agents:
  - `voice_latency_p95`: End-to-end response time (target: 800ms)
  - `voice_stt_duration`: Speech-to-text processing time
  - `voice_llm_duration`: LLM response time
  - `voice_tts_duration`: Text-to-speech processing time
  - `voice_cost_per_call`: Running cost calculation
  - `voice_transcription_errors`: STT quality indicator
- Prometheus metrics exposition (sketch below)
- Grafana dashboard for voice operations
- Alerting: Latency > 1s, Cost > $0.10/min
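A sketch of the exposition side with `prometheus_client`; the metric names follow the list above, while the bucket edges and label set are assumptions:

```python
# prometheus_client exposition sketch; bucket edges and labels are assumptions.
from prometheus_client import Counter, Histogram, start_http_server

VOICE_LATENCY = Histogram(
    "voice_latency_seconds",
    "End-to-end voice response time",
    buckets=(0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0),  # spec target: P95 < 0.8s
)
VOICE_COST = Counter(
    "voice_cost_dollars_total", "Accumulated call cost in dollars", ["provider"]
)

start_http_server(9090)  # serve /metrics for the Prometheus scraper

# Usage inside the agent loop (names from the Lesson 2 sketch):
# with VOICE_LATENCY.time():
#     reply = await agent.run_turn(utterance_audio)
# VOICE_COST.labels(provider="deepgram").inc(stt_cost)
```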
- Cost Monitoring & Optimization (~10 min)
- Cost tracking implementation:

```python
# Track per-call costs using the planning rates from the spec
stt_cost = audio_duration_minutes * 0.0077   # Deepgram Nova-3 STT
llm_cost = tokens * 0.000002                 # GPT-4o-mini, rough per-token rate
tts_cost = audio_duration_minutes * 0.024    # Cartesia Sonic-3 TTS
total_cost = stt_cost + llm_cost + tts_cost
metrics.observe('voice_cost_per_call', total_cost)
```

- Cost dashboard with daily/weekly rollups
- Alerting when cost exceeds target
- Optimization strategies: Caching, prompt optimization
- Compliance & Recording (~10 min)
- Recording consent requirements (varies by jurisdiction)
- GDPR considerations for EU users
- Data retention policies
- Consent flow implementation:
- Browser: Click-to-consent before microphone
- Phone: "This call may be recorded for quality purposes"
- Recording storage and access controls
- Failover & Resilience (~10 min)
- Provider failover: Deepgram -> Whisper fallback (sketch below)
- Regional failover: Multi-region deployment
- Graceful degradation: Text fallback when voice fails
- SLA considerations: 99.5% availability target
- Incident runbook for voice outages
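The provider-failover bullet above can be made concrete with a small wrapper. A sketch, where `stt_deepgram` and `stt_whisper` are hypothetical async wrappers around your actual client calls:

```python
# Failover sketch; stt_deepgram/stt_whisper are hypothetical wrappers
# around the real provider clients.
import asyncio

async def transcribe_with_failover(audio: bytes) -> str:
    """Try the primary STT; fall back to the secondary on timeout or error."""
    try:
        return await asyncio.wait_for(stt_deepgram(audio), timeout=2.0)
    except (asyncio.TimeoutError, ConnectionError):
        # Degraded mode: higher latency, but the call keeps going
        return await stt_whisper(audio)
```

The same pattern covers TTS fallback; regional failover belongs at the load-balancer/DNS layer rather than in agent code.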
- Production Validation (~5 min)
- Final checklist against spec:
- Sub-800ms P95 latency
- $0.03-0.07/min cost achieved
- Browser channel working
- Phone channel working
- Screen share working
- Session persistence validated
- Monitoring and alerting active
- Sign-off: Production-ready voice agent
Duration Estimate: 80 minutes
Three Roles Integration (Layer 4 Operations):
AI as Operations Advisor:
- Suggest deployment patterns based on your requirements
- Help design monitoring and alerting
You as Operator:
- Deploy and validate in your environment
- Make compliance decisions for your jurisdiction
- Own the production system
Convergence:
- Production checklist complete
- All spec requirements validated
- Voice-enabled Task Manager is live
Try With AI Prompts:
- Generate Kubernetes Manifests
Using my livekit-agents skill and Part 7 knowledge, generate
Kubernetes manifests for my voice agent:
Requirements from my spec:
- 2-20 replicas based on load
- Redis for session persistence
- Prometheus metrics exposure
- Health checks (liveness + readiness)
- Secrets management for API keys
My cluster: [describe your K8s setup]
Generate the complete manifest set I can apply.
What you're learning: Production deployment - operationalizing voice agents.
- Design the Observability Dashboard
I need a Grafana dashboard for monitoring my voice agent:
Key metrics to track:
- End-to-end latency (P50, P95, P99)
- Per-component latency (STT, LLM, TTS)
- Cost per call and daily cost
- Concurrent sessions
- Error rates
Help me design:
1. Dashboard layout and panels
2. Prometheus queries for each metric
3. Alerting rules (latency > 1s, cost > $0.10/min)
I'll implement and configure based on your design.
What you're learning: Voice observability - monitoring what matters for voice systems.
- Plan the Failover Strategy
My voice agent needs resilience against:
1. Primary STT provider (Deepgram) goes down
2. Primary TTS provider (Cartesia) goes down
3. AWS region outage
Help me design failover for each scenario:
- Detection: How do I know there's a problem?
- Failover: What's the backup?
- Recovery: How do I return to primary?
- User impact: What do users experience?
I need this documented for my operations runbook.
What you're learning: Resilience engineering - designing for failure in voice systems.
V. Skill Composition
Skills Used in This Capstone (NOT created, composed):
| Skill | Source | Role in Capstone |
|---|---|---|
| `livekit-agents` | Chapter 80 | Core voice agent, WebRTC, multi-agent patterns |
| `pipecat` | Chapter 81 | Provider flexibility, custom processors |
| `voice-telephony` | Chapter 84 | Phone/Twilio integration |
| `web-audio-capture` | Chapter 84 | Browser audio capture |
Direct API Knowledge Applied:
- OpenAI Realtime API (Chapter 82): Understanding for Native S2S comparison
- Gemini Live API (Chapter 83): Multimodal screen sharing integration
No new skills created - Capstones compose existing skills into production systems.
VI. Assessment Plan
Formative Assessments (During Lessons)
- Lesson 1: Specification review (complete, measurable, realistic)
- Lesson 2: Integration testing (each channel works independently)
- Lesson 3: Production checklist (all items validated)
Summative Assessment (End of Chapter)
Capstone Demonstration:
Students demonstrate their production voice agent:
- Browser Demo (5 min)
- Navigate to Task Manager
- Click voice button, grant microphone
- Create task via voice: "Add a task to review the Q4 proposal"
- List tasks: "What are my open tasks?"
- Complete task: "Mark the proposal review as done"
- Verify sub-800ms latency in monitoring dashboard
- Phone Demo (5 min)
- Call the dedicated phone number
- Create task via phone
- Verify call appears in monitoring
- Show cost tracking
- Screen Share Demo (5 min)
- Share screen showing a webpage/document
- Say: "Add what I'm looking at to my tasks"
- Verify task created with visual context
- Production Operations Demo (5 min)
- Show Grafana dashboard with live metrics
- Show cost tracking against $0.03-0.07 target
- Show Kubernetes deployment status
- Explain failover strategy
Grading Criteria:
| Criterion | Weight | Excellent | Satisfactory | Needs Improvement |
|---|---|---|---|---|
| Specification Quality | 20% | Complete, measurable, realistic | Mostly complete | Missing key requirements |
| Implementation | 30% | All channels working, <800ms latency | 2/3 channels working | 1 channel working |
| Production Deployment | 25% | K8s + monitoring + alerting | K8s + monitoring | K8s only |
| Cost Achievement | 15% | $0.03-0.07/min achieved | $0.07-0.10/min | >$0.10/min |
| Documentation | 10% | Spec + runbook + architecture | Spec + architecture | Spec only |
VII. Validation Checklist
Chapter-Level Validation:
- Chapter type identified: TECHNICAL (LAYER 4 CAPSTONE)
- Concept density analysis documented: 8 integration concepts across 3 lessons
- Lesson count justified: 3 lessons (spec->implement->deploy pattern)
- All evals covered by lessons
- All lessons map to at least one eval
- NOT a skill-first chapter (capstones compose, not create)
Stage Progression Validation:
- Lesson 1: Layer 4 (Spec-First) - Write specification BEFORE implementation
- Lesson 2: Layer 4 (AI Orchestrates) - Use skills to implement spec
- Lesson 3: Layer 4 (Validation) - Production deployment and operations
- All prior layers (1-3) assumed completed in Chapters 79-84
Cognitive Load Validation:
- Lesson 1: 3 integration concepts <= 10 (B2 limit) PASS
- Lesson 2: 3 integration concepts <= 10 (B2 limit) PASS
- Lesson 3: 2 integration concepts <= 10 (B2-C1 limit) PASS
Capstone Requirements:
- Composes existing skills (not creates new ones)
- Spec-first approach (specification before implementation)
- Multi-channel integration (browser + phone)
- Multimodal integration (voice + vision)
- Production deployment (Kubernetes)
- Observability (monitoring, alerting, cost tracking)
- Business considerations (cost, compliance, SLAs)
Cross-Chapter Dependencies:
- Requires: Chapter 79 (architecture mental models)
- Requires: Chapter 80 (livekit-agents skill)
- Requires: Chapter 81 (pipecat skill)
- Requires: Chapter 82 (OpenAI Realtime understanding)
- Requires: Chapter 83 (Gemini Live for multimodal)
- Requires: Chapter 84 (voice-telephony, web-audio-capture skills)
- Requires: Part 7 (Kubernetes deployment patterns)
Three Roles Validation (Layer 4):
- Spec-driven: Student writes specification, AI validates
- AI orchestrates: AI uses student's skills to implement student's spec
- Student validates: Student tests against spec requirements
VIII. File Structure
67-capstone-production-voice-agent/
├── _category_.json # Existing
├── README.md # Chapter overview (create)
├── 01-system-architecture.md # Lesson 1: Spec-First Design (create)
├── 02-implementation.md # Lesson 2: Implementation (create)
├── 03-production-deployment.md # Lesson 3: Production Ops (create)
└── 04-capstone-assessment.md # Final assessment rubric (create)
IX. Summary
Chapter 85: Production Voice Agent (Capstone) is a 3-lesson Layer 4 integration chapter:
| Lesson | Title | Integration Concepts | Duration | Evals |
|---|---|---|---|---|
| 1 | System Architecture & Specification Design | 3 | 55 min | #1, #2, #7 |
| 2 | Implementation & Integration | 3 | 80 min | #3, #4, #8 |
| 3 | Production Deployment & Operations | 2 | 80 min | #5, #6, #7 |
Total: 8 integration concepts, ~215 minutes, production voice-enabled Task Manager
Capstone Output:
- Production specification for voice-enabled Task Manager
- Multi-channel voice agent (browser + phone + screen share)
- Kubernetes deployment with session persistence
- Observability dashboard with cost tracking
- Compliance and failover documentation
Skills Composed (not created):
`livekit-agents`, `pipecat`, `voice-telephony`, `web-audio-capture`
Production Targets:
- Sub-800ms end-to-end latency (P95)
- $0.03-0.07 per minute cost
- 99.5% availability
- Multi-channel support (browser + phone)
- Multimodal support (voice + screen sharing)
X. Connection to Book Thesis
This capstone fulfills Part 11's contribution to the book's thesis: "Manufacture Digital FTEs powered by agents, specs, skills."
Students graduate Part 11 with:
- Skills: `livekit-agents`, `pipecat`, `voice-telephony`, `web-audio-capture`
- Spec: Production specification for voice-enabled Task Manager
- Digital FTE: A 24/7 voice assistant that can answer phones, accept browser commands, and see user screens
The voice-enabled Task Manager is a sellable Digital FTE component—a production-ready voice assistant built on documented specifications and reusable skills.