Chapter 85: Production Voice Agent (Capstone) - Lesson Plan
Generated by: chapter-planner v2.0.0 (Reasoning-Activated)
Source: Part 11 README, Chapters 79-84 context, Deep Search Report
Created: 2026-01-02
Constitution: v6.0.0 (Reasoning Mode)
I. Chapter Analysis
Chapter Type
TECHNICAL (LAYER 4 CAPSTONE) - This is the integrative capstone for Part 11. Students orchestrate ALL accumulated knowledge from Chapters 79-84 into a production-grade voice-enabled Task Manager.
Recognition signals:
- Learning objectives use "design/architect/deploy/orchestrate"
- No new framework concepts - composition of existing skills
- Layer 4: Spec-Driven Integration (spec FIRST, then implementation)
- Composition of skills: `livekit-agents`, `pipecat`, `voice-telephony`, `web-audio-capture`
- Production deployment with Kubernetes, monitoring, cost optimization
- Business considerations: cost analysis, compliance, SLAs
This is NOT a skill-first (L00) chapter - Capstones COMPOSE existing skills, they don't create new ones.
Concept Density Analysis
Core Concepts (from Capstone requirements): 8 integration concepts
- Multi-channel architecture (browser WebRTC + phone SIP)
- Provider selection strategy (when Native S2S vs Cascaded Pipeline)
- Multimodal integration (voice + screen sharing via Gemini)
- Conversation design (turn-taking, interruption, barge-in patterns)
- Kubernetes voice deployment (session persistence, scaling, GPU nodes)
- Observability for voice (latency metrics, transcription quality, cost tracking)
- Cost optimization (economy stack: Deepgram + GPT-4o-mini + Cartesia)
- Production operations (compliance, failover, SLAs)
Complexity Assessment: Complex (integrative capstone requiring synthesis of 6 chapters)
Proficiency Tier: B2-C1 (capstone requires advanced synthesis)
Justified Lesson Count: 3 lessons (capstone scope)
- Lesson 1: System Architecture & Specification Design (Layer 4: Spec-First)
- Lesson 2: Implementation & Integration (Layer 4: AI Orchestrates Using Skills)
- Lesson 3: Production Deployment & Operations (Layer 4: Validation & Operations)
Reasoning:
- Capstones are integrative, not additive - students already learned the concepts
- 8 integration concepts across 3 lessons = 2-3 concepts per lesson (synthesis, not learning)
- Layer 4 requires spec-first approach: write specification BEFORE implementation
- 3 lessons follow the spec->implement->deploy pattern
- Total ~215 minutes for the complete production voice agent
II. Success Evals (from Part 11 README)
Success Criteria (what students must demonstrate):
- Architecture Design: Students architect a multi-channel voice system combining browser (WebRTC) and phone (SIP/Twilio) with justified technology choices
- Provider Selection: Students select appropriate providers for their use case (Native S2S for premium UX vs Economy Stack for high volume) with cost analysis
- Multimodal Integration: Students integrate voice + screen sharing for context-aware task management
- Natural Conversation: Students implement proper turn-taking, semantic interruption detection, and barge-in handling
- Production Deployment: Students deploy to Kubernetes with session persistence, HPA, and GPU-aware scheduling
- Observability: Students implement voice-specific metrics (latency percentiles, transcription accuracy, cost per call)
- Cost Target: Students achieve $0.03-0.07 per minute target with economy stack
- End-to-End Latency: Students achieve sub-800ms end-to-end latency
All lessons below map to these evals.
III. Accumulated Skills from Part 11
Skills students bring to this capstone (created in Chapters 80-84):
| Chapter | Skill | Purpose in Capstone |
|---|---|---|
| 80 | `livekit-agents` | Browser WebRTC, multi-agent handoff, K8s patterns |
| 81 | `pipecat` | Provider flexibility, custom processors |
| 84 | `voice-telephony` | Phone integration, SIP, Twilio |
| 84 | `web-audio-capture` | Browser audio, Silero VAD |
Direct API knowledge (Chapters 82-83):
- OpenAI Realtime API for native S2S when premium UX required
- Gemini Live API for voice + vision multimodal
Foundational knowledge (Chapter 79):
- Architecture decision matrix (Native S2S vs Cascaded)
- Latency budgets and optimization
- Technology stack tradeoffs
IV. Lesson Sequence
Lesson 1: System Architecture & Specification Design
Title: System Architecture & Specification Design
Learning Objectives:
- Write a production specification for a multi-channel voice agent (spec FIRST)
- Design system architecture combining LiveKit (browser) and Twilio (phone)
- Select providers based on latency, cost, and quality requirements
- Justify technology choices with documented trade-offs
- Define success metrics and acceptance criteria upfront
Stage: Layer 4 (Spec-Driven Integration) - Specification is the PRIMARY artifact
CEFR Proficiency: B2
Integration Concepts (count: 3):
- Multi-channel architecture design
- Provider selection strategy
- Specification-first design
Cognitive Load Validation: 3 integration concepts (synthesis of known material) <= 10 limit (B2) -> WITHIN LIMIT
Maps to Evals: #1 (Architecture Design), #2 (Provider Selection), #7 (Cost Target)
Key Sections:
- The Capstone Project (~5 min)
- What you're building: Voice-enabled Task Manager
- Channels: Browser (WebRTC), Phone (Twilio)
- Capabilities: Voice commands, screen sharing, natural conversation
- Business context: 24/7 voice assistant for task management
- Why spec-first: Define success BEFORE implementation
- Write Your Production Specification (~20 min)
- Intent: What problem does this voice agent solve?
- Channels: Browser (LiveKit WebRTC) + Phone (Twilio SIP)
- User Stories:
- "As a user, I can speak to my Task Manager via browser"
- "As a user, I can call a dedicated phone number to manage tasks"
- "As a user, I can share my screen and say 'add this to my tasks'"
- Functional Requirements:
- Sub-800ms end-to-end latency
- Natural turn-taking with semantic detection
- Barge-in support (user can interrupt)
- Task CRUD via voice (list, create, complete, delete)
- Non-Functional Requirements:
- Cost target: $0.03-0.07 per minute
- 99.5% availability
- GDPR compliance for call recording (if enabled)
- Success Metrics:
- P95 latency < 800ms
- Task creation success rate > 95%
- User satisfaction > 4.0/5.0
- Architecture Decision: Native S2S vs Cascaded (~10 min)
- Review Chapter 79 decision matrix
- For THIS project: Economy stack (cost-sensitive, high volume potential)
- Native S2S reserved for: Premium tiers, demo environments
- Document the decision in your spec:
## Architecture Decision: Cascaded Pipeline
**Decision**: Use cascaded pipeline (STT -> LLM -> TTS) for primary flow
**Rationale**:
- Cost target $0.03-0.07/min rules out Native S2S (~$0.11/min)
- Economy stack achieves $0.033/min
- Latency target (800ms) achievable with cascaded
- Provider flexibility for future optimization
**Trade-off**: Higher latency (500-800ms vs 200-300ms)
**Mitigation**: Semantic turn detection, optimized providers
- Provider Selection (~10 min)
- STT: Deepgram Nova-3 ($0.0077/min, 90ms latency)
- LLM: GPT-4o-mini ($0.0015/min, 200-400ms latency)
- TTS: Cartesia Sonic-3 ($0.024/min, 40-90ms latency)
- VAD: Silero VAD (free, <1ms)
- Total: ~$0.033/min (within target)
- Document provider selection with alternatives (cost sketch below)
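As a sanity check on the totals above, a minimal sketch; the rates are the planning figures from this lesson, not live provider pricing:

```python
# Per-minute cost model for the economy stack; rates are planning figures,
# not live pricing.
DEEPGRAM_STT_PER_MIN = 0.0077   # Deepgram Nova-3
GPT4O_MINI_PER_MIN = 0.0015     # GPT-4o-mini (rough token-based estimate)
CARTESIA_TTS_PER_MIN = 0.024    # Cartesia Sonic-3 (Silero VAD adds no cost)

def economy_stack_cost_per_minute() -> float:
    return DEEPGRAM_STT_PER_MIN + GPT4O_MINI_PER_MIN + CARTESIA_TTS_PER_MIN

print(f"${economy_stack_cost_per_minute():.4f}/min")  # ~$0.0332, within the $0.03-0.07 target
```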
- Multi-Channel Architecture Diagram (~10 min)
- Draw the system architecture:
┌─────────────────────────────────────────────────────────────┐
│ Voice Task Manager │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌────────▼────────┐ ┌───────▼───────┐ ┌────────▼────────┐
│ Browser Client │ │ Phone Client │ │ Screen Share │
│ (LiveKit WebRTC)│ │ (Twilio SIP) │ │ (Gemini Live) │
└────────┬────────┘ └───────┬───────┘ └────────┬────────┘
│ │ │
└─────────────────────┼─────────────────────┘
│
┌──────────▼──────────┐
│ Voice Agent Core │
│ (LiveKit Agents) │
└──────────┬──────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌────────▼────────┐ ┌───────▼───────┐ ┌────────▼────────┐
│ STT Pipeline │ │ LLM Router │ │ TTS Pipeline │
│ (Deepgram Nova-3)│ │ (GPT-4o-mini) │ │ (Cartesia Sonic) │
└─────────────────┘ └───────┬───────┘ └─────────────────┘
│
┌──────────▼──────────┐
│ Task Manager API │
│ (from Part 6/7) │
└─────────────────────┘
- Explain component responsibilities
- Identify integration points
Duration Estimate: 55 minutes
Three Roles Integration (Layer 4 Spec-Driven):
You as Architect:
- Write the specification (AI does not write specs for you)
- Make architecture decisions with documented trade-offs
- Define success metrics
AI as Implementation Partner:
- Review specification for completeness
- Suggest missing requirements based on production experience
- Validate cost calculations
Convergence:
- Iterate on specification until both agree it is complete
- AI validates that spec covers all capstone requirements
Try With AI Prompts:
- Review Your Specification
I wrote a production specification for my voice-enabled Task Manager:
[paste your spec.md]
Review this specification against these criteria:
1. Are all capstone requirements covered? (browser, phone, screen share)
2. Are the success metrics measurable?
3. Is the cost analysis realistic?
4. What am I missing that would cause production issues?
Be critical - I want to find gaps NOW, not during implementation.
What you're learning: Spec validation - catching gaps before implementation.
- Validate Architecture Decisions
I chose a cascaded pipeline over Native S2S for cost reasons:
- Economy stack: $0.033/min (Deepgram + GPT-4o-mini + Cartesia)
- Native S2S: $0.11/min (OpenAI Realtime)
My latency target is sub-800ms. Challenge my decision:
1. Can cascaded pipeline actually achieve 800ms?
2. What scenarios would force me to reconsider Native S2S?
3. What's my fallback if economy stack quality is insufficient?
Help me stress-test this decision.
What you're learning: Decision validation - stress-testing architectural choices.
- Design the Integration Points
My voice agent needs to integrate with:
- LiveKit for browser WebRTC
- Twilio for phone SIP
- Gemini Live for screen sharing
- Task Manager API (from Part 6/7)
Help me design the integration interfaces:
1. How do different channels converge to the same agent logic?
2. How does the agent know which channel a request came from?
3. How do I handle channel-specific features (screen share only on browser)?
Draw the interface contracts between components.
What you're learning: Integration design - defining clean boundaries between systems.
Lesson 2: Implementation & Integration
Title: Implementation & Integration
Learning Objectives:
- Implement voice agent core using accumulated skills (`livekit-agents`, `pipecat`)
- Integrate browser channel with LiveKit WebRTC
- Integrate phone channel with Twilio SIP via `voice-telephony` skill
- Add multimodal screen sharing with Gemini Live API
- Implement natural conversation patterns (turn-taking, barge-in)
- Connect voice agent to Task Manager API via MCP
Stage: Layer 4 (AI Orchestrates Using Skills) - Implementation follows spec
CEFR Proficiency: B2
Integration Concepts (count: 3):
- Multi-channel implementation
- Multimodal integration (voice + vision)
- Conversation design patterns
Cognitive Load Validation: 3 integration concepts <= 10 limit (B2) -> WITHIN LIMIT
Maps to Evals: #3 (Multimodal Integration), #4 (Natural Conversation), #8 (End-to-End Latency)
Key Sections:
- Implementation Strategy (~5 min)
- Spec guides implementation (reference your spec.md)
- Use skills to scaffold, not write from scratch
- Implementation order: Core -> Browser -> Phone -> Screen Share
- Test each channel before integration
- Voice Agent Core (~15 min)
- Use `livekit-agents` skill to scaffold agent
- Configure economy stack providers:

```python
# Scaffold sketch: VoiceAgent and add_mcp_server come from the skill's
# scaffolding and are placeholders, not guaranteed library names.
from livekit.agents import AgentContext
from livekit.plugins import cartesia, deepgram, openai

async def entrypoint(ctx: AgentContext):
    agent = VoiceAgent(
        stt=deepgram.STT(model="nova-3"),      # economy-stack STT
        llm=openai.LLM(model="gpt-4o-mini"),   # economy-stack LLM
        tts=cartesia.TTS(model="sonic-3"),     # economy-stack TTS
    )
    # Connect the Task Manager MCP server so the agent can call task tools
    agent.add_mcp_server("./task-manager-mcp")
    await agent.start(ctx)
```

- Configure semantic turn detection (settings sketch below)
- Add MCP server connection for Task Manager
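Turn-taking behavior deserves explicit configuration. A sketch of plausible settings; the parameter names are illustrative, not a guaranteed `livekit-agents` API, so map them to whatever your skill scaffolds:

```python
# Assumed turn-detection settings; names are illustrative, not guaranteed API.
TURN_DETECTION_SETTINGS = {
    "turn_detection": "semantic",    # model-based end-of-utterance detection
    "min_endpointing_delay": 0.4,    # ~400ms of silence before responding
    "allow_interruptions": True,     # enable barge-in mid-response
}

# Applied to the scaffold above, e.g.:
# agent = VoiceAgent(..., **TURN_DETECTION_SETTINGS)
```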
- Browser Channel (LiveKit WebRTC) (~15 min)
- Use `web-audio-capture` skill for browser client
- WebRTC room connection from browser (token-minting sketch below)
- Audio capture with Silero VAD
- UI considerations: mute button, speaking indicator
- Test: Voice command -> Task creation -> Confirmation
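Browser clients join the LiveKit room with a server-minted access token. A sketch using the LiveKit server SDK for Python (the env var names are assumptions):

```python
# Token-minting sketch using the livekit-api package; env var names
# are assumptions.
import os
from livekit import api

def mint_browser_token(identity: str, room: str) -> str:
    token = (
        api.AccessToken(
            os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"]
        )
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room))
    )
    return token.to_jwt()
```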
- Phone Channel (Twilio SIP) (~15 min)
- Use `voice-telephony` skill for Twilio integration
- SIP trunk configuration
- Phone number provisioning
- Inbound call routing to LiveKit agent
- Test: Call number -> Voice command -> Task creation
- Multimodal Screen Sharing (~15 min)
- Use Gemini Live API knowledge from Chapter 83
- Screen sharing consent and capture
- Visual context sent to Gemini alongside voice
- Example: "Add what's on my screen to my tasks"
- Integration with Task Manager API (Live session sketch below)
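A hedged sketch of the voice + vision session with the `google-genai` SDK; the model name, config keys, and the `frame_bytes`/`play_audio` helpers are assumptions to check against Chapter 83 and the current SDK docs:

```python
# Hedged Gemini Live sketch; model name, config fields, and the
# frame_bytes/play_audio helpers are assumptions, not verified API.
from google import genai

client = genai.Client()

async def screen_aware_turn(frame_bytes: bytes) -> None:
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        # Send a captured screen frame as visual context for the next turn
        await session.send_realtime_input(
            media={"data": frame_bytes, "mime_type": "image/jpeg"}
        )
        async for message in session.receive():
            if message.data:              # audio bytes from the model
                play_audio(message.data)  # hypothetical playback helper
```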
- Natural Conversation Patterns (~10 min)
- Semantic turn detection configuration (from Chapter 80)
- Barge-in handling (from Chapter 82; sketch below)
- Filler speech during async operations
- Confirmation flows for destructive actions
- Test conversational quality
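An illustrative barge-in loop; `vad_events` and `tts_player` are hypothetical stand-ins for your pipeline's VAD stream and TTS playback handle:

```python
# Barge-in sketch; vad_events and tts_player are hypothetical stand-ins.
async def handle_barge_in(vad_events, tts_player) -> None:
    async for event in vad_events:
        if event.type == "speech_start" and tts_player.is_playing():
            await tts_player.stop()  # cut agent speech the moment the user talks
            # The interrupted utterance then flows through STT as a fresh turn
```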
- Integration Testing (~5 min)
- End-to-end test: Browser voice command
- End-to-end test: Phone call flow
- End-to-end test: Screen share task creation
- Measure latency against spec target (800ms; probe sketch below)
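A first-pass wall-clock probe for the latency test; `agent.run_turn` is a hypothetical helper that resolves when reply audio starts:

```python
# Latency probe sketch; agent.run_turn is a hypothetical helper.
import time

async def measure_turn_latency(agent, utterance_audio: bytes) -> float:
    start = time.perf_counter()
    await agent.run_turn(utterance_audio)
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"end-to-end: {latency_ms:.0f}ms (spec target < 800ms)")
    return latency_ms
```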
Duration Estimate: 80 minutes
Three Roles Integration (Layer 4 AI Orchestrates):
AI as Implementer:
- Use your skills to generate implementation code
- AI follows your specification, not its own ideas
You as Validator:
- Test each component against spec requirements
- Validate latency meets targets
- Ensure conversation quality
Convergence:
- Iterate on implementation until all spec requirements pass
- Document any spec changes discovered during implementation
Try With AI Prompts:
- Scaffold the Voice Agent Core
Using my livekit-agents skill, scaffold the voice agent core for my
Task Manager:
From my spec:
- STT: Deepgram Nova-3
- LLM: GPT-4o-mini
- TTS: Cartesia Sonic-3
- MCP: Task Manager API (list_tasks, create_task, complete_task)
Include:
1. Semantic turn detection configuration
2. MCP server connection
3. System prompt for task management persona
4. Graceful error handling
Follow my spec exactly. I'll test and validate.
What you're learning: Spec-driven implementation - AI implements YOUR specification.
- Integrate Phone Channel
Using my voice-telephony skill, integrate Twilio phone support:
Requirements from my spec:
- Inbound calls routed to LiveKit agent
- Same conversation logic as browser channel
- Phone-specific greeting: "Task Manager speaking, how can I help?"
I have:
- Twilio account with SIP trunk configured
- LiveKit server running
Walk me through the Twilio -> LiveKit routing configuration.
What you're learning: Channel integration - connecting telephony to voice agent.
- Add Screen Sharing
I need to add screen sharing capability using Gemini Live API.
Use case from my spec:
- User shares screen while talking
- Says: "Add what I'm looking at to my tasks"
- Agent sees the screen, extracts context, creates task
Using my knowledge from Chapter 83, help me:
1. Configure Gemini Live for voice + vision
2. Integrate with my LiveKit-based voice agent
3. Handle the screen share permission flow
4. Extract visual context for task creation
This is the multimodal piece of my capstone.
What you're learning: Multimodal integration - combining voice and vision modalities.
Lesson 3: Production Deployment & Operations
Title: Production Deployment & Operations
Learning Objectives:
- Deploy voice agent to Kubernetes with production configurations
- Implement session persistence across pod restarts
- Configure horizontal pod autoscaling for voice workloads
- Set up voice-specific observability (latency, quality, cost metrics)
- Implement cost monitoring against $0.03-0.07/min target
- Document compliance considerations (recording, consent)
- Design failover strategies for voice infrastructure
Stage: Layer 4 (Validation & Operations) - Production readiness
CEFR Proficiency: B2-C1
Integration Concepts (count: 2):
- Kubernetes voice deployment
- Observability and operations
Cognitive Load Validation: 2 integration concepts <= 10 limit (B2-C1) -> WITHIN LIMIT
Maps to Evals: #5 (Production Deployment), #6 (Observability), #7 (Cost Target)
Key Sections:
- Kubernetes Deployment Strategy (~10 min)
- Review Part 7 Kubernetes patterns
- Voice-specific considerations:
- Session affinity for conversation continuity
- Redis for session state persistence
- GPU nodes for VAD/turn detection models (if used)
- Deployment architecture:
```yaml
# voice-agent deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: voice-agent
  template:
    metadata:
      labels:
        app: voice-agent   # must match the selector above
    spec:
      containers:
        - name: voice-agent
          image: task-manager-voice:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          env:
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: voice-secrets
                  key: redis-url
```
- Session Persistence (~10 min)
- Why session persistence matters: Mid-call pod restart
- Redis configuration for session state
- Session reconnection logic
- Test: Pod restart during active call
- Code: session-persistence implementation (sketch below)
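A minimal persistence sketch with `redis-py`; the key scheme, TTL, and shape of the state dict are assumptions to adapt:

```python
# Redis session-persistence sketch; key naming, TTL, and state fields
# are assumptions for illustration.
import json
import redis

r = redis.Redis.from_url("redis://voice-redis:6379")
SESSION_TTL_SECONDS = 3600  # drop abandoned sessions after an hour

def save_session(session_id: str, state: dict) -> None:
    """Persist conversation state so a replacement pod can resume the call."""
    r.set(f"voice:session:{session_id}", json.dumps(state), ex=SESSION_TTL_SECONDS)

def load_session(session_id: str) -> dict | None:
    """Restore state after a pod restart; None means a fresh session."""
    raw = r.get(f"voice:session:{session_id}")
    return json.loads(raw) if raw else None
```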
- Horizontal Pod Autoscaling (~10 min)
- Scaling voice workloads (CPU-bound, not memory-bound)
- HPA configuration:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

- Scaling based on concurrent sessions vs CPU
- Test: Load test with scaling
- Voice Observability Stack (~15 min)
- Key metrics for voice agents:
  - `voice_latency_p95`: End-to-end response time (target: 800ms)
  - `voice_stt_duration`: Speech-to-text processing time
  - `voice_llm_duration`: LLM response time
  - `voice_tts_duration`: Text-to-speech processing time
  - `voice_cost_per_call`: Running cost calculation
  - `voice_transcription_errors`: STT quality indicator
- Prometheus metrics exposition (sketch below)
- Grafana dashboard for voice operations
- Alerting: Latency > 1s, Cost > $0.10/min
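A sketch of the exposition side with `prometheus_client`; the metric names follow the list above, while the bucket edges and label set are assumptions:

```python
# prometheus_client exposition sketch; bucket edges and labels are assumptions.
from prometheus_client import Counter, Histogram, start_http_server

VOICE_LATENCY = Histogram(
    "voice_latency_seconds",
    "End-to-end voice response time",
    buckets=(0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0),  # spec target: P95 < 0.8s
)
VOICE_COST = Counter(
    "voice_cost_dollars_total", "Accumulated call cost in dollars", ["provider"]
)

start_http_server(9090)  # serve /metrics for the Prometheus scraper

# Usage inside the agent loop (names from the Lesson 2 sketch):
# with VOICE_LATENCY.time():
#     reply = await agent.run_turn(utterance_audio)
# VOICE_COST.labels(provider="deepgram").inc(stt_cost)
```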
- Cost Monitoring & Optimization (~10 min)
- Cost tracking implementation:

```python
# Track per-call costs using the planning rates from the spec
stt_cost = audio_duration_minutes * 0.0077   # Deepgram Nova-3 STT
llm_cost = tokens * 0.000002                 # GPT-4o-mini, rough per-token rate
tts_cost = audio_duration_minutes * 0.024    # Cartesia Sonic-3 TTS
total_cost = stt_cost + llm_cost + tts_cost
metrics.observe('voice_cost_per_call', total_cost)
```

- Cost dashboard with daily/weekly rollups
- Alerting when cost exceeds target
- Optimization strategies: Caching, prompt optimization
- Compliance & Recording (~10 min)
- Recording consent requirements (varies by jurisdiction)
- GDPR considerations for EU users
- Data retention policies
- Consent flow implementation:
- Browser: Click-to-consent before microphone
- Phone: "This call may be recorded for quality purposes"
- Recording storage and access controls
- Failover & Resilience (~10 min)
- Provider failover: Deepgram -> Whisper fallback (sketch below)
- Regional failover: Multi-region deployment
- Graceful degradation: Text fallback when voice fails
- SLA considerations: 99.5% availability target
- Incident runbook for voice outages
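The provider-failover bullet above can be made concrete with a small wrapper. A sketch, where `stt_deepgram` and `stt_whisper` are hypothetical async wrappers around your actual client calls:

```python
# Failover sketch; stt_deepgram/stt_whisper are hypothetical wrappers
# around the real provider clients.
import asyncio

async def transcribe_with_failover(audio: bytes) -> str:
    """Try the primary STT; fall back to the secondary on timeout or error."""
    try:
        return await asyncio.wait_for(stt_deepgram(audio), timeout=2.0)
    except (asyncio.TimeoutError, ConnectionError):
        # Degraded mode: higher latency, but the call keeps going
        return await stt_whisper(audio)
```

The same pattern covers TTS fallback; regional failover belongs at the load-balancer/DNS layer rather than in agent code.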
- Production Validation (~5 min)
- Final checklist against spec:
- Sub-800ms P95 latency
- $0.03-0.07/min cost achieved
- Browser channel working
- Phone channel working
- Screen share working
- Session persistence validated
- Monitoring and alerting active
- Sign-off: Production-ready voice agent
Duration Estimate: 80 minutes
Three Roles Integration (Layer 4 Operations):
AI as Operations Advisor:
- Suggest deployment patterns based on your requirements
- Help design monitoring and alerting
You as Operator:
- Deploy and validate in your environment
- Make compliance decisions for your jurisdiction
- Own the production system
Convergence:
- Production checklist complete
- All spec requirements validated
- Voice-enabled Task Manager is live
Try With AI Prompts:
- Generate Kubernetes Manifests
Using my livekit-agents skill and Part 7 knowledge, generate
Kubernetes manifests for my voice agent:
Requirements from my spec:
- 2-20 replicas based on load
- Redis for session persistence
- Prometheus metrics exposure
- Health checks (liveness + readiness)
- Secrets management for API keys
My cluster: [describe your K8s setup]
Generate the complete manifest set I can apply.
What you're learning: Production deployment - operationalizing voice agents.
- Design the Observability Dashboard
I need a Grafana dashboard for monitoring my voice agent:
Key metrics to track:
- End-to-end latency (P50, P95, P99)
- Per-component latency (STT, LLM, TTS)
- Cost per call and daily cost
- Concurrent sessions
- Error rates
Help me design:
1. Dashboard layout and panels
2. Prometheus queries for each metric
3. Alerting rules (latency > 1s, cost > $0.10/min)
I'll implement and configure based on your design.
What you're learning: Voice observability - monitoring what matters for voice systems.
- Plan the Failover Strategy
My voice agent needs resilience against:
1. Primary STT provider (Deepgram) goes down
2. Primary TTS provider (Cartesia) goes down
3. AWS region outage
Help me design failover for each scenario:
- Detection: How do I know there's a problem?
- Failover: What's the backup?
- Recovery: How do I return to primary?
- User impact: What do users experience?
I need this documented for my operations runbook.
What you're learning: Resilience engineering - designing for failure in voice systems.
V. Skill Composition
Skills Used in This Capstone (NOT created, composed):
| Skill | Source | Role in Capstone |
|---|---|---|
| `livekit-agents` | Chapter 80 | Core voice agent, WebRTC, multi-agent patterns |
| `pipecat` | Chapter 81 | Provider flexibility, custom processors |
| `voice-telephony` | Chapter 84 | Phone/Twilio integration |
| `web-audio-capture` | Chapter 84 | Browser audio capture |
Direct API Knowledge Applied:
- OpenAI Realtime API (Chapter 82): Understanding for Native S2S comparison
- Gemini Live API (Chapter 83): Multimodal screen sharing integration
No new skills created - Capstones compose existing skills into production systems.
VI. Assessment Plan
Formative Assessments (During Lessons)
- Lesson 1: Specification review (complete, measurable, realistic)
- Lesson 2: Integration testing (each channel works independently)
- Lesson 3: Production checklist (all items validated)
Summative Assessment (End of Chapter)
Capstone Demonstration:
Students demonstrate their production voice agent:
- Browser Demo (5 min)
- Navigate to Task Manager
- Click voice button, grant microphone
- Create task via voice: "Add a task to review the Q4 proposal"
- List tasks: "What are my open tasks?"
- Complete task: "Mark the proposal review as done"
- Verify sub-800ms latency in monitoring dashboard
- Phone Demo (5 min)
- Call the dedicated phone number
- Create task via phone
- Verify call appears in monitoring
- Show cost tracking
- Screen Share Demo (5 min)
- Share screen showing a webpage/document
- Say: "Add what I'm looking at to my tasks"
- Verify task created with visual context
- Production Operations Demo (5 min)
- Show Grafana dashboard with live metrics
- Show cost tracking against $0.03-0.07 target
- Show Kubernetes deployment status
- Explain failover strategy
Grading Criteria:
| Criterion | Weight | Excellent | Satisfactory | Needs Improvement |
|---|---|---|---|---|
| Specification Quality | 20% | Complete, measurable, realistic | Mostly complete | Missing key requirements |
| Implementation | 30% | All channels working, <800ms latency | 2/3 channels working | 1 channel working |
| Production Deployment | 25% | K8s + monitoring + alerting | K8s + monitoring | K8s only |
| Cost Achievement | 15% | $0.03-0.07/min achieved | $0.07-0.10/min | >$0.10/min |
| Documentation | 10% | Spec + runbook + architecture | Spec + architecture | Spec only |
VII. Validation Checklist
Chapter-Level Validation:
- Chapter type identified: TECHNICAL (LAYER 4 CAPSTONE)
- Concept density analysis documented: 8 integration concepts across 3 lessons
- Lesson count justified: 3 lessons (spec->implement->deploy pattern)
- All evals covered by lessons
- All lessons map to at least one eval
- NOT a skill-first chapter (capstones compose, not create)
Stage Progression Validation:
- Lesson 1: Layer 4 (Spec-First) - Write specification BEFORE implementation
- Lesson 2: Layer 4 (AI Orchestrates) - Use skills to implement spec
- Lesson 3: Layer 4 (Validation) - Production deployment and operations
- All prior layers (1-3) assumed completed in Chapters 79-84
Cognitive Load Validation:
- Lesson 1: 3 integration concepts <= 10 (B2 limit) PASS
- Lesson 2: 3 integration concepts <= 10 (B2 limit) PASS
- Lesson 3: 2 integration concepts <= 10 (B2-C1 limit) PASS
Capstone Requirements:
- Composes existing skills (not creates new ones)
- Spec-first approach (specification before implementation)
- Multi-channel integration (browser + phone)
- Multimodal integration (voice + vision)
- Production deployment (Kubernetes)
- Observability (monitoring, alerting, cost tracking)
- Business considerations (cost, compliance, SLAs)
Cross-Chapter Dependencies:
- Requires: Chapter 79 (architecture mental models)
- Requires: Chapter 80 (livekit-agents skill)
- Requires: Chapter 81 (pipecat skill)
- Requires: Chapter 82 (OpenAI Realtime understanding)
- Requires: Chapter 83 (Gemini Live for multimodal)
- Requires: Chapter 84 (voice-telephony, web-audio-capture skills)
- Requires: Part 7 (Kubernetes deployment patterns)
Three Roles Validation (Layer 4):
- Spec-driven: Student writes specification, AI validates
- AI orchestrates: AI uses student's skills to implement student's spec
- Student validates: Student tests against spec requirements
VIII. File Structure
67-capstone-production-voice-agent/
├── _category_.json # Existing
├── README.md # Chapter overview (create)
├── 01-system-architecture.md # Lesson 1: Spec-First Design (create)
├── 02-implementation.md # Lesson 2: Implementation (create)
├── 03-production-deployment.md # Lesson 3: Production Ops (create)
└── 04-capstone-assessment.md # Final assessment rubric (create)
IX. Summary
Chapter 85: Production Voice Agent (Capstone) is a 3-lesson Layer 4 integration chapter:
| Lesson | Title | Integration Concepts | Duration | Evals |
|---|---|---|---|---|
| 1 | System Architecture & Specification Design | 3 | 55 min | #1, #2, #7 |
| 2 | Implementation & Integration | 3 | 80 min | #3, #4, #8 |
| 3 | Production Deployment & Operations | 2 | 80 min | #5, #6, #7 |
Total: 8 integration concepts, ~215 minutes, production voice-enabled Task Manager
Capstone Output:
- Production specification for voice-enabled Task Manager
- Multi-channel voice agent (browser + phone + screen share)
- Kubernetes deployment with session persistence
- Observability dashboard with cost tracking
- Compliance and failover documentation
Skills Composed (not created):
`livekit-agents`, `pipecat`, `voice-telephony`, `web-audio-capture`
Production Targets:
- Sub-800ms end-to-end latency (P95)
- $0.03-0.07 per minute cost
- 99.5% availability
- Multi-channel support (browser + phone)
- Multimodal support (voice + screen sharing)
X. Connection to Book Thesis
This capstone fulfills Part 11's contribution to the book's thesis: "Manufacture Digital FTEs powered by agents, specs, skills."
Students graduate Part 11 with:
- Skills: `livekit-agents`, `pipecat`, `voice-telephony`, `web-audio-capture`
- Spec: Production specification for voice-enabled Task Manager
- Digital FTE: A 24/7 voice assistant that can answer phones, accept browser commands, and see user screens
The voice-enabled Task Manager is a sellable Digital FTE component—a production-ready voice assistant built on documented specifications and reusable skills.