
Chapter 85: Production Voice Agent (Capstone) - Lesson Plan

Generated by: chapter-planner v2.0.0 (Reasoning-Activated)
Source: Part 11 README, Chapters 79-84 context, Deep Search Report
Created: 2026-01-02
Constitution: v6.0.0 (Reasoning Mode)


I. Chapter Analysis

Chapter Type

TECHNICAL (LAYER 4 CAPSTONE) - This is the integrative capstone for Part 11. Students orchestrate ALL accumulated knowledge from Chapters 79-84 into a production-grade voice-enabled Task Manager.

Recognition signals:

  • Learning objectives use "design/architect/deploy/orchestrate"
  • No new framework concepts - composition of existing skills
  • Layer 4: Spec-Driven Integration (spec FIRST, then implementation)
  • Composition of skills: livekit-agents, pipecat, voice-telephony, web-audio-capture
  • Production deployment with Kubernetes, monitoring, cost optimization
  • Business considerations: cost analysis, compliance, SLAs

This is NOT a skill-first (L00) chapter - Capstones COMPOSE existing skills, they don't create new ones.

Concept Density Analysis

Core Concepts (from Capstone requirements): 8 integration concepts

  1. Multi-channel architecture (browser WebRTC + phone SIP)
  2. Provider selection strategy (when to use Native S2S vs a Cascaded Pipeline)
  3. Multimodal integration (voice + screen sharing via Gemini)
  4. Conversation design (turn-taking, interruption, barge-in patterns)
  5. Kubernetes voice deployment (session persistence, scaling, GPU nodes)
  6. Observability for voice (latency metrics, transcription quality, cost tracking)
  7. Cost optimization (economy stack: Deepgram + GPT-4o-mini + Cartesia)
  8. Production operations (compliance, failover, SLAs)

Complexity Assessment: Complex (integrative capstone requiring synthesis of 6 chapters)

Proficiency Tier: B2-C1 (capstone requires advanced synthesis)

Justified Lesson Count: 3 lessons (capstone scope)

  • Lesson 1: System Architecture & Specification Design (Layer 4: Spec-First)
  • Lesson 2: Implementation & Integration (Layer 4: AI Orchestrates Using Skills)
  • Lesson 3: Production Deployment & Operations (Layer 4: Validation & Operations)

Reasoning:

  • Capstones are integrative, not additive - students already learned the concepts
  • 8 integration concepts across 3 lessons = 2-3 concepts per lesson (synthesis, not learning)
  • Layer 4 requires spec-first approach: write specification BEFORE implementation
  • 3 lessons follow the spec->implement->deploy pattern
  • Total ~215 minutes for the complete production voice agent

II. Success Evals (from Part 11 README)

Success Criteria (what students must demonstrate):

  1. Architecture Design: Students architect a multi-channel voice system combining browser (WebRTC) and phone (SIP/Twilio) with justified technology choices
  2. Provider Selection: Students select appropriate providers for their use case (Native S2S for premium UX vs Economy Stack for high volume) with cost analysis
  3. Multimodal Integration: Students integrate voice + screen sharing for context-aware task management
  4. Natural Conversation: Students implement proper turn-taking, semantic interruption detection, and barge-in handling
  5. Production Deployment: Students deploy to Kubernetes with session persistence, HPA, and GPU-aware scheduling
  6. Observability: Students implement voice-specific metrics (latency percentiles, transcription accuracy, cost per call)
  7. Cost Target: Students achieve $0.03-0.07 per minute target with economy stack
  8. End-to-End Latency: Students achieve sub-800ms end-to-end latency

All lessons below map to these evals.


III. Accumulated Skills from Part 11

Skills students bring to this capstone (created in Chapters 80-84):

| Chapter | Skill | Purpose in Capstone |
|---------|-------|---------------------|
| 80 | livekit-agents | Browser WebRTC, multi-agent handoff, K8s patterns |
| 81 | pipecat | Provider flexibility, custom processors |
| 84 | voice-telephony | Phone integration, SIP, Twilio |
| 84 | web-audio-capture | Browser audio, Silero VAD |

Direct API knowledge (Chapters 82-83):

  • OpenAI Realtime API for native S2S when premium UX required
  • Gemini Live API for voice + vision multimodal

Foundational knowledge (Chapter 79):

  • Architecture decision matrix (Native S2S vs Cascaded)
  • Latency budgets and optimization (re-derived in the sketch after this list)
  • Technology stack tradeoffs
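
Those latency budgets are worth re-deriving before starting Lesson 1. A quick worst-case check, using the per-component latencies this plan quotes later in its provider selection:

    # Latency budget check against the 800ms end-to-end target, using the
    # per-component figures quoted in Lesson 1's provider selection.
    STT_MS = 90          # Deepgram Nova-3
    LLM_MS = (200, 400)  # GPT-4o-mini, typical range
    TTS_MS = (40, 90)    # Cartesia Sonic-3
    TARGET_MS = 800

    worst_case = STT_MS + LLM_MS[1] + TTS_MS[1]  # 580 ms
    headroom = TARGET_MS - worst_case            # 220 ms
    print(f"Pipeline worst case: {worst_case} ms, headroom: {headroom} ms")
    # The ~220 ms of headroom must absorb VAD, turn detection, and network
    # transit, which is why semantic turn detection and routing matter.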

IV. Lesson Sequence


Lesson 1: System Architecture & Specification Design

Title: System Architecture & Specification Design

Learning Objectives:

  • Write a production specification for a multi-channel voice agent (spec FIRST)
  • Design system architecture combining LiveKit (browser) and Twilio (phone)
  • Select providers based on latency, cost, and quality requirements
  • Justify technology choices with documented trade-offs
  • Define success metrics and acceptance criteria upfront

Stage: Layer 4 (Spec-Driven Integration) - Specification is the PRIMARY artifact

CEFR Proficiency: B2

Integration Concepts (count: 3):

  1. Multi-channel architecture design
  2. Provider selection strategy
  3. Specification-first design

Cognitive Load Validation: 3 integration concepts (synthesis of known material) <= 10 limit (B2) -> WITHIN LIMIT

Maps to Evals: #1 (Architecture Design), #2 (Provider Selection), #7 (Cost Target)

Key Sections:

  1. The Capstone Project (~5 min)

    • What you're building: Voice-enabled Task Manager
    • Channels: Browser (WebRTC), Phone (Twilio)
    • Capabilities: Voice commands, screen sharing, natural conversation
    • Business context: 24/7 voice assistant for task management
    • Why spec-first: Define success BEFORE implementation
  2. Write Your Production Specification (~20 min)

    • Intent: What problem does this voice agent solve?
    • Channels: Browser (LiveKit WebRTC) + Phone (Twilio SIP)
    • User Stories:
      • "As a user, I can speak to my Task Manager via browser"
      • "As a user, I can call a dedicated phone number to manage tasks"
      • "As a user, I can share my screen and say 'add this to my tasks'"
    • Functional Requirements:
      • Sub-800ms end-to-end latency
      • Natural turn-taking with semantic detection
      • Barge-in support (user can interrupt)
      • Task CRUD via voice (list, create, complete, delete)
    • Non-Functional Requirements:
      • Cost target: $0.03-0.07 per minute
      • 99.5% availability
      • GDPR compliance for call recording (if enabled)
    • Success Metrics:
      • P95 latency < 800ms
      • Task creation success rate > 95%
      • User satisfaction > 4.0/5.0
  3. Architecture Decision: Native S2S vs Cascaded (~10 min)

    • Review Chapter 79 decision matrix
    • For THIS project: Economy stack (cost-sensitive, high volume potential)
    • Native S2S reserved for: Premium tiers, demo environments
    • Document the decision in your spec:
      ## Architecture Decision: Cascaded Pipeline

      **Decision**: Use cascaded pipeline (STT -> LLM -> TTS) for primary flow

      **Rationale**:
      - Cost target $0.03-0.07/min rules out Native S2S (~$0.11/min)
      - Economy stack achieves $0.033/min
      - Latency target (800ms) achievable with cascaded
      - Provider flexibility for future optimization

      **Trade-off**: Higher latency (500-800ms vs 200-300ms)
      **Mitigation**: Semantic turn detection, optimized providers
  4. Provider Selection (~10 min)

    • STT: Deepgram Nova-3 ($0.0077/min, 90ms latency)
    • LLM: GPT-4o-mini ($0.0015/min, 200-400ms latency)
    • TTS: Cartesia Sonic-3 ($0.024/min, 40-90ms latency)
    • VAD: Silero VAD (free, <1ms)
    • Total: ~$0.033/min (within target; worked cost check after this list)
    • Document provider selection with alternatives
  5. Multi-Channel Architecture Diagram (~10 min)

    • Draw the system architecture:
      ┌───────────────────────────────────────────────────────────┐
      │                     Voice Task Manager                    │
      └─────────────────────────────┬─────────────────────────────┘
                                    │
               ┌────────────────────┼────────────────────┐
               │                    │                    │
      ┌────────▼────────┐  ┌────────▼────────┐  ┌────────▼────────┐
      │ Browser Client  │  │  Phone Client   │  │  Screen Share   │
      │ (LiveKit WebRTC)│  │  (Twilio SIP)   │  │  (Gemini Live)  │
      └────────┬────────┘  └────────┬────────┘  └────────┬────────┘
               │                    │                    │
               └────────────────────┼────────────────────┘
                                    │
                         ┌──────────▼──────────┐
                         │  Voice Agent Core   │
                         │  (LiveKit Agents)   │
                         └──────────┬──────────┘
                                    │
               ┌────────────────────┼────────────────────┐
               │                    │                    │
      ┌────────▼────────┐  ┌────────▼────────┐  ┌────────▼────────┐
      │  STT Pipeline   │  │   LLM Router    │  │  TTS Pipeline   │
      │(Deepgram Nova-3)│  │  (GPT-4o-mini)  │  │ (Cartesia Sonic)│
      └─────────────────┘  └────────┬────────┘  └─────────────────┘
                                    │
                         ┌──────────▼──────────┐
                         │  Task Manager API   │
                         │  (from Part 6/7)    │
                         └─────────────────────┘
    • Explain component responsibilities
    • Identify integration points
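
The cost target in section 4 rests on simple arithmetic that belongs in the spec. Below is a worked check using the provider rates above; note the GPT-4o-mini figure is this plan's per-conversational-minute estimate, not a published per-minute price:

    # Economy-stack cost per conversational minute, from the rates above.
    STT_PER_MIN = 0.0077  # Deepgram Nova-3, per audio minute
    LLM_PER_MIN = 0.0015  # GPT-4o-mini, estimated per conversational minute
    TTS_PER_MIN = 0.024   # Cartesia Sonic-3, per audio minute

    total = STT_PER_MIN + LLM_PER_MIN + TTS_PER_MIN
    print(f"Economy stack: ${total:.4f}/min")  # ~$0.0332/min

    # Guard against drifting out of the spec's non-functional requirement.
    assert 0.03 <= total <= 0.07, "outside the $0.03-0.07/min cost target"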

Duration Estimate: 55 minutes

Three Roles Integration (Layer 4 Spec-Driven):

You as Architect:

  • Write the specification (AI does not write specs for you)
  • Make architecture decisions with documented trade-offs
  • Define success metrics

AI as Implementation Partner:

  • Review specification for completeness
  • Suggest missing requirements based on production experience
  • Validate cost calculations

Convergence:

  • Iterate on specification until both agree it is complete
  • AI validates that spec covers all capstone requirements

Try With AI Prompts:

  1. Review Your Specification

    I wrote a production specification for my voice-enabled Task Manager:

    [paste your spec.md]

    Review this specification against these criteria:
    1. Are all capstone requirements covered? (browser, phone, screen share)
    2. Are the success metrics measurable?
    3. Is the cost analysis realistic?
    4. What am I missing that would cause production issues?

    Be critical - I want to find gaps NOW, not during implementation.

    What you're learning: Spec validation - catching gaps before implementation.

  2. Validate Architecture Decisions

    I chose a cascaded pipeline over Native S2S for cost reasons:

    - Economy stack: $0.033/min (Deepgram + GPT-4o-mini + Cartesia)
    - Native S2S: $0.11/min (OpenAI Realtime)

    My latency target is sub-800ms. Challenge my decision:
    1. Can cascaded pipeline actually achieve 800ms?
    2. What scenarios would force me to reconsider Native S2S?
    3. What's my fallback if economy stack quality is insufficient?

    Help me stress-test this decision.

    What you're learning: Decision validation - stress-testing architectural choices.

  3. Design the Integration Points

    My voice agent needs to integrate with:
    - LiveKit for browser WebRTC
    - Twilio for phone SIP
    - Gemini Live for screen sharing
    - Task Manager API (from Part 6/7)

    Help me design the integration interfaces:
    1. How do different channels converge to the same agent logic?
    2. How does the agent know which channel a request came from?
    3. How do I handle channel-specific features (screen share only on browser)?

    Draw the interface contracts between components.

    What you're learning: Integration design - defining clean boundaries between systems.


Lesson 2: Implementation & Integration

Title: Implementation & Integration

Learning Objectives:

  • Implement voice agent core using accumulated skills (livekit-agents, pipecat)
  • Integrate browser channel with LiveKit WebRTC
  • Integrate phone channel with Twilio SIP via voice-telephony skill
  • Add multimodal screen sharing with Gemini Live API
  • Implement natural conversation patterns (turn-taking, barge-in)
  • Connect voice agent to Task Manager API via MCP

Stage: Layer 4 (AI Orchestrates Using Skills) - Implementation follows spec

CEFR Proficiency: B2

Integration Concepts (count: 3):

  1. Multi-channel implementation
  2. Multimodal integration (voice + vision)
  3. Conversation design patterns

Cognitive Load Validation: 3 integration concepts <= 10 limit (B2) -> WITHIN LIMIT

Maps to Evals: #3 (Multimodal Integration), #4 (Natural Conversation), #8 (End-to-End Latency)

Key Sections:

  1. Implementation Strategy (~5 min)

    • Spec guides implementation (reference your spec.md)
    • Use skills to scaffold, not write from scratch
    • Implementation order: Core -> Browser -> Phone -> Screen Share
    • Test each channel before integration
  2. Voice Agent Core (~15 min)

    • Use livekit-agents skill to scaffold agent
    • Configure economy stack providers:
      from livekit.agents import AgentContext, VoiceAgent
      from livekit.plugins import deepgram, openai, cartesia

      async def entrypoint(ctx: AgentContext):
          agent = VoiceAgent(
              stt=deepgram.STT(model="nova-3"),
              llm=openai.LLM(model="gpt-4o-mini"),
              tts=cartesia.TTS(model="sonic-3"),
          )

          # Connect the Task Manager MCP server
          agent.add_mcp_server("./task-manager-mcp")

          await agent.start(ctx)
    • Configure semantic turn detection
    • Add MCP server connection for Task Manager
  3. Browser Channel (LiveKit WebRTC) (~15 min)

    • Use web-audio-capture skill for browser client
    • WebRTC room connection from browser
    • Audio capture with Silero VAD
    • UI considerations: mute button, speaking indicator
    • Test: Voice command -> Task creation -> Confirmation
  4. Phone Channel (Twilio SIP) (~15 min)

    • Use voice-telephony skill for Twilio integration
    • SIP trunk configuration
    • Phone number provisioning
    • Inbound call routing to LiveKit agent
    • Test: Call number -> Voice command -> Task creation
  5. Multimodal Screen Sharing (~15 min)

    • Use Gemini Live API knowledge from Chapter 83
    • Screen sharing consent and capture
    • Visual context sent to Gemini alongside voice
    • Example: "Add what's on my screen to my tasks"
    • Integration with Task Manager API
  6. Natural Conversation Patterns (~10 min)

    • Semantic turn detection configuration (from Chapter 80)
    • Barge-in handling (from Chapter 82)
    • Filler speech during async operations
    • Confirmation flows for destructive actions (sketched after this list)
    • Test conversational quality
  7. Integration Testing (~5 min)

    • End-to-end test: Browser voice command
    • End-to-end test: Phone call flow
    • End-to-end test: Screen share task creation
    • Measure latency against spec target (800ms)
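
Two items above are worth giving concrete shape before prompting the AI. First, the confirmation flow for destructive actions (section 6). A minimal sketch; the function name, state handling, and keyword matching are illustrative, not part of any voice SDK:

    # Confirmation gate for destructive voice actions (section 6).
    # All names here are illustrative -- adapt to your agent framework.
    pending_delete: str | None = None  # task ID awaiting an explicit "yes"

    def handle_utterance(text: str, tasks: dict[str, str]) -> str:
        """Route one transcribed utterance, gating deletes behind confirmation."""
        global pending_delete
        words = text.strip().lower()

        # A delete is pending: only a yes/no style answer matters now.
        if pending_delete is not None:
            task_id, pending_delete = pending_delete, None
            if words in {"yes", "yeah", "confirm", "do it"}:
                tasks.pop(task_id, None)
                return f"Deleted '{task_id}'."
            return "Okay, leaving it alone."

        # Destructive request: stash it and ask before executing.
        if words.startswith("delete "):
            pending_delete = words.removeprefix("delete ").strip()
            return f"Just to confirm: delete '{pending_delete}'?"

        return "Sorry, I didn't catch that."

Second, measuring end-to-end latency against the 800ms target (section 7) can start as a simple wall-clock harness around one round trip; `respond_to` is a hypothetical stand-in for your full STT -> LLM -> TTS path:

    # Wall-clock latency harness for one voice round trip (section 7).
    import time

    def measure_round_trip(respond_to, audio: bytes) -> float:
        """Time one full voice round trip, in milliseconds."""
        start = time.perf_counter()
        respond_to(audio)  # run the whole STT -> LLM -> TTS path once
        return (time.perf_counter() - start) * 1000

    # Collect many samples, then compare the 95th percentile to the target:
    #     samples.sort(); p95 = samples[int(0.95 * len(samples))]
    #     assert p95 < 800, "P95 latency exceeds spec target"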

Duration Estimate: 80 minutes

Three Roles Integration (Layer 4 AI Orchestrates):

AI as Implementer:

  • Use your skills to generate implementation code
  • AI follows your specification, not its own ideas

You as Validator:

  • Test each component against spec requirements
  • Validate latency meets targets
  • Ensure conversation quality

Convergence:

  • Iterate on implementation until all spec requirements pass
  • Document any spec changes discovered during implementation

Try With AI Prompts:

  1. Scaffold the Voice Agent Core

    Using my livekit-agents skill, scaffold the voice agent core for my
    Task Manager:

    From my spec:
    - STT: Deepgram Nova-3
    - LLM: GPT-4o-mini
    - TTS: Cartesia Sonic-3
    - MCP: Task Manager API (list_tasks, create_task, complete_task)

    Include:
    1. Semantic turn detection configuration
    2. MCP server connection
    3. System prompt for task management persona
    4. Graceful error handling

    Follow my spec exactly. I'll test and validate.

    What you're learning: Spec-driven implementation - AI implements YOUR specification.

  2. Integrate Phone Channel

    Using my voice-telephony skill, integrate Twilio phone support:

    Requirements from my spec:
    - Inbound calls routed to LiveKit agent
    - Same conversation logic as browser channel
    - Phone-specific greeting: "Task Manager speaking, how can I help?"

    I have:
    - Twilio account with SIP trunk configured
    - LiveKit server running

    Walk me through the Twilio -> LiveKit routing configuration.

    What you're learning: Channel integration - connecting telephony to voice agent.

  3. Add Screen Sharing

    I need to add screen sharing capability using Gemini Live API.

    Use case from my spec:
    - User shares screen while talking
    - Says: "Add what I'm looking at to my tasks"
    - Agent sees the screen, extracts context, creates task

    Using my knowledge from Chapter 83, help me:
    1. Configure Gemini Live for voice + vision
    2. Integrate with my LiveKit-based voice agent
    3. Handle the screen share permission flow
    4. Extract visual context for task creation

    This is the multimodal piece of my capstone.

    What you're learning: Multimodal integration - combining voice and vision modalities.


Lesson 3: Production Deployment & Operations

Title: Production Deployment & Operations

Learning Objectives:

  • Deploy voice agent to Kubernetes with production configurations
  • Implement session persistence across pod restarts
  • Configure horizontal pod autoscaling for voice workloads
  • Set up voice-specific observability (latency, quality, cost metrics)
  • Implement cost monitoring against $0.03-0.07/min target
  • Document compliance considerations (recording, consent)
  • Design failover strategies for voice infrastructure

Stage: Layer 4 (Validation & Operations) - Production readiness

CEFR Proficiency: B2-C1

Integration Concepts (count: 2):

  1. Kubernetes voice deployment
  2. Observability and operations

Cognitive Load Validation: 2 integration concepts <= 10 limit (B2-C1) -> WITHIN LIMIT

Maps to Evals: #5 (Production Deployment), #6 (Observability), #7 (Cost Target)

Key Sections:

  1. Kubernetes Deployment Strategy (~10 min)

    • Review Part 7 Kubernetes patterns
    • Voice-specific considerations:
      • Session affinity for conversation continuity
      • Redis for session state persistence
      • GPU nodes for VAD/turn detection models (if used)
    • Deployment architecture:
      # voice-agent deployment
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: voice-agent
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: voice-agent
        template:
          metadata:
            labels:
              app: voice-agent   # must match the selector above
          spec:
            containers:
              - name: voice-agent
                image: task-manager-voice:latest
                resources:
                  requests:
                    memory: "512Mi"
                    cpu: "500m"
                  limits:
                    memory: "1Gi"
                    cpu: "1000m"
                env:
                  - name: REDIS_URL
                    valueFrom:
                      secretKeyRef:
                        name: voice-secrets
                        key: redis-url
  2. Session Persistence (~10 min)

    • Why session persistence matters: Mid-call pod restart
    • Redis configuration for session state
    • Session reconnection logic
    • Test: Pod restart during active call
    • Code: Session persistence implementation (sketched after this list)
  3. Horizontal Pod Autoscaling (~10 min)

    • Scaling voice workloads (CPU-bound, not memory-bound)
    • HPA configuration:
      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: voice-agent-hpa
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: voice-agent
        minReplicas: 2
        maxReplicas: 20
        metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 70
    • Scaling based on concurrent sessions vs CPU
    • Test: Load test with scaling
  4. Voice Observability Stack (~15 min)

    • Key metrics for voice agents:
      • voice_latency_p95: End-to-end response time (target: 800ms)
      • voice_stt_duration: Speech-to-text processing time
      • voice_llm_duration: LLM response time
      • voice_tts_duration: Text-to-speech processing time
      • voice_cost_per_call: Running cost calculation
      • voice_transcription_errors: STT quality indicator
    • Prometheus metrics exposition (sketched after this list)
    • Grafana dashboard for voice operations
    • Alerting: Latency > 1s, Cost > $0.10/min
  5. Cost Monitoring & Optimization (~10 min)

    • Cost tracking implementation:
      # Track per-call costs
      stt_cost = audio_duration_minutes * 0.0077 # Deepgram
      llm_cost = tokens * 0.000002 # GPT-4o-mini
      tts_cost = audio_duration_minutes * 0.024 # Cartesia

      total_cost = stt_cost + llm_cost + tts_cost
      metrics.observe('voice_cost_per_call', total_cost)
    • Cost dashboard with daily/weekly rollups
    • Alerting when cost exceeds target
    • Optimization strategies: Caching, prompt optimization
  6. Compliance & Recording (~10 min)

    • Recording consent requirements (varies by jurisdiction)
    • GDPR considerations for EU users
    • Data retention policies
    • Consent flow implementation:
      • Browser: Click-to-consent before microphone
      • Phone: "This call may be recorded for quality purposes"
    • Recording storage and access controls
  7. Failover & Resilience (~10 min)

    • Provider failover: Deepgram -> Whisper fallback
    • Regional failover: Multi-region deployment
    • Graceful degradation: Text fallback when voice fails
    • SLA considerations: 99.5% availability target
    • Incident runbook for voice outages
  8. Production Validation (~5 min)

    • Final checklist against spec:
      • Sub-800ms P95 latency
      • $0.03-0.07/min cost achieved
      • Browser channel working
      • Phone channel working
      • Screen share working
      • Session persistence validated
      • Monitoring and alerting active
    • Sign-off: Production-ready voice agent
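
Two of the sections above have small cores worth sketching before you prompt for the full implementation. For session persistence (section 2), the essence is writing conversation state to Redis keyed by session ID with a TTL. A minimal sketch, assuming a JSON-serializable state dict; the key layout is illustrative:

    # Session state in Redis so a replacement pod can resume a conversation
    # after a mid-call restart. The voice:session:* key layout is illustrative.
    import json

    import redis

    r = redis.Redis.from_url("redis://voice-redis:6379/0")
    SESSION_TTL_SECONDS = 3600  # drop state after an hour of inactivity

    def save_session(session_id: str, state: dict) -> None:
        r.set(f"voice:session:{session_id}", json.dumps(state), ex=SESSION_TTL_SECONDS)

    def load_session(session_id: str) -> dict | None:
        raw = r.get(f"voice:session:{session_id}")
        return json.loads(raw) if raw else None

For the observability stack (section 4), the metric names listed above map directly onto prometheus_client primitives. Bucket boundaries below are assumptions tuned around the 800ms target:

    # Voice metrics exposition with prometheus_client. Metric names follow
    # section 4; bucket boundaries are assumptions to tune against traffic.
    from prometheus_client import Histogram, start_http_server

    voice_latency = Histogram(
        "voice_latency_seconds", "End-to-end voice response time",
        buckets=(0.1, 0.25, 0.5, 0.8, 1.0, 2.0),  # 0.8 marks the spec target
    )
    stt_duration = Histogram("voice_stt_duration_seconds", "STT processing time")
    llm_duration = Histogram("voice_llm_duration_seconds", "LLM response time")
    tts_duration = Histogram("voice_tts_duration_seconds", "TTS processing time")

    start_http_server(9090)  # serves /metrics for Prometheus to scrape

    # Usage inside the pipeline (illustrative):
    #     with stt_duration.time():
    #         transcript = await stt.transcribe(audio)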

Duration Estimate: 80 minutes

Three Roles Integration (Layer 4 Operations):

AI as Operations Advisor:

  • Suggest deployment patterns based on your requirements
  • Help design monitoring and alerting

You as Operator:

  • Deploy and validate in your environment
  • Make compliance decisions for your jurisdiction
  • Own the production system

Convergence:

  • Production checklist complete
  • All spec requirements validated
  • Voice-enabled Task Manager is live

Try With AI Prompts:

  1. Generate Kubernetes Manifests

    Using my livekit-agents skill and Part 7 knowledge, generate
    Kubernetes manifests for my voice agent:

    Requirements from my spec:
    - 2-20 replicas based on load
    - Redis for session persistence
    - Prometheus metrics exposure
    - Health checks (liveness + readiness)
    - Secrets management for API keys

    My cluster: [describe your K8s setup]

    Generate the complete manifest set I can apply.

    What you're learning: Production deployment - operationalizing voice agents.

  2. Design the Observability Dashboard

    I need a Grafana dashboard for monitoring my voice agent:

    Key metrics to track:
    - End-to-end latency (P50, P95, P99)
    - Per-component latency (STT, LLM, TTS)
    - Cost per call and daily cost
    - Concurrent sessions
    - Error rates

    Help me design:
    1. Dashboard layout and panels
    2. Prometheus queries for each metric
    3. Alerting rules (latency > 1s, cost > $0.10/min)

    I'll implement and configure based on your design.

    What you're learning: Voice observability - monitoring what matters for voice systems.

  3. Plan the Failover Strategy

    My voice agent needs resilience against:
    1. Primary STT provider (Deepgram) goes down
    2. Primary TTS provider (Cartesia) goes down
    3. AWS region outage

    Help me design failover for each scenario:
    - Detection: How do I know there's a problem?
    - Failover: What's the backup?
    - Recovery: How do I return to primary?
    - User impact: What do users experience?

    I need this documented for my operations runbook.

    What you're learning: Resilience engineering - designing for failure in voice systems.
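
Prompt 3 is easier to evaluate if you already hold a reference shape for provider failover. A minimal sketch of the Deepgram -> Whisper STT fallback; the `transcribe_*` functions are hypothetical stand-ins for your real provider clients, and the timeout is an assumption tied to the latency budget:

    # Provider failover sketch: prefer Deepgram, fall back to Whisper.
    # The transcribe_* functions are hypothetical stand-ins for your real
    # provider clients; the detection/failover pattern is the point.
    import asyncio
    import logging

    logger = logging.getLogger("voice.failover")

    STT_TIMEOUT_SECONDS = 2.0  # assumption: past this, the latency budget is blown

    async def transcribe_deepgram(audio: bytes) -> str:
        raise NotImplementedError("wrap your Deepgram client here")

    async def transcribe_whisper(audio: bytes) -> str:
        raise NotImplementedError("wrap your Whisper fallback here")

    async def transcribe_with_failover(audio: bytes) -> str:
        try:
            # Detection: a hard timeout doubles as a health signal.
            return await asyncio.wait_for(
                transcribe_deepgram(audio), timeout=STT_TIMEOUT_SECONDS
            )
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Failover: secondary provider, accepting degraded latency.
            logger.warning("Deepgram failed (%s); failing over to Whisper", exc)
            return await transcribe_whisper(audio)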


V. Skill Composition

Skills Used in This Capstone (NOT created, composed):

| Skill | Source | Role in Capstone |
|-------|--------|------------------|
| livekit-agents | Chapter 80 | Core voice agent, WebRTC, multi-agent patterns |
| pipecat | Chapter 81 | Provider flexibility, custom processors |
| voice-telephony | Chapter 84 | Phone/Twilio integration |
| web-audio-capture | Chapter 84 | Browser audio capture |

Direct API Knowledge Applied:

  • OpenAI Realtime API (Chapter 82): Understanding for Native S2S comparison
  • Gemini Live API (Chapter 83): Multimodal screen sharing integration

No new skills created - Capstones compose existing skills into production systems.


VI. Assessment Plan

Formative Assessments (During Lessons)

  • Lesson 1: Specification review (complete, measurable, realistic)
  • Lesson 2: Integration testing (each channel works independently)
  • Lesson 3: Production checklist (all items validated)

Summative Assessment (End of Chapter)

Capstone Demonstration:

Students demonstrate their production voice agent:

  1. Browser Demo (5 min)

    • Navigate to Task Manager
    • Click voice button, grant microphone
    • Create task via voice: "Add a task to review the Q4 proposal"
    • List tasks: "What are my open tasks?"
    • Complete task: "Mark the proposal review as done"
    • Verify sub-800ms latency in monitoring dashboard
  2. Phone Demo (5 min)

    • Call the dedicated phone number
    • Create task via phone
    • Verify call appears in monitoring
    • Show cost tracking
  3. Screen Share Demo (5 min)

    • Share screen showing a webpage/document
    • Say: "Add what I'm looking at to my tasks"
    • Verify task created with visual context
  4. Production Operations Demo (5 min)

    • Show Grafana dashboard with live metrics
    • Show cost tracking against $0.03-0.07 target
    • Show Kubernetes deployment status
    • Explain failover strategy

Grading Criteria:

| Criterion | Weight | Excellent | Satisfactory | Needs Improvement |
|-----------|--------|-----------|--------------|-------------------|
| Specification Quality | 20% | Complete, measurable, realistic | Mostly complete | Missing key requirements |
| Implementation | 30% | All channels working, <800ms latency | 2/3 channels working | 1 channel working |
| Production Deployment | 25% | K8s + monitoring + alerting | K8s + monitoring | K8s only |
| Cost Achievement | 15% | $0.03-0.07/min achieved | $0.07-0.10/min | >$0.10/min |
| Documentation | 10% | Spec + runbook + architecture | Spec + architecture | Spec only |

VII. Validation Checklist

Chapter-Level Validation:

  • Chapter type identified: TECHNICAL (LAYER 4 CAPSTONE)
  • Concept density analysis documented: 8 integration concepts across 3 lessons
  • Lesson count justified: 3 lessons (spec->implement->deploy pattern)
  • All evals covered by lessons
  • All lessons map to at least one eval
  • NOT a skill-first chapter (capstones compose, not create)

Stage Progression Validation:

  • Lesson 1: Layer 4 (Spec-First) - Write specification BEFORE implementation
  • Lesson 2: Layer 4 (AI Orchestrates) - Use skills to implement spec
  • Lesson 3: Layer 4 (Validation) - Production deployment and operations
  • All prior layers (1-3) assumed completed in Chapters 79-84

Cognitive Load Validation:

  • Lesson 1: 3 integration concepts <= 10 (B2 limit) PASS
  • Lesson 2: 3 integration concepts <= 10 (B2 limit) PASS
  • Lesson 3: 2 integration concepts <= 10 (B2-C1 limit) PASS

Capstone Requirements:

  • Composes existing skills (not creates new ones)
  • Spec-first approach (specification before implementation)
  • Multi-channel integration (browser + phone)
  • Multimodal integration (voice + vision)
  • Production deployment (Kubernetes)
  • Observability (monitoring, alerting, cost tracking)
  • Business considerations (cost, compliance, SLAs)

Cross-Chapter Dependencies:

  • Requires: Chapter 79 (architecture mental models)
  • Requires: Chapter 80 (livekit-agents skill)
  • Requires: Chapter 81 (pipecat skill)
  • Requires: Chapter 82 (OpenAI Realtime understanding)
  • Requires: Chapter 83 (Gemini Live for multimodal)
  • Requires: Chapter 84 (voice-telephony, web-audio-capture skills)
  • Requires: Part 7 (Kubernetes deployment patterns)

Three Roles Validation (Layer 4):

  • Spec-driven: Student writes specification, AI validates
  • AI orchestrates: AI uses student's skills to implement student's spec
  • Student validates: Student tests against spec requirements

VIII. File Structure

67-capstone-production-voice-agent/
├── _category_.json # Existing
├── README.md # Chapter overview (create)
├── 01-system-architecture.md # Lesson 1: Spec-First Design (create)
├── 02-implementation.md # Lesson 2: Implementation (create)
├── 03-production-deployment.md # Lesson 3: Production Ops (create)
└── 04-capstone-assessment.md # Final assessment rubric (create)

IX. Summary

Chapter 85: Production Voice Agent (Capstone) is a 3-lesson Layer 4 integration chapter:

| Lesson | Title | Integration Concepts | Duration | Evals |
|--------|-------|----------------------|----------|-------|
| 1 | System Architecture & Specification Design | 3 | 55 min | #1, #2, #7 |
| 2 | Implementation & Integration | 3 | 80 min | #3, #4, #8 |
| 3 | Production Deployment & Operations | 2 | 80 min | #5, #6, #7 |

Total: 8 integration concepts, ~215 minutes, production voice-enabled Task Manager

Capstone Output:

  • Production specification for voice-enabled Task Manager
  • Multi-channel voice agent (browser + phone + screen share)
  • Kubernetes deployment with session persistence
  • Observability dashboard with cost tracking
  • Compliance and failover documentation

Skills Composed (not created):

  • livekit-agents, pipecat, voice-telephony, web-audio-capture

Production Targets:

  • Sub-800ms end-to-end latency (P95)
  • $0.03-0.07 per minute cost
  • 99.5% availability
  • Multi-channel support (browser + phone)
  • Multimodal support (voice + screen sharing)

X. Connection to Book Thesis

This capstone fulfills Part 11's contribution to the book's thesis: "Manufacture Digital FTEs powered by agents, specs, skills."

Students graduate Part 11 with:

  1. Skills: livekit-agents, pipecat, voice-telephony, web-audio-capture
  2. Spec: Production specification for voice-enabled Task Manager
  3. Digital FTE: A 24/7 voice assistant that can answer phones, accept browser commands, and see user screens

The voice-enabled Task Manager is a sellable Digital FTE component—a production-ready voice assistant built on documented specifications and reusable skills.