Chapter 79: Voice AI Fundamentals - Lesson Plan
Generated by: chapter-planner v2.0.0 (Reasoning-Activated)
Source: Part 11 README, Deep Search Report
Created: 2026-01-01
Constitution: v6.0.0 (Reasoning Mode)
I. Chapter Analysis
Chapter Type
CONCEPTUAL - This chapter builds mental models before building systems. No code implementation, no skill creation. Students need to understand the landscape, architectures, and technology stack before Chapters 80-85 introduce implementation.
Recognition signals:
- Learning objectives use "understand/explain/compare/evaluate"
- No code examples required
- Focus on decision frameworks and mental models
- Prepares students for technical chapters that follow
Concept Density Analysis
Core Concepts (from Part 11 README + Deep Search): 8 concepts
- Voice AI market reality and opportunity
- Framework-first vs API-first approaches
- LiveKit vs Pipecat positioning
- Native Speech-to-Speech (S2S) architecture
- Cascaded Pipeline (STT → LLM → TTS) architecture
- Latency budgets and tradeoffs
- Voice technology stack (STT, TTS, VAD providers)
- Transport protocols (WebRTC vs WebSocket)
Complexity Assessment: Standard (conceptual with decision frameworks)
Proficiency Tier: B1 (Part 11 requires Parts 6, 7, 9, 10 completed)
Justified Lesson Count: 3 lessons
- Lesson 1: Landscape orientation (concepts 1-3)
- Lesson 2: Architecture deep-dive (concepts 4-6)
- Lesson 3: Technology stack (concepts 7-8)
Reasoning: 8 concepts across 3 lessons = ~2.7 concepts per lesson, well within B1 limit of 10. Conceptual chapters benefit from focused, shorter lessons that build mental models progressively.
II. Success Evals (Derived from Part 11 README)
Success Criteria (what students must achieve):
- Landscape Understanding: Students can explain why voice matters for Digital FTEs and articulate the $5.4B → $47.5B market opportunity
- Framework Decision: Students can compare LiveKit Agents vs Pipecat and justify which to use for a given scenario
- Architecture Selection: Students can explain when to use Native S2S (~$0.45/4min, 200-300ms) vs Cascaded Pipeline ($0.03/min, 500-800ms)
- Latency Budget: Students can articulate the latency budget breakdown (40ms mic → 90ms STT → 200ms LLM → 75ms TTS)
- Stack Components: Students can name the key providers (Deepgram Nova-3, Cartesia Sonic-3, Silero VAD) and their roles
All lessons below map to these evals.
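For quick orientation, a minimal arithmetic check of the figures these evals cite. All numbers come from this plan; converting the S2S price to a per-minute rate assumes calls bill roughly linearly.

```python
# Check of the cost and latency figures cited in the evals above.
# Figures come from this plan; linear per-minute billing is an assumption.

s2s_cost_per_4min_call = 0.45                  # Native S2S: ~$0.45 per 4-minute call
s2s_cost_per_min = s2s_cost_per_4min_call / 4  # ~$0.11/min
cascaded_cost_per_min = 0.03                   # Cascaded pipeline: ~$0.03/min

latency_budget_ms = 40 + 90 + 200 + 75         # mic + STT + LLM + TTS = 405 ms

print(f"S2S ~${s2s_cost_per_min:.2f}/min vs Cascaded ~${cascaded_cost_per_min:.2f}/min "
      f"({s2s_cost_per_min / cascaded_cost_per_min:.1f}x premium)")
print(f"Cascaded latency budget: ~{latency_budget_ms} ms end-to-end")
```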
III. Lesson Sequence
Lesson 1: The Voice AI Landscape
Title: The Voice AI Landscape
Learning Objectives:
- Explain why voice is the natural interface for Digital FTEs (accessibility, 24/7 availability, emotional connection)
- Articulate the market reality ($5.4B 2024 → $47.5B 2034) and what it signals for career investment
- Compare framework-first (LiveKit, Pipecat) vs API-first (raw OpenAI Realtime) approaches and identify when each applies
- Evaluate LiveKit Agents vs Pipecat positioning using concrete differentiators (ChatGPT Voice, SIP support, 40+ integrations)
Stage: Layer 1 (Manual Foundation) - First exposure to voice AI concepts
CEFR Proficiency: B1
New Concepts (count: 3):
- Voice as Digital FTE interface (why voice, not just text)
- Framework-first vs API-first thinking
- LiveKit vs Pipecat strategic positioning
Cognitive Load Validation: 3 concepts <= 10 limit (B1) -> WITHIN LIMIT
Maps to Evals: #1 (Landscape Understanding), #2 (Framework Decision)
Key Sections:
- **Why Voice Changes Everything for AI** (~5 min)
- Voice is natural: humans evolved to speak, not type
- 24/7 availability (voice agents answer phones at 3 AM)
- Hands-free interaction (driving, cooking, working)
- Emotional connection (tone, empathy, urgency)
- Accessibility (serves users who cannot type)
- **The Market Reality** (~5 min)
- $5.4B (2024) → $47.5B (2034) market projection
- LiveKit Agents powers ChatGPT's Advanced Voice Mode
- Pipecat integrates 40+ AI services
- Infrastructure is production-ready, not experimental
- **Framework-First Thinking** (~5 min)
- Traditional teaching: raw APIs, "three-model pipeline"
- Modern reality: frameworks abstract complexity
- Decision framework: Use frameworks for production, APIs for edge cases
- Why we teach frameworks first, then direct APIs
- **LiveKit vs Pipecat: Strategic Positioning** (~5 min)
- LiveKit Agents (8,200+ stars): Powers ChatGPT Voice, native SIP, semantic turn detection
- Pipecat (8,900+ stars): 40+ integrations, frame-based pipeline, vendor neutral
- Decision matrix: When to choose which
- Not competitors - different philosophies for different needs
Duration Estimate: 20 minutes
Try With AI Prompts:
- **Explore Your Voice AI Position**
I'm learning about voice AI for Digital FTEs. I currently work in
[your field/industry]. Help me understand:
1. What voice interactions already exist in my field?
2. Where are the gaps - tasks that SHOULD be voice but aren't?
3. What would a voice-enabled Digital FTE look like for my domain?
Ask me clarifying questions about my specific workflows.
**What you're learning:** Domain translation - connecting abstract market trends to your specific professional context.
- **Stress-Test the Framework-First Claim**
I've been told to "use frameworks first, raw APIs for edge cases"
in voice AI. Help me challenge this:
1. When would raw API access actually be NECESSARY?
2. What do you lose by using frameworks vs direct APIs?
3. Are there scenarios where framework abstraction hurts more than helps?
Give me specific technical scenarios, not just abstract principles.
**What you're learning:** Critical evaluation of pedagogical claims - understanding the tradeoffs behind teaching decisions.
- **Compare Framework Philosophies**
LiveKit Agents and Pipecat both build voice AI, but they have
different philosophies:
- LiveKit: "Powers ChatGPT Voice, native SIP, semantic turn detection"
- Pipecat: "40+ integrations, frame-based pipeline, vendor neutral"
Help me understand: What KIND of project would make LiveKit the
obvious choice? What KIND would make Pipecat better? Ask me about
what I'm trying to build so we can figure out which fits my needs.
**What you're learning:** Decision framework construction - building intuition for technology selection through concrete scenarios.
Lesson 2: Voice AI Architectures
Title: Voice AI Architectures
Learning Objectives:
- Explain how Native Speech-to-Speech (OpenAI Realtime, Gemini Live) achieves 200-300ms latency
- Describe the Cascaded Pipeline (STT → LLM → TTS) architecture and its latency budget breakdown
- Compare cost/latency tradeoffs: S2S (~$0.45/4min) vs Cascaded ($0.03/min)
- Apply a decision matrix to select the appropriate architecture for specific use cases
Stage: Layer 1 (Manual Foundation) - Architecture mental models
CEFR Proficiency: B1
New Concepts (count: 3):
- Native Speech-to-Speech architecture
- Cascaded Pipeline architecture
- Latency budget decomposition
Cognitive Load Validation: 3 concepts <= 10 limit (B1) -> WITHIN LIMIT
Maps to Evals: #3 (Architecture Selection), #4 (Latency Budget)
Key Sections:
- **The Two Architectures** (~5 min)
- Native S2S: Single model handles audio-in → audio-out
- Cascaded: STT → LLM → TTS pipeline
- Visual comparison: unified model vs component pipeline
- Both are valid - different tradeoffs
- **Native Speech-to-Speech Deep Dive** (~5 min)
- OpenAI Realtime API: gpt-realtime model
- Gemini Live API: Gemini 2.5 Flash Native Audio
- How it works: model trained on audio directly
- Latency: 200-300ms end-to-end
- Cost: ~$0.45-0.50 per 4-minute call
- Advantage: natural prosody, emotional intelligence
- **Cascaded Pipeline Deep Dive** (~5 min) (see the pipeline sketch after this list)
- STT (Speech-to-Text): Deepgram Nova-3 (~90ms)
- LLM (Language Model): GPT-4o-mini (200-400ms)
- TTS (Text-to-Speech): Cartesia Sonic-3 (40-90ms)
- Total latency: 330-580ms (varies by component choice)
- Cost: ~$0.03/minute
- Advantage: component flexibility, cost control
- **Latency Budgets: Where Time Goes** (~5 min)
- Breakdown: 40ms mic → 90ms STT → 200ms LLM → 75ms TTS
- Why each component matters
- Which component to optimize for your use case
- Human perception: <300ms feels instant, >500ms feels slow
- **Decision Matrix: When to Use Which** (~5 min)
- Native S2S when: Premium experience matters, budget allows, emotional intelligence needed
- Cascaded when: Cost-sensitive, need component flexibility, high volume
- Hybrid approaches: S2S for greeting, cascaded for bulk
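Planning aid (not student-facing, since the chapter stays conceptual): a minimal Python sketch of one cascaded turn, with per-stage timing taken from the budget above. The three stage functions are hypothetical stubs that only simulate latency; a real build would call provider SDKs instead.

```python
import time

# Hypothetical stubs simulating the per-stage latencies quoted in this plan.
def transcribe(audio: bytes) -> str:
    time.sleep(0.090)                       # ~90 ms STT
    return "What are your opening hours?"

def generate_reply(transcript: str) -> str:
    time.sleep(0.200)                       # ~200 ms LLM (time to first token)
    return "We're open nine to five, Monday through Friday."

def synthesize(reply: str) -> bytes:
    time.sleep(0.075)                       # ~75 ms TTS (time to first audio)
    return b"<audio bytes>"

def cascaded_turn(audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS, with a latency verdict."""
    start = time.perf_counter()
    reply_audio = synthesize(generate_reply(transcribe(audio)))
    elapsed_ms = (time.perf_counter() - start) * 1000 + 40   # + ~40 ms mic capture
    verdict = ("feels instant" if elapsed_ms < 300
               else "acceptable" if elapsed_ms <= 500 else "feels slow")
    print(f"Turn latency ~{elapsed_ms:.0f} ms ({verdict})")
    return reply_audio

cascaded_turn(b"<mic audio>")
```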
Duration Estimate: 25 minutes
Try With AI Prompts:
- **Calculate Your Cost Model** (a starter calculation sketch follows these prompts)
I'm planning a voice agent for [describe your use case: customer
support, sales calls, appointment scheduling, etc.]. Help me model
the economics:
- Expected call volume: [X calls/day]
- Average call length: [Y minutes]
- Budget constraints: [describe]
Calculate: Native S2S vs Cascaded costs. At what call volume does
the cost difference become significant? When does S2S quality
justify its premium?
**What you're learning:** Economic reasoning about architecture choices - understanding when quality premiums are justified.
- **Diagnose Latency Problems**
I have a voice agent with latency issues - users complain it feels
"slow." The breakdown is:
- STT: 150ms (using Whisper)
- LLM: 400ms (using GPT-4o)
- TTS: 200ms (using ElevenLabs)
- Total: 750ms
Help me diagnose: Which component should I optimize first? What
alternatives exist for each? What's realistic to achieve?
**What you're learning:** Latency debugging methodology - systematic approach to identifying and fixing performance bottlenecks.
- **Architecture Trade-off Analysis**
I'm building a voice Digital FTE for [your domain]. I need to
choose between:
Option A: Native S2S (OpenAI Realtime) - 250ms, $0.12/min
Option B: Cascaded (Deepgram + GPT-4o-mini + Cartesia) - 450ms, $0.03/min
Help me think through: What user experience differences will
customers actually notice? Is 200ms difference perceptible? When
does cost savings outweigh quality? What would YOU choose for my
specific use case?
**What you're learning:** Multi-criteria decision making - balancing competing concerns (cost, quality, user experience) for real decisions.
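A starter version of the "Calculate Your Cost Model" exercise, using this plan's per-minute figures. The volume and call-length inputs are placeholders the student would replace with their own numbers.

```python
# Monthly spend: Native S2S vs Cascaded, at a given call volume.
# Per-minute rates come from this plan; the inputs below are placeholders.

CASCADED_PER_MIN = 0.03    # ~$0.03/min (e.g., Deepgram + GPT-4o-mini + Cartesia)
S2S_PER_MIN = 0.45 / 4     # ~$0.45 per 4-minute call, ~$0.11/min

def monthly_cost(calls_per_day: int, avg_minutes: float, rate_per_min: float) -> float:
    return calls_per_day * 30 * avg_minutes * rate_per_min

calls_per_day, avg_minutes = 200, 4.0      # example inputs, not recommendations
for label, rate in [("Cascaded", CASCADED_PER_MIN), ("Native S2S", S2S_PER_MIN)]:
    print(f"{label}: ${monthly_cost(calls_per_day, avg_minutes, rate):,.0f}/month")
# At these example numbers: Cascaded ~$720/month, S2S ~$2,700/month.
```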
Lesson 3: The Voice AI Technology Stack
Title: The Voice AI Technology Stack
Learning Objectives:
- Identify key STT providers (Deepgram, AssemblyAI, Whisper) and their differentiators
- Compare TTS providers (Cartesia, ElevenLabs, Deepgram Aura) for quality, latency, and cost
- Explain why VAD (Voice Activity Detection) matters and how Silero VAD works
- Distinguish WebRTC vs WebSocket transport protocols and when each applies
Stage: Layer 1 (Manual Foundation) - Technology stack orientation
CEFR Proficiency: B1
New Concepts (count: 2):
- Voice AI technology stack components (STT, TTS, VAD)
- Transport protocols (WebRTC vs WebSocket)
Cognitive Load Validation: 2 concepts <= 10 limit (B1) -> WITHIN LIMIT
Maps to Evals: #5 (Stack Components)
Key Sections:
- **The Component Stack Overview** (~3 min)
- Four layers: Transport → VAD → STT/TTS → LLM
- Each component is swappable
- Framework abstraction vs direct component use
- **Speech-to-Text (STT) Providers** (~5 min)
- Deepgram Nova-3: ~90ms latency, $0.0077/min, production leader
- AssemblyAI: Strong accuracy, good streaming support
- OpenAI Whisper: Open weights, local deployment option, higher latency
- Decision factors: Latency, cost, accuracy, streaming quality
- In the cascaded pipeline: which STT for which use case
- **Text-to-Speech (TTS) Providers** (~5 min)
- Cartesia Sonic-3: 40-90ms latency, low-cost, high quality
- ElevenLabs: Premium quality, voice cloning, higher cost
- Deepgram Aura: Fast streaming, good for real-time
- PlayHT: Wide voice selection, multilingual
- Decision factors: Voice quality, latency, cloning needs, cost
- **Voice Activity Detection (VAD)** (~5 min)
- Why VAD matters: Detect when user is speaking vs silence
- Turn-taking: Knowing when to interrupt, when to wait
- Silero VAD: Industry standard, <1ms latency, free/open
- Semantic turn detection: Beyond acoustic to semantic understanding
- LiveKit's innovation: Transformer-based turn detection
- **Transport Protocols** (~5 min)
- WebRTC: Real-time, peer-to-peer, built for voice/video
  - Pros: Low latency, NAT traversal, browser-native
  - Cons: Complex setup, TURN/STUN servers needed
- WebSocket: Simpler, server-mediated, HTTP-friendly
  - Pros: Easy setup, firewall-friendly
  - Cons: Higher latency, not optimized for audio
- Decision: WebRTC for production voice, WebSocket for prototyping
- **Putting It Together** (~2 min) (see the stack sketch after this list)
- Typical production stack: WebRTC + Silero VAD + Deepgram + GPT-4o + Cartesia
- Economy stack: ~$0.033/minute total
- Premium stack: OpenAI Realtime (~$0.11/minute)
- Framework handles the glue - you choose components
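Planning aid: a small descriptor of the typical production stack named above, so "the framework handles the glue, you choose components" has a concrete shape. Latency figures and the Deepgram price are the ones quoted in this plan; other per-minute costs are left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Component:
    layer: str
    choice: str
    latency_ms: Optional[float] = None      # per-turn figures quoted in this plan
    cost_per_min: Optional[float] = None    # filled in only where the plan states it

TYPICAL_PRODUCTION_STACK = [
    Component("Transport", "WebRTC"),
    Component("VAD", "Silero VAD", latency_ms=1),
    Component("STT", "Deepgram Nova-3", latency_ms=90, cost_per_min=0.0077),
    Component("LLM", "GPT-4o", latency_ms=200),
    Component("TTS", "Cartesia Sonic-3", latency_ms=75),
]

known_latency = sum(c.latency_ms for c in TYPICAL_PRODUCTION_STACK if c.latency_ms)
print(f"Known per-turn latency (excluding ~40 ms mic capture): ~{known_latency:.0f} ms")
# The plan quotes the full economy stack at roughly $0.033/minute all-in.
```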
Duration Estimate: 25 minutes
Try With AI Prompts:
- **Build Your Stack**
I want to build a voice agent with these requirements:
- Use case: [describe your use case]
- Priority: [cost / quality / latency - pick one primary]
- Volume: [expected calls per day]
- Budget: [max cost per minute]
Help me design the optimal stack. For each component (STT, TTS,
VAD, transport), recommend a specific provider and explain why.
Calculate the total cost per minute.
**What you're learning:** Stack design methodology - translating requirements into component selections.
- **Provider Deep Dive**
I'm trying to understand the TTS landscape better. Compare these
three providers for me:
1. Cartesia Sonic-3
2. ElevenLabs
3. Deepgram Aura
For each: What's their unique strength? When would you absolutely
choose them over alternatives? When would you avoid them? Include
specific use cases from production voice agents.
**What you're learning:** Vendor evaluation skills - building nuanced understanding of provider tradeoffs beyond marketing claims.
- **WebRTC vs WebSocket Decision** (a decision-helper sketch follows these prompts)
I'm building a voice interface and debating transport protocol:
My context:
- [Browser-based / Phone-based / Both]
- Team experience: [No WebRTC experience / Some / Expert]
- Timeline: [MVP in 2 weeks / Production in 3 months]
- Scale: [Demo / Thousands of concurrent users]
Help me decide: WebRTC or WebSocket? What's the migration path if
I start with one and need to switch? What will I regret if I
choose wrong?
**What you're learning:** Technology selection under constraints - making pragmatic decisions with incomplete information.
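A tiny helper encoding this plan's own rule of thumb for transport selection; the notes in the returned strings restate the pros and cons listed in Lesson 3 and are not an exhaustive rubric.

```python
# Rule of thumb from this plan: WebRTC for production voice, WebSocket for prototyping.

def pick_transport(production_voice: bool) -> str:
    if production_voice:
        return ("WebRTC: low latency, NAT traversal, browser-native "
                "(budget time for TURN/STUN setup)")
    return "WebSocket: easy setup, firewall-friendly (higher latency; fine for demos)"

print(pick_transport(production_voice=False))   # prototyping path
print(pick_transport(production_voice=True))    # production path
```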
IV. Chapter README
Create a README.md for the chapter folder:
---
sidebar_position: 1
title: "Chapter 79: Voice AI Fundamentals"
---
# Chapter 79: Voice AI Fundamentals
Build mental models before building systems. This chapter prepares you for voice AI implementation by teaching the landscape, architectures, and technology stack.
## What You'll Learn
- Why voice is the natural interface for Digital FTEs
- Two dominant architectures: Native S2S vs Cascaded Pipeline
- When to use which: Decision matrices for real choices
- The technology stack: STT, TTS, VAD, and transport protocols
- Framework positioning: LiveKit Agents vs Pipecat
## Lesson Progression
| Lesson | Title | Duration | Focus |
|--------|-------|----------|-------|
| 1 | The Voice AI Landscape | 20 min | Market, frameworks, positioning |
| 2 | Voice AI Architectures | 25 min | S2S vs Cascaded, latency budgets |
| 3 | The Voice AI Technology Stack | 25 min | STT, TTS, VAD, transport protocols |
## Prerequisites
This chapter requires:
- Part 6 (AI Native): Agent APIs you'll add voice to
- Part 7 (Cloud Native): Deployment infrastructure
- Part 9 (TypeScript): Async patterns, WebSocket communication
- Part 10 (Frontends): Chat UIs you'll extend with voice
## What's Next
After this conceptual foundation:
- **Chapter 80**: LiveKit Agents - Build with the framework powering ChatGPT Voice
- **Chapter 81**: Pipecat - Build with frame-based pipeline flexibility
- **Chapter 82**: OpenAI Realtime API - Direct S2S access
- **Chapter 83**: Gemini Live API - Multimodal voice + vision
V. Validation Checklist
Chapter-Level Validation:
- Chapter type identified: CONCEPTUAL (no code, mental models)
- Concept density analysis documented: 8 concepts across 3 lessons
- Lesson count justified: 3 lessons (~2.7 concepts each, well within B1 limit)
- All evals covered by lessons
- All lessons map to at least one eval
Stage Progression Validation:
- All lessons are Layer 1 (Manual Foundation) - appropriate for conceptual chapter
- No premature AI collaboration (Layer 2 comes in Chapters 80+)
- No skill creation (Layer 3 comes in Chapters 80, 81, 84)
- No spec-driven content (Layer 4 is Chapter 85 Capstone)
Cognitive Load Validation:
- Lesson 1: 3 concepts <= 10 (B1 limit) PASS
- Lesson 2: 3 concepts <= 10 (B1 limit) PASS
- Lesson 3: 2 concepts <= 10 (B1 limit) PASS
Conceptual Chapter Requirements:
- Essay-style sections (not code-focused lessons)
- Decision frameworks provided (architecture selection, framework choice)
- "Try With AI" prompts are exploratory (not coding exercises)
- Prepares students for technical chapters that follow
VI. File Structure
61-voice-ai-fundamentals/
├── _category_.json # Existing
├── README.md # Chapter overview (create)
├── 01-voice-ai-landscape.md # Lesson 1 (create)
├── 02-voice-ai-architectures.md # Lesson 2 (create)
├── 03-voice-technology-stack.md # Lesson 3 (create)
└── 04-chapter-quiz.md # Assessment (create)
VII. Summary
Chapter 79: Voice AI Fundamentals is a 3-lesson conceptual chapter that builds mental models for voice AI before implementation:
| Lesson | Title | Concepts | Duration | Evals |
|---|---|---|---|---|
| 1 | The Voice AI Landscape | 3 | 20 min | #1, #2 |
| 2 | Voice AI Architectures | 3 | 25 min | #3, #4 |
| 3 | The Voice AI Technology Stack | 2 | 25 min | #5 |
Total: 8 concepts, 70 minutes, prepares students for Chapters 80-85 implementation.