
Chapter 79: Voice AI Fundamentals - Lesson Plan

Generated by: chapter-planner v2.0.0 (Reasoning-Activated)
Source: Part 11 README, Deep Search Report
Created: 2026-01-01
Constitution: v6.0.0 (Reasoning Mode)


I. Chapter Analysis

Chapter Type

CONCEPTUAL - This chapter builds mental models before building systems. No code implementation, no skill creation. Students need to understand the landscape, architectures, and technology stack before Chapters 80-85 introduce implementation.

Recognition signals:

  • Learning objectives use "understand/explain/compare/evaluate"
  • No code examples required
  • Focus on decision frameworks and mental models
  • Prepares students for technical chapters that follow

Concept Density Analysis

Core Concepts (from Part 11 README + Deep Search): 8 concepts

  1. Voice AI market reality and opportunity
  2. Framework-first vs API-first approaches
  3. LiveKit vs Pipecat positioning
  4. Native Speech-to-Speech (S2S) architecture
  5. Cascaded Pipeline (STT → LLM → TTS) architecture
  6. Latency budgets and tradeoffs
  7. Voice technology stack (STT, TTS, VAD providers)
  8. Transport protocols (WebRTC vs WebSocket)

Complexity Assessment: Standard (conceptual with decision frameworks)

Proficiency Tier: B1 (Part 11 requires Parts 6, 7, 9, 10 completed)

Justified Lesson Count: 3 lessons

  • Lesson 1: Landscape orientation (concepts 1-3)
  • Lesson 2: Architecture deep-dive (concepts 4-6)
  • Lesson 3: Technology stack (concepts 7-8)

Reasoning: 8 concepts across 3 lessons = ~2.7 concepts per lesson, well within B1 limit of 10. Conceptual chapters benefit from focused, shorter lessons that build mental models progressively.


II. Success Evals (Derived from Part 11 README)

Success Criteria (what students must achieve):

  1. Landscape Understanding: Students can explain why voice matters for Digital FTEs and articulate the $5.4B → $47.5B market opportunity
  2. Framework Decision: Students can compare LiveKit Agents vs Pipecat and justify which to use for a given scenario
  3. Architecture Selection: Students can explain when to use Native S2S (~$0.45 per 4-minute call, 200-300ms) vs the Cascaded Pipeline (~$0.03/min, 330-580ms)
  4. Latency Budget: Students can articulate the latency budget breakdown (40ms mic → 90ms STT → 200ms LLM → 75ms TTS)
  5. Stack Components: Students can name the key providers (Deepgram Nova-3, Cartesia Sonic-3, Silero VAD) and their roles

All lessons below map to these evals.


III. Lesson Sequence


Lesson 1: The Voice AI Landscape

Title: The Voice AI Landscape

Learning Objectives:

  • Explain why voice is the natural interface for Digital FTEs (accessibility, 24/7 availability, emotional connection)
  • Articulate the market reality ($5.4B 2024 → $47.5B 2034) and what it signals for career investment
  • Compare framework-first (LiveKit, Pipecat) vs API-first (raw OpenAI Realtime) approaches and identify when each applies
  • Evaluate LiveKit Agents vs Pipecat positioning using concrete differentiators (ChatGPT Voice, SIP support, 40+ integrations)

Stage: Layer 1 (Manual Foundation) - First exposure to voice AI concepts

CEFR Proficiency: B1

New Concepts (count: 3):

  1. Voice as Digital FTE interface (why voice, not just text)
  2. Framework-first vs API-first thinking
  3. LiveKit vs Pipecat strategic positioning

Cognitive Load Validation: 3 concepts <= 10 limit (B1) -> WITHIN LIMIT

Maps to Evals: #1 (Landscape Understanding), #2 (Framework Decision)

Key Sections:

  1. Why Voice Changes Everything for AI (~5 min)

    • Voice is natural: humans evolved to speak, not type
    • 24/7 availability (voice agents answer phones at 3 AM)
    • Hands-free interaction (driving, cooking, working)
    • Emotional connection (tone, empathy, urgency)
    • Accessibility (serves users who cannot type)
  2. The Market Reality (~5 min)

    • $5.4B (2024) → $47.5B (2034) market projection
    • LiveKit Agents powers ChatGPT's Advanced Voice Mode
    • Pipecat integrates 40+ AI services
    • Infrastructure is production-ready, not experimental
  3. Framework-First Thinking (~5 min)

    • Traditional teaching: raw APIs, "three-model pipeline"
    • Modern reality: frameworks abstract complexity
    • Decision framework: Use frameworks for production, APIs for edge cases
    • Why we teach frameworks first, then direct APIs
  4. LiveKit vs Pipecat: Strategic Positioning (~5 min)

    • LiveKit Agents (8,200+ stars): Powers ChatGPT Voice, native SIP, semantic turn detection
    • Pipecat (8,900+ stars): 40+ integrations, frame-based pipeline, vendor neutral
    • Decision matrix: When to choose which
    • Not competitors - different philosophies for different needs

Duration Estimate: 20 minutes

Try With AI Prompts:

  1. Explore Your Voice AI Position

    I'm learning about voice AI for Digital FTEs. I currently work in
    [your field/industry]. Help me understand:

    1. What voice interactions already exist in my field?
    2. Where are the gaps - tasks that SHOULD be voice but aren't?
    3. What would a voice-enabled Digital FTE look like for my domain?

    Ask me clarifying questions about my specific workflows.

    What you're learning: Domain translation - connecting abstract market trends to your specific professional context.

  2. Stress-Test the Framework-First Claim

    I've been told to "use frameworks first, raw APIs for edge cases"
    in voice AI. Help me challenge this:

    1. When would raw API access actually be NECESSARY?
    2. What do you lose by using frameworks vs direct APIs?
    3. Are there scenarios where framework abstraction hurts more than helps?

    Give me specific technical scenarios, not just abstract principles.

    What you're learning: Critical evaluation of pedagogical claims - understanding the tradeoffs behind teaching decisions.

  3. Compare Framework Philosophies

    LiveKit Agents and Pipecat both build voice AI, but they have
    different philosophies:

    - LiveKit: "Powers ChatGPT Voice, native SIP, semantic turn detection"
    - Pipecat: "40+ integrations, frame-based pipeline, vendor neutral"

    Help me understand: What KIND of project would make LiveKit the
    obvious choice? What KIND would make Pipecat better? Ask me about
    what I'm trying to build so we can figure out which fits my needs.

    What you're learning: Decision framework construction - building intuition for technology selection through concrete scenarios.


Lesson 2: Voice AI Architectures

Title: Voice AI Architectures

Learning Objectives:

  • Explain how Native Speech-to-Speech (OpenAI Realtime, Gemini Live) achieves 200-300ms latency
  • Describe the Cascaded Pipeline (STT → LLM → TTS) architecture and its latency budget breakdown
  • Compare cost/latency tradeoffs: S2S ($0.45/4min) vs Cascaded ($0.03/min)
  • Apply decision matrix to select appropriate architecture for specific use cases

Stage: Layer 1 (Manual Foundation) - Architecture mental models

CEFR Proficiency: B1

New Concepts (count: 3):

  1. Native Speech-to-Speech architecture
  2. Cascaded Pipeline architecture
  3. Latency budget decomposition

Cognitive Load Validation: 3 concepts <= 10 limit (B1) -> WITHIN LIMIT

Maps to Evals: #3 (Architecture Selection), #4 (Latency Budget)

Key Sections:

  1. The Two Architectures (~5 min)

    • Native S2S: Single model handles audio-in → audio-out
    • Cascaded: STT → LLM → TTS pipeline
    • Visual comparison: unified model vs component pipeline
    • Both are valid - different tradeoffs
  2. Native Speech-to-Speech Deep Dive (~5 min)

    • OpenAI Realtime API: gpt-realtime model
    • Gemini Live API: Gemini 2.5 Flash Native Audio
    • How it works: model trained on audio directly
    • Latency: 200-300ms end-to-end
    • Cost: ~$0.45-0.50 per 4-minute call
    • Advantage: natural prosody, emotional intelligence
  3. Cascaded Pipeline Deep Dive (~5 min)

    • STT (Speech-to-Text): Deepgram Nova-3 (~90ms)
    • LLM (Language Model): GPT-4o-mini (200-400ms)
    • TTS (Text-to-Speech): Cartesia Sonic-3 (40-90ms)
    • Total latency: 330-580ms (varies by component choice)
    • Cost: ~$0.03/minute
    • Advantage: component flexibility, cost control
  4. Latency Budgets: Where Time Goes (~5 min)

    • Breakdown: 40ms mic → 90ms STT → 200ms LLM → 75ms TTS (≈405ms round trip; worked through in the sketch after this list)
    • Why each component matters
    • Which component to optimize for your use case
    • Human perception: <300ms feels instant, >500ms feels slow
  5. Decision Matrix: When to Use Which (~5 min)

    • Native S2S when: Premium experience matters, budget allows, emotional intelligence needed
    • Cascaded when: Cost-sensitive, need component flexibility, high volume
    • Hybrid approaches: S2S for the greeting, cascaded for the bulk of the conversation
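
The latency-budget and cost comparisons above reduce to simple arithmetic. A minimal Python sketch using only the figures quoted in this plan (the 100 calls/day volume is an illustrative assumption) that the lesson can adapt as a worked example:

```python
# Cascaded pipeline latency budget (milliseconds), from the breakdown above.
mic_ms, stt_ms, llm_ms, tts_ms = 40, 90, 200, 75
cascaded_latency_ms = mic_ms + stt_ms + llm_ms + tts_ms
print(f"Cascaded round trip: {cascaded_latency_ms} ms")  # 405 ms, under the ~500 ms "feels slow" threshold

# Cost comparison for a 4-minute call, per the figures in this lesson.
s2s_cost_per_call = 0.45           # Native S2S: ~$0.45 per 4-minute call
cascaded_cost_per_call = 0.03 * 4  # Cascaded: ~$0.03 per minute
print(f"Per call: S2S ${s2s_cost_per_call:.2f} vs cascaded ${cascaded_cost_per_call:.2f}")

# The gap compounds with volume (100 calls/day is an assumed, illustrative figure).
calls_per_day = 100
daily_gap = (s2s_cost_per_call - cascaded_cost_per_call) * calls_per_day
print(f"Daily cost difference at {calls_per_day} calls/day: ${daily_gap:.2f}")  # $33.00
```

At 100 calls/day the architectures differ by roughly $33/day (about $1,000/month), which is the kind of concrete number students should be able to produce when working the "Calculate Your Cost Model" prompt below.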

Duration Estimate: 25 minutes

Try With AI Prompts:

  1. Calculate Your Cost Model

    I'm planning a voice agent for [describe your use case: customer
    support, sales calls, appointment scheduling, etc.]. Help me model
    the economics:

    - Expected call volume: [X calls/day]
    - Average call length: [Y minutes]
    - Budget constraints: [describe]

    Calculate: Native S2S vs Cascaded costs. At what call volume does
    the cost difference become significant? When does S2S quality
    justify its premium?

    What you're learning: Economic reasoning about architecture choices - understanding when quality premiums are justified.

  2. Diagnose Latency Problems

    I have a voice agent with latency issues - users complain it feels
    "slow." The breakdown is:

    - STT: 150ms (using Whisper)
    - LLM: 400ms (using GPT-4o)
    - TTS: 200ms (using ElevenLabs)
    - Total: 750ms

    Help me diagnose: Which component should I optimize first? What
    alternatives exist for each? What's realistic to achieve?

    What you're learning: Latency debugging methodology - systematic approach to identifying and fixing performance bottlenecks.

  3. Architecture Trade-off Analysis

    I'm building a voice Digital FTE for [your domain]. I need to
    choose between:

    Option A: Native S2S (OpenAI Realtime) - 250ms, $0.12/min
    Option B: Cascaded (Deepgram + GPT-4o-mini + Cartesia) - 450ms, $0.03/min

    Help me think through: What user experience differences will
    customers actually notice? Is 200ms difference perceptible? When
    does cost savings outweigh quality? What would YOU choose for my
    specific use case?

    What you're learning: Multi-criteria decision making - balancing competing concerns (cost, quality, user experience) for real decisions.


Lesson 3: The Voice AI Technology Stack

Title: The Voice AI Technology Stack

Learning Objectives:

  • Identify key STT providers (Deepgram, AssemblyAI, Whisper) and their differentiators
  • Compare TTS providers (Cartesia, ElevenLabs, Deepgram Aura) for quality, latency, and cost
  • Explain why VAD (Voice Activity Detection) matters and how Silero VAD works
  • Distinguish WebRTC vs WebSocket transport protocols and when each applies

Stage: Layer 1 (Manual Foundation) - Technology stack orientation

CEFR Proficiency: B1

New Concepts (count: 2):

  1. Voice AI technology stack components (STT, TTS, VAD)
  2. Transport protocols (WebRTC vs WebSocket)

Cognitive Load Validation: 2 concepts <= 10 limit (B1) -> WITHIN LIMIT

Maps to Evals: #5 (Stack Components)

Key Sections:

  1. The Component Stack Overview (~3 min)

    • Four layers: Transport → VAD → STT/TTS → LLM
    • Each component is swappable
    • Framework abstraction vs direct component use
  2. Speech-to-Text (STT) Providers (~5 min)

    • Deepgram Nova-3: ~90ms latency, $0.0077/min, production leader
    • AssemblyAI: Strong accuracy, good streaming support
    • OpenAI Whisper: Open weights, local deployment option, higher latency
    • Decision factors: Latency, cost, accuracy, streaming quality
    • The cascade: Which STT for which use case
  3. Text-to-Speech (TTS) Providers (~5 min)

    • Cartesia Sonic-3: 40-90ms latency, low-cost, high quality
    • ElevenLabs: Premium quality, voice cloning, higher cost
    • Deepgram Aura: Fast streaming, good for real-time
    • PlayHT: Wide voice selection, multilingual
    • Decision factors: Voice quality, latency, cloning needs, cost
  4. Voice Activity Detection (VAD) (~5 min)

    • Why VAD matters: Detect when user is speaking vs silence
    • Turn-taking: Knowing when to interrupt, when to wait
    • Silero VAD: Industry standard, <1ms latency, free/open (loading sketch shown after this list)
    • Semantic turn detection: Beyond acoustic to semantic understanding
    • LiveKit's innovation: Transformer-based turn detection
  5. Transport Protocols (~5 min)

    • WebRTC: Real-time, peer-to-peer, built for voice/video
      • Pros: Low latency, NAT traversal, browser-native
      • Cons: Complex setup, TURN/STUN servers needed
    • WebSocket: Simpler, server-mediated, HTTP-friendly
      • Pros: Easy setup, firewall-friendly
      • Cons: Higher latency, not optimized for audio
    • Decision: WebRTC for production voice, WebSocket for prototyping (see the streaming sketch after this list)
  6. Putting It Together (~2 min)

    • Typical production stack: WebRTC + Silero VAD + Deepgram + GPT-4o + Cartesia
    • Economy stack: ~$0.033/minute total
    • Premium stack: OpenAI Realtime (~$0.11/minute)
    • Framework handles the glue - you choose components
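
Since the plan names Silero VAD as the industry standard, a short loading sketch can ground the VAD discussion. This follows the usage documented in the snakers4/silero-vad repository; treat the exact hub call and helper names as something to verify against the current README, and note that `caller.wav` is a placeholder file:

```python
import torch

# Load Silero VAD and its helper utilities via torch.hub,
# as documented in the snakers4/silero-vad repository.
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

# "caller.wav" is a placeholder recording; Silero expects 16 kHz (or 8 kHz) audio.
wav = read_audio("caller.wav", sampling_rate=16000)

# Returns detected speech segments as sample offsets,
# e.g. [{'start': ..., 'end': ...}, ...]
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)
```

This is the acoustic layer only; the semantic turn detection mentioned above (e.g. LiveKit's transformer-based detector) sits on top of a VAD like this.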
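
The WebRTC-vs-WebSocket tradeoff also lands better with a concrete sketch. Below is a minimal WebSocket audio push using the Python `websockets` library; the gateway URL, chunk size, and "EOS" end-of-stream marker are illustrative assumptions, not any provider's real protocol:

```python
import asyncio
import websockets

CHUNK_BYTES = 3200  # ~100 ms of 16 kHz, 16-bit mono PCM

async def stream_audio(path: str, url: str = "ws://localhost:8765") -> None:
    """Push raw PCM audio to a hypothetical STT gateway over a WebSocket."""
    async with websockets.connect(url) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_BYTES):
                await ws.send(chunk)  # one binary frame per audio chunk
        await ws.send("EOS")          # end-of-stream marker (our own convention)
        print(await ws.recv())        # read back whatever the server returns

if __name__ == "__main__":
    asyncio.run(stream_audio("caller.pcm"))
```

The simplicity is the point: everything is a frame over one TCP connection, which is why WebSocket prototypes come together quickly, and also why they cannot match WebRTC's jitter handling and NAT traversal for production voice.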

Duration Estimate: 25 minutes

Try With AI Prompts:

  1. Build Your Stack

    I want to build a voice agent with these requirements:

    - Use case: [describe your use case]
    - Priority: [cost / quality / latency - pick one primary]
    - Volume: [expected calls per day]
    - Budget: [max cost per minute]

    Help me design the optimal stack. For each component (STT, TTS,
    VAD, transport), recommend a specific provider and explain why.
    Calculate the total cost per minute.

    What you're learning: Stack design methodology - translating requirements into component selections.

  2. Provider Deep Dive

    I'm trying to understand the TTS landscape better. Compare these
    three providers for me:

    1. Cartesia Sonic-3
    2. ElevenLabs
    3. Deepgram Aura

    For each: What's their unique strength? When would you absolutely
    choose them over alternatives? When would you avoid them? Include
    specific use cases from production voice agents.

    What you're learning: Vendor evaluation skills - building nuanced understanding of provider tradeoffs beyond marketing claims.

  3. WebRTC vs WebSocket Decision

    I'm building a voice interface and debating transport protocol:

    My context:
    - [Browser-based / Phone-based / Both]
    - Team experience: [No WebRTC experience / Some / Expert]
    - Timeline: [MVP in 2 weeks / Production in 3 months]
    - Scale: [Demo / Thousands of concurrent users]

    Help me decide: WebRTC or WebSocket? What's the migration path if
    I start with one and need to switch? What will I regret if I
    choose wrong?

    What you're learning: Technology selection under constraints - making pragmatic decisions with incomplete information.


IV. Chapter README

Create a README.md for the chapter folder:

---
sidebar_position: 1
title: "Chapter 79: Voice AI Fundamentals"
---

# Chapter 79: Voice AI Fundamentals

Build mental models before building systems. This chapter prepares you for voice AI implementation by teaching the landscape, architectures, and technology stack.

## What You'll Learn

- Why voice is the natural interface for Digital FTEs
- Two dominant architectures: Native S2S vs Cascaded Pipeline
- When to use which: Decision matrices for real choices
- The technology stack: STT, TTS, VAD, and transport protocols
- Framework positioning: LiveKit Agents vs Pipecat

## Lesson Progression

| Lesson | Title | Duration | Focus |
|--------|-------|----------|-------|
| 1 | The Voice AI Landscape | 20 min | Market, frameworks, positioning |
| 2 | Voice AI Architectures | 25 min | S2S vs Cascaded, latency budgets |
| 3 | The Voice AI Technology Stack | 25 min | STT, TTS, VAD, transport protocols |

## Prerequisites

This chapter requires:
- Part 6 (AI Native): Agent APIs you'll add voice to
- Part 7 (Cloud Native): Deployment infrastructure
- Part 9 (TypeScript): Async patterns, WebSocket communication
- Part 10 (Frontends): Chat UIs you'll extend with voice

## What's Next

After this conceptual foundation:
- **Chapter 80**: LiveKit Agents - Build with the framework powering ChatGPT Voice
- **Chapter 81**: Pipecat - Build with frame-based pipeline flexibility
- **Chapter 82**: OpenAI Realtime API - Direct S2S access
- **Chapter 83**: Gemini Live API - Multimodal voice + vision

V. Validation Checklist

Chapter-Level Validation:

  • Chapter type identified: CONCEPTUAL (no code, mental models)
  • Concept density analysis documented: 8 concepts across 3 lessons
  • Lesson count justified: 3 lessons (~2.7 concepts each, well within B1 limit)
  • All evals covered by lessons
  • All lessons map to at least one eval

Stage Progression Validation:

  • All lessons are Layer 1 (Manual Foundation) - appropriate for conceptual chapter
  • No premature AI collaboration (Layer 2 comes in Chapters 80+)
  • No skill creation (Layer 3 comes in Chapters 80, 81, 84)
  • No spec-driven content (Layer 4 is Chapter 85 Capstone)

Cognitive Load Validation:

  • Lesson 1: 3 concepts <= 10 (B1 limit) PASS
  • Lesson 2: 3 concepts <= 10 (B1 limit) PASS
  • Lesson 3: 2 concepts <= 10 (B1 limit) PASS

Conceptual Chapter Requirements:

  • Essay-style sections (not code-focused lessons)
  • Decision frameworks provided (architecture selection, framework choice)
  • "Try With AI" prompts are exploratory (not coding exercises)
  • Prepares students for technical chapters that follow

VI. File Structure

61-voice-ai-fundamentals/
├── _category_.json # Existing
├── README.md # Chapter overview (create)
├── 01-voice-ai-landscape.md # Lesson 1 (create)
├── 02-voice-ai-architectures.md # Lesson 2 (create)
├── 03-voice-technology-stack.md # Lesson 3 (create)
└── 04-chapter-quiz.md # Assessment (create)

VII. Summary

Chapter 79: Voice AI Fundamentals is a 3-lesson conceptual chapter that builds mental models for voice AI before implementation:

| Lesson | Title | Concepts | Duration | Evals |
|--------|-------|----------|----------|-------|
| 1 | The Voice AI Landscape | 3 | 20 min | #1, #2 |
| 2 | Voice AI Architectures | 3 | 25 min | #3, #4 |
| 3 | The Voice AI Technology Stack | 2 | 25 min | #5 |

Total: 8 concepts, 70 minutes, preparing students for the implementation work in Chapters 80-85.