
Chapter 81: Pipecat - Lesson Plan

Generated by: chapter-planner v2.0.0 (Reasoning-Activated)
Source: Part 11 README, Deep Search Report, Chapter 80 context
Created: 2026-01-01
Constitution: v6.0.0 (Reasoning Mode)


I. Chapter Analysis

Chapter Type

TECHNICAL (SKILL-FIRST L00 Pattern) - This chapter uses the Skill-First Learning pattern where students build the pipecat skill FIRST from official documentation, then learn the framework by improving that skill across subsequent lessons.

Recognition signals:

  • Learning objectives use "implement/create/build"
  • Code examples required for every lesson
  • Skill artifact created in Lesson 0
  • Subsequent lessons TEST and IMPROVE the skill
  • Follows L00 pattern established in Parts 5-7 and Chapter 80

Concept Density Analysis

Core Concepts (from Deep Search + Part 11 README): 9 concepts

  1. LEARNING-SPEC.md (skill specification before building)
  2. Frame-based pipeline architecture (core abstraction)
  3. Frame types: AudioRawFrame, TextFrame, EndFrame, control signals
  4. Processors: Transformations on frame streams
  5. Pipelines: Composition of processors
  6. Transport abstraction (Daily WebRTC, FastAPI WebSocket, local audio)
  7. Provider plugins (40+ integrations: STT, LLM, TTS providers)
  8. Speech-to-Speech integration (OpenAI Realtime, Gemini Live, Nova Sonic)
  9. Custom processor implementation

Complexity Assessment: Standard (framework with clear abstractions, modular design)

Proficiency Tier: B1-B2 (Part 11 requires Parts 6, 7, 9, 10 completed; students have production experience)

Justified Lesson Count: 3 lessons

  • Lesson 0: Build Your Pipecat Skill (L00 pattern)
  • Lesson 1: Frame-Based Pipeline Architecture (Layer 2: AI Collaboration)
  • Lesson 2: Multi-Provider Integration & Custom Processors (Layer 2 + Layer 3)

Reasoning:

  • 9 concepts across 3 lessons (2 + 4 + 3) averages 3 concepts per lesson
  • B1-B2 limit is 10 concepts per lesson - well within limit
  • Pipecat is conceptually simpler than LiveKit (modular vs distributed)
  • Chapter 80 covered similar voice AI fundamentals, reducing cognitive load
  • 3 lessons sufficient for skill-first framework mastery

II. Success Evals (from Part 11 README + Chapter 81 Description)

Success Criteria (what students must achieve):

  1. Skill Creation: Students build a working pipecat skill from official documentation using /fetching-library-docs
  2. Frame Understanding: Students can explain frames as the data unit flowing through pipelines
  3. Pipeline Architecture: Students implement voice agents with processor composition
  4. Transport Flexibility: Students configure different transports (Daily, WebSocket, local)
  5. Provider Integration: Students connect multiple STT/LLM/TTS providers via plugins
  6. S2S Integration: Students use OpenAI Realtime, Gemini Live, or Nova Sonic through Pipecat
  7. Custom Processors: Students implement custom processors for domain-specific transformations

All lessons below map to these evals.


III. Lesson Sequence


Lesson 0: Build Your Pipecat Skill

Title: Build Your Pipecat Skill

Learning Objectives:

  • Write a LEARNING-SPEC.md that defines what you want to learn about Pipecat
  • Fetch official Pipecat documentation using /fetching-library-docs
  • Create a pipecat skill grounded in official documentation (not AI memory)
  • Verify the skill works by building a minimal voice agent pipeline

Stage: Layer 1 (Manual Foundation) + Layer 2 (AI Collaboration via /skill-creator)

CEFR Proficiency: B1

New Concepts (count: 2):

  1. LEARNING-SPEC.md (specification before skill creation)
  2. Documentation-grounded skill creation

Cognitive Load Validation: 2 concepts <= 10 limit (B1) -> WITHIN LIMIT

Maps to Evals: #1 (Skill Creation)

Key Sections:

  1. Clone the Skills Lab Fresh (~3 min)

    • Why fresh clone: No state assumptions from previous work
    • Command: git clone [skills-lab-repo] && cd skills-lab
    • Verify clean environment
  2. Write Your LEARNING-SPEC.md (~7 min)

    • What is LEARNING-SPEC.md: Your specification for what you want to learn
    • Template structure:
      # Learning Specification: Pipecat

      ## What I Want to Learn
      - How Pipecat's frame-based pipeline works
      - How to compose processors into voice agents
      - How to integrate multiple AI providers
      - How to use S2S models through Pipecat

      ## Why This Matters
      - Pipecat has 40+ provider integrations
      - Frame-based design enables custom transformations
      - Transport-agnostic means flexible deployment

      ## Success Criteria
      - [ ] Skill can scaffold a basic voice pipeline
      - [ ] Skill explains frame types and flow
      - [ ] Skill guides multi-provider configuration
      - [ ] Skill includes custom processor patterns
    • Write YOUR specification (not a copy)
  3. Fetch Official Documentation (~5 min)

    • Use /fetching-library-docs to get Pipecat docs
    • Why official docs: AI memory is unreliable for API details
    • What to look for: Frame architecture, processors, transports, plugins
    • Save relevant excerpts for skill creation
  4. Create Your Skill with /skill-creator (~10 min)

    • Invoke /skill-creator with your LEARNING-SPEC.md and fetched docs
    • Skill structure: Persona + Questions + Principles
    • Review generated skill for accuracy
    • Commit to .claude/skills/pipecat/SKILL.md
  5. Verify Your Skill Works (~5 min)

    • Test: "Create a minimal Pipecat voice pipeline"
    • Verify generated code matches official patterns
    • If issues found: Improve skill and re-test
    • Skill is now your knowledge artifact

Duration Estimate: 30 minutes

File Output: .claude/skills/pipecat/SKILL.md

Prerequisites:

  • Part 10 completed (chat interfaces)
  • Chapter 79 completed (voice AI fundamentals)
  • Chapter 80 recommended (LiveKit comparison context)

Try With AI Prompts:

  1. Draft Your LEARNING-SPEC.md

    I'm about to learn Pipecat, a frame-based voice AI framework with
    40+ provider integrations. Help me write a LEARNING-SPEC.md:

    My context:
    - I just learned LiveKit Agents in Chapter 80
    - I want to understand Pipecat's different approach (frames vs jobs)
    - My goal is flexibility in provider selection for my Digital FTEs

    Help me define:
    1. What specific aspects of Pipecat should I focus on?
    2. How does it differ from LiveKit (what's unique)?
    3. What success criteria would prove I've learned it?

    Make this MY specification, not a generic template.

    What you're learning: Comparative specification - defining learning goals relative to what you already know.

  2. Analyze the Official Docs

    I fetched Pipecat documentation. Here are the key sections:
    [paste relevant excerpts from /fetching-library-docs output]

    Help me understand:
    1. What are frames? How do they flow through pipelines?
    2. What patterns appear repeatedly in the examples?
    3. What makes Pipecat different from LiveKit's approach?
    4. What should my skill definitely include?

    I want to build a GROUNDED skill, not one based on assumptions.

    What you're learning: Documentation analysis - extracting unique patterns from primary sources.

  3. Review Your Generated Skill

    Here's the skill /skill-creator generated:
    [paste SKILL.md content]

    Compare this to the official documentation:
    1. Does it accurately represent Pipecat's frame architecture?
    2. Are there any claims not supported by the docs?
    3. How does it compare to my livekit-agents skill?
    4. What should be removed as incorrect or speculative?

    Help me make this skill ACCURATE and DISTINCT from LiveKit.

    What you're learning: Validation - ensuring AI-generated content matches authoritative sources.


Lesson 1: Frame-Based Pipeline Architecture

Title: Frame-Based Pipeline Architecture

Learning Objectives:

  • Explain frames as the fundamental data unit in Pipecat pipelines
  • Distinguish frame types: AudioRawFrame, TextFrame, EndFrame, control signals
  • Implement processors that transform frame streams
  • Compose processors into complete voice pipelines
  • Configure different transports (Daily WebRTC, WebSocket, local)

Stage: Layer 2 (AI Collaboration) - Use skill to build, improve skill based on learnings

CEFR Proficiency: B1

New Concepts (count: 4):

  1. Frame-based architecture (core abstraction)
  2. Frame types and their purposes
  3. Processors and pipelines
  4. Transport abstraction

Cognitive Load Validation: 4 concepts <= 10 limit (B1) -> WITHIN LIMIT

Maps to Evals: #2 (Frame Understanding), #3 (Pipeline Architecture), #4 (Transport Flexibility)

Key Sections:

  1. The Frame Abstraction (~7 min)

    • What is a frame: The data unit flowing through pipelines
    • Why frames: Uniform interface for diverse data types
    • Frame lifecycle: Creation, transformation, consumption
    • Comparison to LiveKit: Jobs vs Frames mental model
    • Diagram: Frame flow through pipeline
  2. Frame Types (~8 min)

    • AudioRawFrame: Raw audio data (samples, sample rate, channels)
    • TextFrame: Transcribed text or LLM responses
    • EndFrame: Signals end of conversation/stream
    • Control Frames: StartInterruptionFrame, StopInterruptionFrame
    • System Frames: Lifecycle management, pipeline control
    • Code: Working with different frame types (see the frame-type sketch after this list)
  3. Processors: The Building Blocks (~10 min)

    • What processors do: Transform input frames to output frames
    • Processor interface: async process_frame method
    • Examples: STT processor (Audio -> Text), LLM processor (Text -> Text)
    • Chaining: Output of one becomes input of next
    • Code: Implementing a basic processor (see the processor sketch after this list)
  4. Pipelines: Composing Processors (~8 min)

    • Pipeline construction: List of processors in order
    • Frame routing: How frames flow through
    • Parallel pipelines: Multiple processing paths
    • Error handling: What happens when processor fails
    • Code: Building a complete voice pipeline (see the pipeline sketch after this list)
  5. Transport Abstraction (~7 min)

    • Transport role: Audio I/O to/from the outside world
    • Daily WebRTC: Browser-based realtime communication
    • FastAPI WebSocket: Custom backend integration
    • Local Audio: Microphone/speaker for testing
    • Code: Configuring different transports (the pipeline sketch after this list shows the Daily transport and where to swap it)
  6. Improve Your Skill (~5 min)

    • Reflect: What frame patterns did you learn?
    • Update .claude/skills/pipecat/SKILL.md
    • Add: Frame type guidance, processor patterns
    • Test: Does improved skill generate better code?
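
The frame-type sketch below illustrates Section 2: branching on frame type is the pattern every processor uses internally. It is a minimal sketch assuming the commonly documented pipecat-ai import path (pipecat.frames.frames) and constructors; frame class names and paths vary by release, so verify against the docs fetched in Lesson 0.

    # Sketch only: frame class names and import paths vary by pipecat-ai release.
    from pipecat.frames.frames import EndFrame, Frame, TextFrame

    def describe(frame: Frame) -> str:
        """Branch on frame type -- the same pattern processors use internally."""
        if isinstance(frame, TextFrame):
            return f"text: {frame.text!r}"           # transcription or LLM output
        if isinstance(frame, EndFrame):
            return "end of stream"                   # signals pipeline shutdown
        return f"other: {type(frame).__name__}"      # audio, control, system frames

    print(describe(TextFrame(text="hello")))
    print(describe(EndFrame()))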
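
The processor sketch below illustrates Section 3: a custom FrameProcessor that transforms TextFrames and passes everything else through. It assumes the commonly documented base-class hooks (process_frame, push_frame, FrameDirection); treat it as a sketch to check against the official docs rather than a guaranteed current API.

    # Sketch only: base-class hook names follow commonly documented pipecat-ai
    # patterns and may differ by release -- verify against your fetched docs.
    from pipecat.frames.frames import Frame, TextFrame
    from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

    class ShoutProcessor(FrameProcessor):
        """Uppercases TextFrames; passes every other frame through unchanged."""

        async def process_frame(self, frame: Frame, direction: FrameDirection):
            await super().process_frame(frame, direction)   # base class handles system frames
            if isinstance(frame, TextFrame):
                frame = TextFrame(text=frame.text.upper())  # transform the payload
            await self.push_frame(frame, direction)         # forward to the next processor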
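
The pipeline sketch below illustrates Sections 4 and 5: processors composed in order, bracketed by a transport's input and output. Service and transport class names (DeepgramSTTService, OpenAILLMService, CartesiaTTSService, DailyTransport) follow commonly documented Pipecat examples, but import paths and parameters shift between releases, and the room URL, token, and API keys are placeholders.

    # Sketch only: class names and import paths follow commonly documented
    # Pipecat examples and shift between releases. Install with something like
    # pip install "pipecat-ai[daily,deepgram,openai,cartesia]" (extras names
    # also vary). Room URL, token, and API keys below are placeholders.
    import asyncio

    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.pipeline.runner import PipelineRunner
    from pipecat.pipeline.task import PipelineTask
    from pipecat.services.cartesia import CartesiaTTSService
    from pipecat.services.deepgram import DeepgramSTTService
    from pipecat.services.openai import OpenAILLMService
    from pipecat.transports.services.daily import DailyParams, DailyTransport

    async def main():
        # Transport: swap DailyTransport for a WebSocket or local-audio transport
        # to change deployment target without touching the processors below.
        transport = DailyTransport(
            "https://example.daily.co/room",   # placeholder room URL
            None,                              # placeholder token
            "Pipecat Bot",
            DailyParams(audio_in_enabled=True, audio_out_enabled=True),
        )

        stt = DeepgramSTTService(api_key="DEEPGRAM_KEY")                       # Audio -> Text
        llm = OpenAILLMService(api_key="OPENAI_KEY", model="gpt-4o-mini")      # Text -> Text
        tts = CartesiaTTSService(api_key="CARTESIA_KEY", voice_id="VOICE_ID")  # Text -> Audio

        # Pipeline: an ordered list of processors; frames flow top to bottom.
        # Real pipelines usually add a context aggregator around the LLM;
        # omitted here to keep the shape visible.
        pipeline = Pipeline([
            transport.input(),    # audio frames in from the room
            stt,
            llm,
            tts,
            transport.output(),   # audio frames back out to the room
        ])

        await PipelineRunner().run(PipelineTask(pipeline))

    if __name__ == "__main__":
        asyncio.run(main())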

Duration Estimate: 45 minutes

Three Roles Integration (Layer 2):

AI as Teacher:

  • Skill explains frame lifecycle patterns you didn't know
  • "Control frames propagate immediately, bypassing queued frames"

AI as Student:

  • You refine skill's transport explanation based on your deployment needs
  • "Add WebSocket transport pattern for my Next.js frontend"

AI as Co-Worker:

  • Iterate on pipeline composition together
  • First attempt misses error handling -> AI suggests try/except pattern -> you validate

Try With AI Prompts:

  1. Understand the Frame Abstraction

    I'm learning Pipecat's frame-based architecture. Coming from LiveKit's
    job-based model, help me understand:

    1. What's a frame? How is it different from a LiveKit job?
    2. What frame types exist and when do I use each?
    3. How do frames flow through a pipeline?
    4. What happens when a frame reaches the end?

    Use diagrams or pseudocode to clarify the flow.

    What you're learning: Mental model translation - mapping new concepts to familiar ones.

  2. Build a Processor Chain

    Help me build a complete voice pipeline using my pipecat skill:

    Requirements:
    - Transport: Daily WebRTC (for browser testing)
    - STT: Deepgram Nova-3
    - LLM: GPT-4o-mini
    - TTS: Cartesia Sonic

    Walk me through each processor and how frames flow between them.
    After we build it, I'll test and report what works.

    What you're learning: Processor composition - building systems from modular components.

  3. Compare Transport Options

    I need to choose the right transport for my use case:

    Scenario A: Browser-based voice agent for customer support
    Scenario B: CLI tool for voice interaction during development
    Scenario C: WebSocket integration with existing FastAPI backend

    Use my pipecat skill to recommend transports for each and explain
    the tradeoffs. I'll implement one and report back.

    What you're learning: Transport selection - matching infrastructure to requirements.


Lesson 2: Multi-Provider Integration & Custom Processors

Title: Multi-Provider Integration & Custom Processors

Learning Objectives:

  • Configure multiple STT/LLM/TTS providers via Pipecat's plugin system
  • Integrate speech-to-speech models (OpenAI Realtime, Gemini Live, Nova Sonic)
  • Implement custom processors for domain-specific transformations
  • Finalize pipecat skill for production use

Stage: Layer 2 (AI Collaboration) + Layer 3 (Intelligence Design)

CEFR Proficiency: B1-B2

New Concepts (count: 3):

  1. Provider plugins (40+ integrations)
  2. S2S model integration
  3. Custom processor implementation

Cognitive Load Validation: 3 concepts <= 10 limit (B1-B2) -> WITHIN LIMIT

Maps to Evals: #5 (Provider Integration), #6 (S2S Integration), #7 (Custom Processors)

Key Sections:

  1. The Plugin Ecosystem (~8 min)

    • Pipecat's 40+ provider integrations
    • Plugin categories: STT, LLM, TTS, Transport, Vision
    • Installation: pip install pipecat-ai[provider]
    • Provider comparison: Latency, cost, quality tradeoffs
    • Table: Key providers and their strengths
  2. Swapping Providers (~10 min)

    • The modular advantage: Change one processor, keep the pipeline
    • STT providers: Deepgram, Whisper, AssemblyAI, Gladia
    • LLM providers: OpenAI, Anthropic, Google, Together, local
    • TTS providers: Cartesia, ElevenLabs, Azure, Deepgram Aura
    • Code: Switching from Deepgram to Whisper with one line (see the provider-swap sketch after this list)
  3. Speech-to-Speech Integration (~12 min)

    • Why S2S: Native voice understanding + generation
    • OpenAI Realtime: via RTVIProcessor
    • Gemini Live: via GeminiMultimodalLive
    • AWS Nova Sonic: via Nova plugin
    • When to use S2S vs cascaded pipeline
    • Code: Configuring OpenAI Realtime through Pipecat (a cascaded-vs-S2S shape sketch follows this list)
  4. Custom Processors (~10 min)

    • When to customize: Domain-specific transformations
    • Processor base class: FrameProcessor
    • Example: Sentiment analysis processor (Text -> Emotion + Text)
    • Example: Translation processor (Text -> Translated Text)
    • Example: Content filter (blocks inappropriate content)
    • Code: Implementing a custom processor (see the content-filter sketch after this list)
  5. Finalize Your Skill (~5 min)

    • Complete skill review: Does it cover all learnings?
    • Add: Provider selection guidance
    • Add: Custom processor patterns
    • Final test: Use skill to scaffold a multi-provider voice system
    • Commit: Production-ready skill artifact
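
The provider-swap sketch below illustrates Section 2's modularity claim: changing STT vendors touches one constructor call, not the pipeline. DeepgramSTTService and WhisperSTTService are commonly documented Pipecat plugins, but constructor arguments and import paths vary by release.

    # Sketch only: plugin class names follow commonly documented pipecat-ai
    # integrations; constructor arguments and import paths vary by release.
    from pipecat.services.deepgram import DeepgramSTTService
    from pipecat.services.whisper import WhisperSTTService

    def make_stt(provider: str):
        """Return an STT processor; the surrounding pipeline never changes."""
        if provider == "deepgram":
            return DeepgramSTTService(api_key="DEEPGRAM_KEY")  # hosted, low latency
        if provider == "whisper":
            return WhisperSTTService()  # local Whisper model, no API key needed
        raise ValueError(f"unknown STT provider: {provider}")

    # The swap is one line at the call site; the Pipeline([...]) built around
    # it (transport, LLM, TTS) stays exactly the same as in Lesson 1.
    stt = make_stt("whisper")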
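
The structural sketch below illustrates Section 3's cascaded-versus-S2S decision. The stt, llm, tts, and s2s_service arguments are deliberate placeholders: substitute whichever service classes your fetched docs name for your providers (OpenAI Realtime, Gemini Live, or Nova Sonic on the S2S side).

    # Sketch only: the point is the pipeline shape. stt, llm, tts, and
    # s2s_service are placeholders for whichever Pipecat service classes your
    # fetched docs name for your chosen providers.
    from pipecat.pipeline.pipeline import Pipeline

    def build_cascaded_pipeline(transport, stt, llm, tts) -> Pipeline:
        """Cascaded: audio -> text -> text -> audio, one provider per hop."""
        return Pipeline([transport.input(), stt, llm, tts, transport.output()])

    def build_s2s_pipeline(transport, s2s_service) -> Pipeline:
        """Speech-to-speech: one speech-native service replaces the STT/LLM/TTS trio."""
        return Pipeline([transport.input(), s2s_service, transport.output()])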
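
The content-filter sketch below illustrates Section 4 (and mirrors Try With AI prompt 3): a processor that injects a disclaimer ahead of sensitive user text. The keyword list and disclaimer wording are illustrative placeholders, and the base-class hooks are the same commonly documented ones assumed in Lesson 1.

    # Sketch only: base-class hooks as in Lesson 1; keywords and disclaimer
    # text are illustrative placeholders, not a real compliance policy.
    from pipecat.frames.frames import Frame, TextFrame
    from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

    SENSITIVE_KEYWORDS = {"diagnosis", "prescription", "lawsuit", "contract"}
    DISCLAIMER = "Note: I can share general information, not medical or legal advice."

    class SensitiveTopicGate(FrameProcessor):
        """Injects a disclaimer TextFrame ahead of user text on sensitive topics."""

        async def process_frame(self, frame: Frame, direction: FrameDirection):
            await super().process_frame(frame, direction)
            if isinstance(frame, TextFrame) and any(
                word in frame.text.lower() for word in SENSITIVE_KEYWORDS
            ):
                await self.push_frame(TextFrame(text=DISCLAIMER), direction)
            await self.push_frame(frame, direction)  # original frame always continues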

Duration Estimate: 45 minutes

Three Roles Integration (Layer 2 + Layer 3):

AI as Teacher:

  • Skill guides provider selection for your use case
  • "For realtime transcription, Deepgram Nova-3 has 90ms latency vs Whisper's 300ms"

AI as Student:

  • You teach skill your domain constraints
  • "I need HIPAA compliance - add provider filtering"

AI as Co-Worker:

  • Design custom processor together
  • You specify transformation logic -> AI generates code -> you validate behavior

Skill Finalization: At lesson end, students have a production-ready pipecat skill that:

  • Scaffolds voice pipelines with frame-based architecture
  • Guides provider selection across 40+ integrations
  • Supports S2S model configuration
  • Includes custom processor patterns

Try With AI Prompts:

  1. Choose the Right Providers

    I need to build a voice agent with these constraints:

    - Latency: Under 500ms total response time
    - Cost: Under $0.05 per minute
    - Quality: Natural-sounding voice, accurate transcription
    - Region: Must support EU data residency

    Use my pipecat skill to recommend:
    1. Which STT provider?
    2. Which LLM provider?
    3. Which TTS provider?

    Explain the tradeoffs and alternatives for each.

    What you're learning: Provider selection - balancing latency, cost, quality, compliance.

  2. Integrate Speech-to-Speech

    I want to try OpenAI's Realtime API through Pipecat instead of
    building my own pipeline. Help me:

    1. Configure RTVIProcessor for OpenAI Realtime
    2. Understand what I lose vs the cascaded approach
    3. Understand what I gain (latency, naturalness)
    4. Set up function calling through the S2S model

    Use my pipecat skill. I'll test and report latency numbers.

    What you're learning: S2S integration - using native voice models through framework abstraction.

  3. Build a Custom Processor

    I need a custom processor that:

    1. Receives TextFrame from STT
    2. Detects if user is asking about sensitive topics (medical, legal)
    3. If sensitive: Adds disclaimer frame before LLM response
    4. If not sensitive: Passes through unchanged

    Help me implement this using my pipecat skill. Walk through:
    - Processor class structure
    - Frame handling logic
    - Testing approach

    What you're learning: Custom processor implementation - extending Pipecat for domain needs.


IV. Skill Dependency Graph

Skill Dependencies:

Lesson 0: Build Skill (foundation)
|
Lesson 1: Frame Architecture (requires skill)
|
Lesson 2: Multi-Provider + Custom Processors (requires frame understanding)

Cross-Chapter Dependencies:

  • Requires: Chapter 79 (Voice AI Fundamentals) - architecture mental models
  • Requires: Chapter 80 (LiveKit Agents) - comparison context, voice pipeline understanding
  • Prepares for: Chapter 82 (OpenAI Realtime API) - direct API access after framework abstraction
  • Prepares for: Chapter 85 (Capstone) - production voice agent

V. Assessment Plan

Formative Assessments (During Lessons)

  • Lesson 0: Skill generation verification (skill works, matches docs)
  • Lesson 1: Pipeline code review (correct frame flow)
  • Lesson 2: Provider swap demonstration (change provider without breaking pipeline)

Summative Assessment (End of Chapter)

Chapter 81 Quiz:

  1. Frame Architecture: Explain how frames flow through processors
  2. Frame Types: When to use AudioRawFrame vs TextFrame vs EndFrame
  3. Transports: Compare Daily vs WebSocket vs Local transports
  4. Providers: How to swap STT provider without changing pipeline
  5. Custom Processors: When and how to implement custom transformations

Practical Assessment:

  • Build a voice pipeline that uses two different provider combinations
  • Implement a custom processor for your domain
  • Demonstrate transport flexibility (run same pipeline on different transports)

VI. Validation Checklist

Chapter-Level Validation:

  • Chapter type identified: TECHNICAL (SKILL-FIRST L00 Pattern)
  • Concept density analysis documented: 9 concepts across 3 lessons
  • Lesson count justified: 3 lessons (~3 concepts each, within B1-B2 limit)
  • All evals covered by lessons
  • All lessons map to at least one eval

Stage Progression Validation:

  • Lesson 0: Layer 1 + Layer 2 (skill creation with AI collaboration)
  • Lesson 1: Layer 2 (AI collaboration, skill improvement)
  • Lesson 2: Layer 2 + Layer 3 (provider integration, custom processors)
  • No premature spec-driven content (that's Chapter 85 Capstone)

Cognitive Load Validation:

  • Lesson 0: 2 concepts <= 10 (B1 limit) PASS
  • Lesson 1: 4 concepts <= 10 (B1 limit) PASS
  • Lesson 2: 3 concepts <= 10 (B1-B2 limit) PASS

L00 Pattern Requirements:

  • Lesson 0 creates skill from official documentation
  • Fresh clone of skills-lab (no state assumptions)
  • LEARNING-SPEC.md written before skill creation
  • /fetching-library-docs used for documentation
  • Skill tested and verified before proceeding
  • Each subsequent lesson TESTS and IMPROVES the skill
  • "Improve Your Skill" section in each lesson

Three Roles Validation (Layer 2 lessons):

  • Each Layer 2 lesson demonstrates AI as Teacher
  • Each Layer 2 lesson demonstrates AI as Student
  • Each Layer 2 lesson demonstrates AI as Co-Worker (convergence)

Canonical Source Validation:

  • Skills format follows .claude/skills/<name>/SKILL.md pattern
  • Lesson 0 references /fetching-library-docs for official docs
  • Provider patterns align with Pipecat plugin system

VII. File Structure

63-pipecat/
├── _category_.json # Existing
├── README.md # Chapter overview (create)
├── 00-build-pipecat-skill.md # Lesson 0: L00 pattern (create)
├── 01-frame-pipeline-architecture.md # Lesson 1 (create)
├── 02-multi-provider-integration.md # Lesson 2 (create)
└── 03-chapter-quiz.md # Assessment (create)

VIII. Summary

Chapter 81: Pipecat is a 3-lesson SKILL-FIRST technical chapter:

Lesson | Title                                          | Concepts | Duration | Evals
0      | Build Your Pipecat Skill                       | 2        | 30 min   | #1
1      | Frame-Based Pipeline Architecture              | 4        | 45 min   | #2, #3, #4
2      | Multi-Provider Integration & Custom Processors | 3        | 45 min   | #5, #6, #7

Total: 9 concepts, ~120 minutes, creates production-ready pipecat skill

Skill Output: .claude/skills/pipecat/SKILL.md - a reusable Digital FTE component grounded in official documentation.

Comparison to Chapter 80 (LiveKit):

  • Chapter 80: 4 lessons, 10 concepts, distributed architecture focus
  • Chapter 81: 3 lessons, 9 concepts, modular composition focus
  • Together: Complete voice framework toolkit for Digital FTEs