The Voice AI Technology Stack
You understand the two architectures now: Native Speech-to-Speech for premium experiences, Cascaded Pipeline for cost efficiency. But when you choose the cascaded approach, you face a new question: Which providers power each component?
The voice AI stack is modular. Each component (STT, TTS, VAD, transport) can be swapped independently. This flexibility is powerful, but it demands informed choices. The wrong STT provider adds 200ms of latency. The wrong TTS makes your agent sound robotic. The wrong transport protocol fails behind corporate firewalls.
This lesson maps the technology landscape so you can design stacks that match your requirements. You will learn the major providers, their tradeoffs, and when to use each. By the end, you will be able to specify a complete production stack with confidence.
The Component Stack Overview
Every voice agent assembles four fundamental layers:
| Layer | Function | Key Decision |
|---|---|---|
| Transport | Move audio between client and server | WebRTC vs WebSocket |
| VAD | Detect when user is speaking | Silero VAD vs alternatives |
| STT | Convert speech to text | Provider selection (Deepgram, AssemblyAI, Whisper) |
| TTS | Convert text to speech | Provider selection (Cartesia, ElevenLabs, Deepgram Aura) |
Between STT and TTS sits your LLM (GPT-4o-mini, Claude, etc.), but you already know that decision from Parts 5-6.
The insight: Frameworks like LiveKit Agents and Pipecat abstract the glue code, but you still choose the components. Understanding provider tradeoffs lets you optimize for your specific requirements.
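To make the modularity concrete, here is a framework-agnostic sketch of a stack expressed as configuration. The class and field names are hypothetical stand-ins, not a real framework API; LiveKit Agents and Pipecat expose equivalent building blocks under their own names.

```python
# Hypothetical component slots -- illustrative only, not a real framework API.
# LiveKit Agents and Pipecat provide equivalent abstractions under their own names.
from dataclasses import dataclass, replace


@dataclass
class VoicePipeline:
    """A cascaded voice agent: each slot is an independently swappable provider."""
    transport: str   # e.g. "webrtc" or "websocket"
    vad: str         # e.g. "silero"
    stt: str         # e.g. "deepgram/nova-3"
    llm: str         # e.g. "gpt-4o-mini"
    tts: str         # e.g. "cartesia/sonic-3"


# The economy stack from later in this lesson, expressed as configuration.
economy = VoicePipeline(
    transport="webrtc",
    vad="silero",
    stt="deepgram/nova-3",
    llm="gpt-4o-mini",
    tts="cartesia/sonic-3",
)

# Swapping one component leaves the rest untouched -- the core modularity win.
premium_voice = replace(economy, tts="elevenlabs/flash-v2.5")
```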
Speech-to-Text (STT) Providers
STT converts spoken audio into text that your LLM can process. The choice matters more than you might expect: a 100ms difference in STT latency is paid on every single conversational turn.
Provider Comparison
| Provider | Streaming Latency | Word Error Rate (WER) | Price per Minute | Best For |
|---|---|---|---|---|
| Deepgram Nova-3 | ~90ms | 5.26% | $0.0077 | Production speed |
| AssemblyAI Universal | ~300ms P50 | 14.5% | $0.0025 | Budget-conscious |
| OpenAI Whisper API | Higher (batch) | Variable | $0.006 | Accuracy on difficult audio |
| Gladia | ~100ms | Competitive | $0.0061 | Multi-language |
Deepgram Nova-3: The Speed Leader
Deepgram dominates production voice agents for one reason: ~90ms streaming latency. When every millisecond matters for conversational feel, Deepgram's speed advantage is decisive.
Key characteristics:
- Streaming-first architecture: Purpose-built for real-time applications
- Word Error Rate: 5.26% on standard benchmarks (competitive with Whisper)
- Language support: 36+ languages with varying accuracy
- Integration: Native support in LiveKit Agents, Pipecat, and most voice frameworks
When to choose Deepgram: Production voice agents where sub-100ms STT latency is non-negotiable. Customer support bots, sales agents, and any application where conversational responsiveness drives user experience.
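As a concrete reference point, here is a minimal sketch of streaming audio to Deepgram's websocket endpoint. The URL, query parameters, and response fields follow Deepgram's public documentation at the time of writing; verify them against current docs, and treat `stream_microphone_audio` as a placeholder for your audio source.

```python
# Minimal streaming-STT sketch against Deepgram's websocket endpoint.
# Endpoint shape and response fields are based on Deepgram's public docs at the
# time of writing -- verify against current documentation before relying on it.
import asyncio
import json
import os

import websockets  # pip install websockets


async def stream_microphone_audio(ws):
    """Placeholder: in a real agent, send 20-30ms PCM chunks from the mic here."""
    ...


async def transcribe():
    url = (
        "wss://api.deepgram.com/v1/listen"
        "?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true"
    )
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}

    # Note: newer websockets versions name this kwarg additional_headers.
    async with websockets.connect(url, extra_headers=headers) as ws:
        sender = asyncio.create_task(stream_microphone_audio(ws))
        async for message in ws:
            result = json.loads(message)
            alt = result.get("channel", {}).get("alternatives", [{}])[0]
            if alt.get("transcript"):
                # Interim results arrive fast; completed ones are marked is_final.
                print("final" if result.get("is_final") else "interim",
                      alt["transcript"])
        sender.cancel()


asyncio.run(transcribe())
```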
AssemblyAI Universal: The Budget Option
AssemblyAI trades some accuracy for the lowest price point on the market:
- ~300ms P50 latency: Slower than Deepgram, but acceptable for many use cases
- $0.0025 per minute: roughly a third of Deepgram's price
- Strong accuracy: Performs well on clear audio with standard accents
- Features: Built-in speaker diarization, sentiment analysis
When to choose AssemblyAI: Cost-sensitive applications where 300ms STT latency is acceptable. High-volume transcription, internal tools, or prototypes where budget matters more than polish.
OpenAI Whisper: The Accuracy Benchmark
Whisper set the standard for transcription accuracy but was not designed for real-time streaming:
- Higher latency: Batch-oriented architecture adds delay
- $0.006 per minute: Mid-range pricing
- Exceptional accuracy: Particularly strong on diverse accents and noisy audio
- Open weights: Can self-host for cost control at scale
When to choose Whisper: When accuracy on difficult audio (heavy accents, background noise, domain-specific terminology) outweighs latency concerns. Also consider for offline transcription or when self-hosting for cost optimization.
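If you go the self-hosting route, the open-source `openai-whisper` package makes batch transcription a few lines of Python (assuming `ffmpeg` is installed on the system path). Note this is offline transcription, not the streaming path a live agent needs:

```python
# Self-hosted Whisper via the open-source openai-whisper package
# (pip install openai-whisper; requires ffmpeg on the system path).
import whisper

# Larger checkpoints (small/medium/large) trade speed for accuracy.
model = whisper.load_model("small")

# Batch transcription: suited to offline workloads, not streaming turns.
result = model.transcribe("support_call.wav")
print(result["text"])

# Per-segment timestamps are included, useful for aligning transcripts.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```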
STT Selection Framework
| Priority | Recommended Provider |
|---|---|
| Lowest latency | Deepgram Nova-3 |
| Lowest cost | AssemblyAI Universal |
| Highest accuracy on difficult audio | OpenAI Whisper |
| Multi-language support | Gladia or Deepgram |
| Self-hosting required | Whisper (open weights) |
Text-to-Speech (TTS) Providers
TTS gives your agent a voice. The choice shapes user perception more than any other component. A robotic voice destroys trust. A natural voice builds connection.
Provider Comparison
| Provider | Model Latency | Voice Quality | Cloning | Price |
|---|---|---|---|---|
| Cartesia Sonic-3 | 40-90ms | High | No | ~$0.05/1K chars |
| ElevenLabs Flash v2.5 | ~75ms | Premium | Yes | Subscription |
| Deepgram Aura | Low | Good | No | $4.50/hr |
| PlayHT 3.0-mini | <300ms | Good | Limited | Per-character |
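Take published latency figures as starting points, not guarantees; they shift with your region, audio format, and provider load. The metric that matters most for conversational feel is time-to-first-audio-byte (TTFB), and it is easy to measure yourself. In this sketch, `synthesize_stream` is a hypothetical stand-in for whichever provider SDK you are evaluating:

```python
# Measuring time-to-first-audio-byte (TTFB) for a TTS provider.
# synthesize_stream is a hypothetical stand-in for a provider's streaming
# SDK call -- substitute the real client for whichever vendor you evaluate.
import time
from statistics import median


def ttfb_ms(synthesize_stream, text: str) -> float:
    """Return milliseconds until the first audio chunk arrives."""
    start = time.perf_counter()
    for _chunk in synthesize_stream(text):
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("provider returned no audio")


def benchmark(synthesize_stream, runs: int = 20) -> float:
    """Median TTFB over several runs; medians resist network outliers."""
    samples = [ttfb_ms(synthesize_stream, "Thanks for calling, how can I help?")
               for _ in range(runs)]
    return median(samples)
```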
Cartesia Sonic-3: The Speed-Quality Sweet Spot
Cartesia emerged as the production favorite for cascaded pipelines:
- 40-90ms latency: Fastest in the market
- High voice quality: Natural prosody, appropriate emotional tone
- Streaming-optimized: Purpose-built for real-time conversation
- Reasonable cost: ~$0.024 per minute of speech
When to choose Cartesia: Default choice for production cascaded pipelines. Delivers the best combination of speed, quality, and cost for most voice agent use cases.
ElevenLabs: The Quality Premium
ElevenLabs sets the bar for voice quality and offers unique capabilities:
- ~75ms latency: Fast, though not the fastest
- Premium quality: Industry-leading naturalness and expressiveness
- Voice cloning: Create custom voices from audio samples
- Extensive voice library: Wide selection of pre-made voices
- Higher cost: Subscription-based, more expensive per minute
When to choose ElevenLabs: When voice quality is a competitive differentiator. Brand voices, celebrity-style personas, or applications where premium audio justifies premium pricing. Also essential if you need voice cloning.
Deepgram Aura: The Unified Stack
Deepgram Aura offers TTS alongside their industry-leading STT:
- Low latency: Competitive streaming performance
- Good quality: Natural-sounding voices
- Unified billing: Single vendor for STT + TTS simplifies operations
- $4.50 per hour: Predictable pricing
When to choose Deepgram Aura: When you already use Deepgram for STT and want operational simplicity. One vendor, one bill, one support relationship.
PlayHT: The Voice Library
PlayHT offers the widest selection of voices:
- 800+ voices: Extensive multilingual library
- <300ms latency: Acceptable for many use cases
- Voice cloning: Available on higher tiers
- Per-character pricing: Scales with usage
When to choose PlayHT: Multilingual applications requiring diverse voice options. Also useful for experimentation when you need to test many voice styles.
TTS Selection Framework
| Priority | Recommended Provider |
|---|---|
| Lowest latency | Cartesia Sonic-3 |
| Highest quality | ElevenLabs |
| Voice cloning needed | ElevenLabs or PlayHT |
| Unified with Deepgram STT | Deepgram Aura |
| Maximum voice variety | PlayHT |
Voice Activity Detection (VAD)
VAD solves a problem you might not have considered: How does the agent know when to listen and when to speak?
The Turn-Taking Challenge
Conversation is a dance. Humans coordinate speaking turns through subtle cues: pauses, intonation changes, breath patterns. Without VAD, your agent either:
- Interrupts users mid-sentence (frustrating)
- Waits too long after users finish (awkward pauses)
- Processes silence as speech (wasted compute)
VAD detects speech versus non-speech in real-time, enabling smooth turn-taking.
Silero VAD: The Industry Standard
Silero VAD dominates the voice AI ecosystem for good reasons:
| Characteristic | Silero VAD |
|---|---|
| Latency | <1ms per 30ms audio chunk |
| Model size | ~2MB (runs on edge devices) |
| Accuracy | 99%+ on standard benchmarks |
| Cost | Free and open-source |
| Integration | Built into LiveKit Agents, Pipecat |
How it works: Silero VAD processes 30ms audio chunks and classifies each as speech or non-speech. At <1ms per chunk, it adds negligible latency to your pipeline.
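You can run Silero VAD directly with a few lines of PyTorch. The usage below follows the `snakers4/silero-vad` repository's documented API at the time of writing (v5 expects 512-sample chunks at 16kHz, roughly 32ms); check the repo before depending on the exact call signatures:

```python
# Running Silero VAD via torch.hub (pip install torch torchaudio).
# API follows the snakers4/silero-vad README at the time of writing.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, VADIterator, _) = utils

SAMPLE_RATE = 16000
audio = read_audio("turn.wav", sampling_rate=SAMPLE_RATE)

# Offline: speech segments for a whole recording.
segments = get_speech_timestamps(audio, model, sampling_rate=SAMPLE_RATE)
print(segments)

# Streaming: classify fixed-size chunks as they arrive, as a pipeline would.
vad = VADIterator(model, sampling_rate=SAMPLE_RATE)
CHUNK = 512  # Silero v5 expects 512-sample chunks (~32ms) at 16kHz
for i in range(0, len(audio) - CHUNK, CHUNK):
    event = vad(audio[i:i + CHUNK], return_seconds=True)
    if event:  # {'start': ...} or {'end': ...} at speech boundaries
        print(event)
```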
Why it won: The combination of speed, accuracy, and zero cost made alternatives unnecessary for most use cases. LiveKit and Pipecat both integrate Silero by default.
Beyond Acoustic VAD: Semantic Turn Detection
Acoustic VAD has a limitation: it detects sound, not meaning. A pause might be:
- The user thinking (don't interrupt)
- The end of a complete thought (respond now)
- A dramatic pause before continuing (wait)
Semantic turn detection solves this by analyzing content, not just audio.
LiveKit Agents, for example, introduced transformer-based turn detection that considers:
- What the user said (semantic completeness)
- How they said it (prosody, intonation)
- Conversation context (is a response expected?)
This reduces false interruptions and awkward pauses compared to pure acoustic VAD.
When you need semantic turn detection: Customer support where users give complex explanations. Sales calls where interruptions kill deals. Any application where getting turn-taking wrong has business consequences.
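To see why semantics help, here is a deliberately simplified sketch that asks an LLM whether a transcript reads as a finished turn. This is a toy for intuition only: a chat-completion round trip is far too slow for production turn-taking, which is why systems like LiveKit's turn detector use a small dedicated transformer instead.

```python
# Toy illustration of semantic turn detection: ask an LLM whether the
# transcript so far reads as a finished turn. Production systems use a
# dedicated lightweight model instead of a chat-completion round trip,
# which would add far too much latency.
from openai import OpenAI  # pip install openai

client = OpenAI()


def turn_is_complete(transcript: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Reply YES if the user's utterance is a complete "
                        "conversational turn, NO if they seem mid-thought."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")


# "I'd like to change my flight to" -> likely NO (mid-thought, keep waiting)
# "I'd like to change my flight."   -> likely YES (respond now)
```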
Transport Protocols: WebRTC vs WebSocket
The transport layer moves audio between the user's device and your server. The choice affects latency, reliability, and deployment complexity.
WebRTC: Built for Real-Time
WebRTC was designed for voice and video communication. It excels at low-latency, bidirectional audio:
| Advantage | Description |
|---|---|
| Low latency | 60-120ms peer-to-peer, UDP-based |
| NAT traversal | Works through firewalls with STUN/TURN |
| Echo cancellation | Built-in audio processing |
| Browser-native | No plugins required |
| Optimized codecs | Opus codec for quality + compression |
| Disadvantage | Description |
|---|---|
| Complex setup | Requires STUN/TURN server infrastructure |
| Debugging difficulty | Network issues are hard to diagnose |
| Learning curve | Team needs WebRTC expertise |
When to choose WebRTC: Production voice agents, especially browser-based. Any application where latency matters and you have (or will build) the infrastructure expertise.
WebSocket: Simpler, Slower
WebSocket upgrades an HTTP connection into a persistent bidirectional channel. It works everywhere but was not optimized for audio:
| Advantage | Description |
|---|---|
| Simple setup | Standard web server, no special infrastructure |
| Firewall-friendly | Uses port 443, passes through most firewalls |
| Easy debugging | Standard HTTP tools work |
| Quick prototyping | Works immediately with minimal code |
| Disadvantage | Description |
|---|---|
| Higher latency | TCP adds buffering, retransmission delays |
| No audio optimization | You handle echo cancellation, codecs manually |
| Server-mediated | All traffic routes through server |
When to choose WebSocket: Prototypes, internal tools, controlled environments. When you need to ship something quickly and can accept latency tradeoffs. Also useful when WebRTC is blocked by network policies.
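A minimal WebSocket transport really is a few lines of Python, which is why it wins for prototypes. This sketch assumes the client sends raw PCM frames as binary messages and that `run_pipeline` wraps your cascaded chain; note that everything WebRTC provides for free (codecs, echo cancellation, jitter buffering) becomes your problem:

```python
# Minimal WebSocket audio endpoint: the prototyping path described above.
# Assumes the client sends raw PCM frames as binary messages.
import asyncio

import websockets  # pip install websockets


async def run_pipeline(frame: bytes) -> bytes | None:
    """Placeholder for the cascaded VAD -> STT -> LLM -> TTS chain;
    returns synthesized audio when a turn completes."""
    ...


async def handle_call(ws):
    async for frame in ws:                  # binary audio frame from client
        reply = await run_pipeline(frame)
        if reply:
            await ws.send(reply)            # synthesized audio back


async def main():
    # Behind TLS on port 443 in production -- the firewall-friendliness argument.
    async with websockets.serve(handle_call, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever


asyncio.run(main())
```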
Protocol Selection Framework
| Scenario | Recommended Protocol |
|---|---|
| Production customer-facing voice | WebRTC |
| MVP/prototype | WebSocket |
| Corporate network deployment | WebSocket (often passes firewalls better) |
| Phone integration (SIP) | WebRTC (via LiveKit/Jambonz) |
| Maximum control over audio | WebRTC |
| Fastest path to working demo | WebSocket |
Migration Path
Start with WebSocket if you need to ship quickly. Migrate to WebRTC when:
- Users complain about latency
- You have infrastructure budget
- Scale demands optimization
Both LiveKit Agents and Pipecat support both protocols, making migration straightforward.
Putting It Together: The Economy Stack
You now understand each component. Here is how they combine into a production-ready, cost-effective stack:
The Economy Stack (~$0.033/minute)
| Component | Provider | Latency | Cost/min |
|---|---|---|---|
| Transport | WebRTC (LiveKit) | 60-120ms | Infrastructure |
| VAD | Silero VAD | <1ms | Free |
| STT | Deepgram Nova-3 | ~90ms | $0.0077 |
| LLM | GPT-4o-mini | 200-400ms | $0.0015 |
| TTS | Cartesia Sonic-3 | 40-90ms | $0.024 |
Total estimated cost: ~$0.033 per minute of conversation
Latency budget: 390-700ms (acceptable for most conversations)
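The arithmetic behind those numbers is worth keeping in a scratch script so you can re-run it as prices change. The LLM line assumes a modest token volume per minute at GPT-4o-mini rates; your usage will shift it:

```python
# Reproducing the economy stack's per-minute cost from its line items.
# LLM cost assumes a modest token volume per minute at GPT-4o-mini rates --
# your actual usage will shift this number.
economy_stack_per_min = {
    "stt_deepgram_nova3": 0.0077,
    "llm_gpt4o_mini":     0.0015,
    "tts_cartesia":       0.024,
    "vad_silero":         0.0,    # open source
}

per_minute = sum(economy_stack_per_min.values())
print(f"${per_minute:.4f}/min")  # $0.0332/min, i.e. ~$0.033

# Scale it out: 1,000 calls/day at 4 minutes each
daily = per_minute * 1000 * 4
print(f"${daily:,.2f}/day, ${daily * 30:,.2f}/month")
```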
When to Upgrade Components
| Symptom | Component to Upgrade | Alternative |
|---|---|---|
| Voice sounds robotic | TTS | ElevenLabs |
| Poor accent recognition | STT | Whisper |
| Awkward interruptions | VAD | Semantic turn detection |
| Latency complaints | Transport | Dedicated WebRTC infra |
| Need voice cloning | TTS | ElevenLabs |
Premium Stack Alternative (~$0.11/minute)
When cost is less important than quality:
| Component | Provider | Why |
|---|---|---|
| STT + TTS | OpenAI Realtime | Native S2S, best quality |
| Transport | WebRTC | Lowest latency |
| Fallback | Cascaded pipeline | When S2S unavailable |
The premium stack costs ~3x more but delivers noticeably better conversational feel.
Try With AI
Test your understanding of the voice technology stack by designing solutions with your AI companion.
Prompt 1: Build Your Stack
I want to build a voice agent with these requirements:
- Use case: [describe: customer support, sales, appointment scheduling, etc.]
- Priority: [cost / quality / latency - pick one primary]
- Volume: [expected calls per day]
- Budget: [max cost per minute you can afford]
Help me design the optimal stack. For each component (transport, VAD, STT,
TTS), recommend a specific provider and explain why. Calculate the total
cost per minute. Then tell me: what tradeoffs am I making with this stack?
What you're learning: Stack design methodology. You translate requirements into component selections and understand the tradeoffs each choice implies. This skill transfers to any modular architecture decision.
Prompt 2: Provider Deep Dive
I'm evaluating TTS providers for a voice Digital FTE. Compare these three
for me in depth:
1. Cartesia Sonic-3
2. ElevenLabs Flash v2.5
3. Deepgram Aura
For each provider:
- What's the unique strength that makes it the right choice?
- What specific use case would make it the obvious winner?
- What limitation would make you avoid it?
Give me real examples from production voice agents, not just abstract
comparisons. I want to understand when I'd regret choosing the wrong one.
What you're learning: Vendor evaluation beyond marketing claims. You develop intuition for when provider differences actually matter versus when they are marketing noise. This skill applies to any technology selection.
Prompt 3: WebRTC vs WebSocket Decision
I'm building a voice interface and need to decide on transport protocol.
My context:
- Platform: [Browser-based / Phone-based / Both]
- Team experience: [No WebRTC experience / Some / Expert]
- Timeline: [MVP in 2 weeks / Production in 3 months / Long-term product]
- Scale: [Demo / Hundreds of users / Thousands concurrent]
- Network environment: [Open internet / Corporate networks / Both]
Help me decide: WebRTC or WebSocket? Walk me through your reasoning.
Then tell me: If I start with one and need to switch, what's the migration
path? What will I regret if I choose wrong for my specific situation?
What you're learning: Technology selection under constraints. Real decisions involve incomplete information, team limitations, and business timelines. You practice making pragmatic choices rather than theoretically optimal ones.
Safety Note
When evaluating providers, request trials or demos before committing. Latency numbers vary based on your location and use case. Cost structures change; verify current pricing before making budget decisions. The landscape evolves rapidly; the best choice today may not be the best choice in six months.