The Voice AI Technology Stack
You understand the two architectures now: Native Speech-to-Speech for premium experiences, Cascaded Pipeline for cost efficiency. But when you choose the cascaded approach, you face a new question: Which providers power each component?
The voice AI stack is modular. Each component (STT, TTS, VAD, transport) can be swapped independently. This flexibility is powerful, but it demands informed choices. The wrong STT provider adds 200ms of latency. The wrong TTS makes your agent sound robotic. The wrong transport protocol fails behind corporate firewalls.
This lesson maps the technology landscape so you can design stacks that match your requirements. You will learn the major providers, their tradeoffs, and when to use each. By the end, you will be able to specify a complete production stack with confidence.
The Component Stack Overview
Every voice agent assembles four fundamental layers:
| Layer | Function | Key Decision |
|---|---|---|
| Transport | Move audio between client and server | WebRTC vs WebSocket |
| VAD | Detect when user is speaking | Silero VAD vs alternatives |
| STT | Convert speech to text | Provider selection (Deepgram, AssemblyAI, Whisper) |
| TTS | Convert text to speech | Provider selection (Cartesia, ElevenLabs, Deepgram Aura) |
Between STT and TTS sits your LLM (GPT-4o-mini, Claude, etc.), but you already know that decision from Parts 5-6.
The insight: Frameworks like LiveKit Agents and Pipecat abstract the glue code, but you still choose the components. Understanding provider tradeoffs lets you optimize for your specific requirements.
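To make the modularity concrete, here is a framework-agnostic sketch of a stack expressed as configuration. The class and field names are hypothetical stand-ins, not a real framework API; LiveKit Agents and Pipecat expose equivalent building blocks under their own names.

```python
# Hypothetical component slots -- illustrative only, not a real framework API.
# LiveKit Agents and Pipecat provide equivalent abstractions under their own names.
from dataclasses import dataclass, replace


@dataclass
class VoicePipeline:
    """A cascaded voice agent: each slot is an independently swappable provider."""
    transport: str   # e.g. "webrtc" or "websocket"
    vad: str         # e.g. "silero"
    stt: str         # e.g. "deepgram/nova-3"
    llm: str         # e.g. "gpt-4o-mini"
    tts: str         # e.g. "cartesia/sonic-3"


# The economy stack from later in this lesson, expressed as configuration.
economy = VoicePipeline(
    transport="webrtc",
    vad="silero",
    stt="deepgram/nova-3",
    llm="gpt-4o-mini",
    tts="cartesia/sonic-3",
)

# Swapping one component leaves the rest untouched -- the core modularity win.
premium_voice = replace(economy, tts="elevenlabs/flash-v2.5")
```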
Speech-to-Text (STT) Providers
STT converts spoken audio into text that your LLM can process. The choice matters more than you might expect: a 100ms difference in STT latency is paid on every single conversational turn.
Provider Comparison
| Provider | Streaming Latency | Word Error Rate (WER) | Price per Minute | Best For |
|---|---|---|---|---|
| Deepgram Nova-3 | ~90ms | 5.26% | $0.0077 | Production speed |
| AssemblyAI Universal | ~300ms P50 | 14.5% | $0.0025 | Budget-conscious |
| OpenAI Whisper API | Higher (batch) | Variable | $0.006 | Accuracy on difficult audio |
| Gladia | ~100ms | Competitive | $0.0061 | Multi-language |
Deepgram Nova-3: The Speed Leader
Deepgram dominates production voice agents for one reason: ~90ms streaming latency. When every millisecond matters for conversational feel, Deepgram's speed advantage is decisive.
Key characteristics:
- Streaming-first architecture: Purpose-built for real-time applications
- Word Error Rate: 5.26% on standard benchmarks (competitive with Whisper)
- Language support: 36+ languages with varying accuracy
- Integration: Native support in LiveKit Agents, Pipecat, and most voice frameworks
When to choose Deepgram: Production voice agents where sub-100ms STT latency is non-negotiable. Customer support bots, sales agents, and any application where conversational responsiveness drives user experience.
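As a concrete reference point, here is a minimal sketch of streaming audio to Deepgram's websocket endpoint. The URL, query parameters, and response fields follow Deepgram's public documentation at the time of writing; verify them against current docs, and treat `stream_microphone_audio` as a placeholder for your audio source.

```python
# Minimal streaming-STT sketch against Deepgram's websocket endpoint.
# Endpoint shape and response fields are based on Deepgram's public docs at the
# time of writing -- verify against current documentation before relying on it.
import asyncio
import json
import os

import websockets  # pip install websockets


async def stream_microphone_audio(ws):
    """Placeholder: in a real agent, send 20-30ms PCM chunks from the mic here."""
    ...


async def transcribe():
    url = (
        "wss://api.deepgram.com/v1/listen"
        "?model=nova-3&encoding=linear16&sample_rate=16000&interim_results=true"
    )
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}

    # Note: newer websockets versions name this kwarg additional_headers.
    async with websockets.connect(url, extra_headers=headers) as ws:
        sender = asyncio.create_task(stream_microphone_audio(ws))
        async for message in ws:
            result = json.loads(message)
            alt = result.get("channel", {}).get("alternatives", [{}])[0]
            if alt.get("transcript"):
                # Interim results arrive fast; completed ones are marked is_final.
                print("final" if result.get("is_final") else "interim",
                      alt["transcript"])
        sender.cancel()


asyncio.run(transcribe())
```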
AssemblyAI Universal: The Budget Option
AssemblyAI trades some accuracy for the lowest price point on the market:
- ~300ms P50 latency: Slower than Deepgram, but acceptable for many use cases
- $0.0025 per minute: roughly a third of Deepgram's price
- Strong accuracy: Performs well on clear audio with standard accents
- Features: Built-in speaker diarization, sentiment analysis
When to choose AssemblyAI: Cost-sensitive applications where 300ms STT latency is acceptable. High-volume transcription, internal tools, or prototypes where budget matters more than polish.
OpenAI Whisper: The Accuracy Benchmark
Whisper set the standard for transcription accuracy but was not designed for real-time streaming:
- Higher latency: Batch-oriented architecture adds delay
- $0.006 per minute: Mid-range pricing
- Exceptional accuracy: Particularly strong on diverse accents and noisy audio
- Open weights: Can self-host for cost control at scale
When to choose Whisper: When accuracy on difficult audio (heavy accents, background noise, domain-specific terminology) outweighs latency concerns. Also consider for offline transcription or when self-hosting for cost optimization.
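If you go the self-hosting route, the open-source `openai-whisper` package makes batch transcription a few lines of Python (assuming `ffmpeg` is installed on the system path). Note this is offline transcription, not the streaming path a live agent needs:

```python
# Self-hosted Whisper via the open-source openai-whisper package
# (pip install openai-whisper; requires ffmpeg on the system path).
import whisper

# Larger checkpoints (small/medium/large) trade speed for accuracy.
model = whisper.load_model("small")

# Batch transcription: suited to offline workloads, not streaming turns.
result = model.transcribe("support_call.wav")
print(result["text"])

# Per-segment timestamps are included, useful for aligning transcripts.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```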
STT Selection Framework
| Priority | Recommended Provider |
|---|---|
| Lowest latency | Deepgram Nova-3 |
| Lowest cost | AssemblyAI Universal |
| Highest accuracy on difficult audio | OpenAI Whisper |
| Multi-language support | Gladia or Deepgram |
| Self-hosting required | Whisper (open weights) |
Text-to-Speech (TTS) Providers
TTS gives your agent a voice. The choice shapes user perception more than any other component. A robotic voice destroys trust. A natural voice builds connection.
Provider Comparison
| Provider | Model Latency | Voice Quality | Cloning | Price |
|---|---|---|---|---|
| Cartesia Sonic-3 | 40-90ms | High | No | ~$0.05/1K chars |
| ElevenLabs Flash v2.5 | ~75ms | Premium | Yes | Subscription |
| Deepgram Aura | Low | Good | No | $4.50/hr |
| PlayHT 3.0-mini | <300ms | Good | Limited | Per-character |
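Take published latency figures as starting points, not guarantees; they shift with your region, audio format, and provider load. The metric that matters most for conversational feel is time-to-first-audio-byte (TTFB), and it is easy to measure yourself. In this sketch, `synthesize_stream` is a hypothetical stand-in for whichever provider SDK you are evaluating:

```python
# Measuring time-to-first-audio-byte (TTFB) for a TTS provider.
# synthesize_stream is a hypothetical stand-in for a provider's streaming
# SDK call -- substitute the real client for whichever vendor you evaluate.
import time
from statistics import median


def ttfb_ms(synthesize_stream, text: str) -> float:
    """Return milliseconds until the first audio chunk arrives."""
    start = time.perf_counter()
    for _chunk in synthesize_stream(text):
        return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("provider returned no audio")


def benchmark(synthesize_stream, runs: int = 20) -> float:
    """Median TTFB over several runs; medians resist network outliers."""
    samples = [ttfb_ms(synthesize_stream, "Thanks for calling, how can I help?")
               for _ in range(runs)]
    return median(samples)
```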
Cartesia Sonic-3: The Speed-Quality Sweet Spot
Cartesia emerged as the production favorite for cascaded pipelines:
- 40-90ms latency: Fastest in the market
- High voice quality: Natural prosody, appropriate emotional tone
- Streaming-optimized: Purpose-built for real-time conversation
- Reasonable cost: ~$0.024 per minute of speech
When to choose Cartesia: Default choice for production cascaded pipelines. Delivers the best combination of speed, quality, and cost for most voice agent use cases.
ElevenLabs: The Quality Premium
ElevenLabs sets the bar for voice quality and offers unique capabilities:
- ~75ms latency: Fast, though not the fastest
- Premium quality: Industry-leading naturalness and expressiveness
- Voice cloning: Create custom voices from audio samples
- Extensive voice library: Wide selection of pre-made voices
- Higher cost: Subscription-based, more expensive per minute
When to choose ElevenLabs: When voice quality is a competitive differentiator. Brand voices, celebrity-style personas, or applications where premium audio justifies premium pricing. Also essential if you need voice cloning.
Deepgram Aura: The Unified Stack
Deepgram Aura offers TTS alongside their industry-leading STT:
- Low latency: Competitive streaming performance
- Good quality: Natural-sounding voices
- Unified billing: Single vendor for STT + TTS simplifies operations
- $4.50 per hour: Predictable pricing
When to choose Deepgram Aura: When you already use Deepgram for STT and want operational simplicity. One vendor, one bill, one support relationship.
PlayHT: The Voice Library
PlayHT offers the widest selection of voices:
- 800+ voices: Extensive multilingual library
- <300ms latency: Acceptable for many use cases
- Voice cloning: Available on higher tiers
- Per-character pricing: Scales with usage
When to choose PlayHT: Multilingual applications requiring diverse voice options. Also useful for experimentation when you need to test many voice styles.
TTS Selection Framework
| Priority | Recommended Provider |
|---|---|
| Lowest latency | Cartesia Sonic-3 |
| Highest quality | ElevenLabs |
| Voice cloning needed | ElevenLabs or PlayHT |
| Unified with Deepgram STT | Deepgram Aura |
| Maximum voice variety | PlayHT |
Voice Activity Detection (VAD)
VAD solves a problem you might not have considered: How does the agent know when to listen and when to speak?
The Turn-Taking Challenge
Conversation is a dance. Humans coordinate speaking turns through subtle cues: pauses, intonation changes, breath patterns. Without VAD, your agent either:
- Interrupts users mid-sentence (frustrating)
- Waits too long after users finish (awkward pauses)
- Processes silence as speech (wasted compute)
VAD detects speech versus non-speech in real-time, enabling smooth turn-taking.
Silero VAD: The Industry Standard
Silero VAD dominates the voice AI ecosystem for good reasons:
| Characteristic | Silero VAD |
|---|---|
| Latency | <1ms per 30ms audio chunk |
| Model size | ~2MB (runs on edge devices) |
| Accuracy | 99%+ on standard benchmarks |
| Cost | Free and open-source |
| Integration | Built into LiveKit Agents, Pipecat |
How it works: Silero VAD processes 30ms audio chunks and classifies each as speech or non-speech. At <1ms per chunk, it adds negligible latency to your pipeline.
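You can run Silero VAD directly with a few lines of PyTorch. The usage below follows the `snakers4/silero-vad` repository's documented API at the time of writing (v5 expects 512-sample chunks at 16kHz, roughly 32ms); check the repo before depending on the exact call signatures:

```python
# Running Silero VAD via torch.hub (pip install torch torchaudio).
# API follows the snakers4/silero-vad README at the time of writing.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, VADIterator, _) = utils

SAMPLE_RATE = 16000
audio = read_audio("turn.wav", sampling_rate=SAMPLE_RATE)

# Offline: speech segments for a whole recording.
segments = get_speech_timestamps(audio, model, sampling_rate=SAMPLE_RATE)
print(segments)

# Streaming: classify fixed-size chunks as they arrive, as a pipeline would.
vad = VADIterator(model, sampling_rate=SAMPLE_RATE)
CHUNK = 512  # Silero v5 expects 512-sample chunks (~32ms) at 16kHz
for i in range(0, len(audio) - CHUNK, CHUNK):
    event = vad(audio[i:i + CHUNK], return_seconds=True)
    if event:  # {'start': ...} or {'end': ...} at speech boundaries
        print(event)
```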
Why it won: The combination of speed, accuracy, and zero cost made alternatives unnecessary for most use cases. LiveKit and Pipecat both integrate Silero by default.
Beyond Acoustic VAD: Semantic Turn Detection
Acoustic VAD has a limitation: it detects sound, not meaning. A pause might be:
- The user thinking (don't interrupt)
- The end of a complete thought (respond now)
- A dramatic pause before continuing (wait)
Semantic turn detection solves this by analyzing content, not just audio.
LiveKit Agents, for example, introduced transformer-based turn detection that considers:
- What the user said (semantic completeness)
- How they said it (prosody, intonation)
- Conversation context (is a response expected?)
This reduces false interruptions and awkward pauses compared to pure acoustic VAD.
When you need semantic turn detection: Customer support where users give complex explanations. Sales calls where interruptions kill deals. Any application where getting turn-taking wrong has business consequences.
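To see why semantics help, here is a deliberately simplified sketch that asks an LLM whether a transcript reads as a finished turn. This is a toy for intuition only: a chat-completion round trip is far too slow for production turn-taking, which is why systems like LiveKit's turn detector use a small dedicated transformer instead.

```python
# Toy illustration of semantic turn detection: ask an LLM whether the
# transcript so far reads as a finished turn. Production systems use a
# dedicated lightweight model instead of a chat-completion round trip,
# which would add far too much latency.
from openai import OpenAI  # pip install openai

client = OpenAI()


def turn_is_complete(transcript: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Reply YES if the user's utterance is a complete "
                        "conversational turn, NO if they seem mid-thought."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")


# "I'd like to change my flight to" -> likely NO (mid-thought, keep waiting)
# "I'd like to change my flight."   -> likely YES (respond now)
```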
Transport Protocols: WebRTC vs WebSocket
The transport layer moves audio between the user's device and your server. The choice affects latency, reliability, and deployment complexity.
WebRTC: Built for Real-Time
WebRTC was designed for voice and video communication. It excels at low-latency, bidirectional audio:
| Advantage | Description |
|---|---|
| Low latency | 60-120ms peer-to-peer, UDP-based |
| NAT traversal | Works through firewalls with STUN/TURN |
| Echo cancellation | Built-in audio processing |
| Browser-native | No plugins required |
| Optimized codecs | Opus codec for quality + compression |
| Disadvantage | Description |
|---|---|
| Complex setup | Requires STUN/TURN server infrastructure |
| Debugging difficulty | Network issues are hard to diagnose |
| Learning curve | Team needs WebRTC expertise |
When to choose WebRTC: Production voice agents, especially browser-based. Any application where latency matters and you have (or will build) the infrastructure expertise.
WebSocket: Simpler, Slower
WebSocket upgrades an HTTP connection into a persistent bidirectional channel. It works everywhere but was not optimized for audio:
| Advantage | Description |
|---|---|
| Simple setup | Standard web server, no special infrastructure |
| Firewall-friendly | Uses port 443, passes through most firewalls |
| Easy debugging | Standard HTTP tools work |
| Quick prototyping | Works immediately with minimal code |
| Disadvantage | Description |
|---|---|
| Higher latency | TCP adds buffering, retransmission delays |
| No audio optimization | You handle echo cancellation, codecs manually |
| Server-mediated | All traffic routes through server |
When to choose WebSocket: Prototypes, internal tools, controlled environments. When you need to ship something quickly and can accept latency tradeoffs. Also useful when WebRTC is blocked by network policies.
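A minimal WebSocket transport really is a few lines of Python, which is why it wins for prototypes. This sketch assumes the client sends raw PCM frames as binary messages and that `run_pipeline` wraps your cascaded chain; note that everything WebRTC provides for free (codecs, echo cancellation, jitter buffering) becomes your problem:

```python
# Minimal WebSocket audio endpoint: the prototyping path described above.
# Assumes the client sends raw PCM frames as binary messages.
import asyncio

import websockets  # pip install websockets


async def run_pipeline(frame: bytes) -> bytes | None:
    """Placeholder for the cascaded VAD -> STT -> LLM -> TTS chain;
    returns synthesized audio when a turn completes."""
    ...


async def handle_call(ws):
    async for frame in ws:                  # binary audio frame from client
        reply = await run_pipeline(frame)
        if reply:
            await ws.send(reply)            # synthesized audio back


async def main():
    # Behind TLS on port 443 in production -- the firewall-friendliness argument.
    async with websockets.serve(handle_call, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever


asyncio.run(main())
```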
Protocol Selection Framework
| Scenario | Recommended Protocol |
|---|---|
| Production customer-facing voice | WebRTC |
| MVP/prototype | WebSocket |
| Corporate network deployment | WebSocket (often passes firewalls better) |
| Phone integration (SIP) | WebRTC (via LiveKit/Jambonz) |
| Maximum control over audio | WebRTC |
| Fastest path to working demo | WebSocket |
Migration Path
Start with WebSocket if you need to ship quickly. Migrate to WebRTC when:
- Users complain about latency
- You have infrastructure budget
- Scale demands optimization
Both LiveKit Agents and Pipecat support both protocols, making migration straightforward.
Putting It Together: The Economy Stack
You now understand each component. Here is how they combine into a production-ready, cost-effective stack:
The Economy Stack (~$0.033/minute)
| Component | Provider | Latency | Cost/min |
|---|---|---|---|
| Transport | WebRTC (LiveKit) | 60-120ms | Infrastructure |
| VAD | Silero VAD | <1ms | Free |
| STT | Deepgram Nova-3 | ~90ms | $0.0077 |
| LLM | GPT-4o-mini | 200-400ms | $0.0015 |
| TTS | Cartesia Sonic-3 | 40-90ms | $0.024 |
Total estimated cost: ~$0.033 per minute of conversation
Latency budget: 390-700ms (acceptable for most conversations)
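The arithmetic behind those numbers is worth keeping in a scratch script so you can re-run it as prices change. The LLM line assumes a modest token volume per minute at GPT-4o-mini rates; your usage will shift it:

```python
# Reproducing the economy stack's per-minute cost from its line items.
# LLM cost assumes a modest token volume per minute at GPT-4o-mini rates --
# your actual usage will shift this number.
economy_stack_per_min = {
    "stt_deepgram_nova3": 0.0077,
    "llm_gpt4o_mini":     0.0015,
    "tts_cartesia":       0.024,
    "vad_silero":         0.0,    # open source
}

per_minute = sum(economy_stack_per_min.values())
print(f"${per_minute:.4f}/min")  # $0.0332/min, i.e. ~$0.033

# Scale it out: 1,000 calls/day at 4 minutes each
daily = per_minute * 1000 * 4
print(f"${daily:,.2f}/day, ${daily * 30:,.2f}/month")
```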
When to Upgrade Components
| Symptom | Component to Upgrade | Alternative |
|---|---|---|
| Voice sounds robotic | TTS | ElevenLabs |
| Poor accent recognition | STT | Whisper |
| Awkward interruptions | VAD | Semantic turn detection |
| Latency complaints | Transport | Dedicated WebRTC infra |
| Need voice cloning | TTS | ElevenLabs |
Premium Stack Alternative (~$0.11/minute)
When cost is less important than quality:
| Component | Provider | Why |
|---|---|---|
| STT + TTS | OpenAI Realtime | Native S2S, best quality |
| Transport | WebRTC | Lowest latency |
| Fallback | Cascaded pipeline | When S2S unavailable |
The premium stack costs ~3x more but delivers noticeably better conversational feel.
Try With AI
Test your understanding of the voice technology stack by designing solutions with your AI companion.
Prompt 1: Build Your Stack
I want to build a voice agent with these requirements:
- Use case: [describe: customer support, sales, appointment scheduling, etc.]
- Priority: [cost / quality / latency - pick one primary]
- Volume: [expected calls per day]
- Budget: [max cost per minute you can afford]
Help me design the optimal stack. For each component (transport, VAD, STT,
TTS), recommend a specific provider and explain why. Calculate the total
cost per minute. Then tell me: what tradeoffs am I making with this stack?
What you're learning: Stack design methodology. You translate requirements into component selections and understand the tradeoffs each choice implies. This skill transfers to any modular architecture decision.
Prompt 2: Provider Deep Dive
I'm evaluating TTS providers for a voice Digital FTE. Compare these three
for me in depth:
1. Cartesia Sonic-3
2. ElevenLabs Flash v2.5
3. Deepgram Aura
For each provider:
- What's the unique strength that makes it the right choice?
- What specific use case would make it the obvious winner?
- What limitation would make you avoid it?
Give me real examples from production voice agents, not just abstract
comparisons. I want to understand when I'd regret choosing the wrong one.
What you're learning: Vendor evaluation beyond marketing claims. You develop intuition for when provider differences actually matter versus when they are marketing noise. This skill applies to any technology selection.
Prompt 3: WebRTC vs WebSocket Decision
I'm building a voice interface and need to decide on transport protocol.
My context:
- Platform: [Browser-based / Phone-based / Both]
- Team experience: [No WebRTC experience / Some / Expert]
- Timeline: [MVP in 2 weeks / Production in 3 months / Long-term product]
- Scale: [Demo / Hundreds of users / Thousands concurrent]
- Network environment: [Open internet / Corporate networks / Both]
Help me decide: WebRTC or WebSocket? Walk me through your reasoning.
Then tell me: If I start with one and need to switch, what's the migration
path? What will I regret if I choose wrong for my specific situation?
What you're learning: Technology selection under constraints. Real decisions involve incomplete information, team limitations, and business timelines. You practice making pragmatic choices rather than theoretically optimal ones.
Safety Note
When evaluating providers, request trials or demos before committing. Latency numbers vary based on your location and use case. Cost structures change; verify current pricing before making budget decisions. The landscape evolves rapidly; the best choice today may not be the best choice in six months.