The Voice AI Technology Stack

You understand the two architectures now: Native Speech-to-Speech for premium experiences, Cascaded Pipeline for cost efficiency. But when you choose the cascaded approach, you face a new question: Which providers power each component?

The voice AI stack is modular. Each component (STT, TTS, VAD, transport) can be swapped independently. This flexibility is powerful, but it demands informed choices. The wrong STT provider adds 200ms of latency. The wrong TTS makes your agent sound robotic. The wrong transport protocol fails behind corporate firewalls.

This lesson maps the technology landscape so you can design stacks that match your requirements. You will learn the major providers, their tradeoffs, and when to use each. By the end, you will be able to specify a complete production stack with confidence.


The Component Stack Overview

Every voice agent assembles four fundamental layers:

| Layer | Function | Key Decision |
| --- | --- | --- |
| Transport | Move audio between client and server | WebRTC vs WebSocket |
| VAD | Detect when user is speaking | Silero VAD vs alternatives |
| STT | Convert speech to text | Provider selection (Deepgram, AssemblyAI, Whisper) |
| TTS | Convert text to speech | Provider selection (Cartesia, ElevenLabs, Deepgram Aura) |

Between STT and TTS sits your LLM (GPT-4o-mini, Claude, etc.), but you already know that decision from Parts 5-6.

The insight: Frameworks like LiveKit Agents and Pipecat abstract the glue code, but you still choose the components. Understanding provider tradeoffs lets you optimize for your specific requirements.
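To make that modularity concrete, here is a sketch of the layers as swappable interfaces. These Protocol classes are hypothetical illustrations, not the actual LiveKit Agents or Pipecat APIs; the point is that any provider satisfying a layer's contract can be dropped in without touching the rest.

```python
from typing import Callable, Protocol

class VAD(Protocol):
    def is_speech(self, chunk: bytes) -> bool: ...

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

def run_turn(vad: VAD, stt: STT, llm: Callable[[str], str], tts: TTS,
             chunks: list[bytes]) -> bytes:
    """One cascaded turn: gate audio, transcribe, think, speak."""
    speech = b"".join(c for c in chunks if vad.is_speech(c))  # VAD gates the audio
    user_text = stt.transcribe(speech)                        # STT: speech -> text
    reply = llm(user_text)                                    # LLM: text -> text
    return tts.synthesize(reply)                              # TTS: text -> speech
```

Swapping Deepgram for AssemblyAI, or Cartesia for ElevenLabs, changes only the object you pass in.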


Speech-to-Text (STT) Providers

STT converts spoken audio into text that your LLM can process. The choice matters more than you might expect: a 100ms difference in STT latency is paid on every conversational turn.

Provider Comparison

| Provider | Streaming Latency | Word Error Rate (WER) | Price per Minute | Best For |
| --- | --- | --- | --- | --- |
| Deepgram Nova-3 | ~90ms | 5.26% | $0.0077 | Production speed |
| AssemblyAI Universal | ~300ms P50 | 14.5% | $0.0025 | Budget-conscious |
| OpenAI Whisper API | Higher (batch) | Variable | $0.006 | Accuracy on specific accents |
| Gladia | ~100ms | Competitive | $0.0061 | Multi-language |

Deepgram Nova-3: The Speed Leader

Deepgram dominates production voice agents for one reason: ~90ms streaming latency. When every millisecond matters for conversational feel, Deepgram's speed advantage is decisive.

Key characteristics:

  • Streaming-first architecture: Purpose-built for real-time applications
  • Word Error Rate: 5.26% on standard benchmarks (competitive with Whisper)
  • Language support: 36+ languages with varying accuracy
  • Integration: Native support in LiveKit Agents, Pipecat, and most voice frameworks

When to choose Deepgram: Production voice agents where sub-100ms STT latency is non-negotiable. Customer support bots, sales agents, and any application where conversational responsiveness drives user experience.
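For a feel of what "streaming-first" means in practice, here is a bare-bones client for Deepgram's live WebSocket endpoint. The URL, query parameters, and JSON response shape follow Deepgram's public documentation at the time of writing; treat them as assumptions and verify against the current docs before relying on this.

```python
import asyncio, json, os
import websockets  # pip install websockets

# Endpoint and params per Deepgram's live-streaming docs (verify current names).
DG_URL = ("wss://api.deepgram.com/v1/listen"
          "?model=nova-3&encoding=linear16&sample_rate=16000")

async def stream_transcripts(audio_chunks):
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # 'additional_headers' in websockets >= 14; 'extra_headers' on older versions.
    async with websockets.connect(DG_URL, additional_headers=headers) as ws:
        async def send_audio():
            for chunk in audio_chunks:                 # 20-50ms PCM frames
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))
        async def read_results():
            async for msg in ws:
                alt = json.loads(msg).get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    print(alt["transcript"])           # interim and final results
        await asyncio.gather(send_audio(), read_results())
```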

AssemblyAI Universal: The Budget Option

AssemblyAI offers strong accuracy at the lowest price point in the market:

  • ~300ms P50 latency: Slower than Deepgram, but acceptable for many use cases
  • $0.0025 per minute: 3x cheaper than Deepgram
  • Strong accuracy: Performs well on clear audio with standard accents
  • Features: Built-in speaker diarization, sentiment analysis

When to choose AssemblyAI: Cost-sensitive applications where 300ms STT latency is acceptable. High-volume transcription, internal tools, or prototypes where budget matters more than polish.

OpenAI Whisper: The Accuracy Benchmark

Whisper set the standard for transcription accuracy but was not designed for real-time streaming:

  • Higher latency: Batch-oriented architecture adds delay
  • $0.006 per minute: Mid-range pricing
  • Exceptional accuracy: Particularly strong on diverse accents and noisy audio
  • Open weights: Can self-host for cost control at scale

When to choose Whisper: When accuracy on difficult audio (heavy accents, background noise, domain-specific terminology) outweighs latency concerns. Also consider for offline transcription or when self-hosting for cost optimization.

STT Selection Framework

| Priority | Recommended Provider |
| --- | --- |
| Lowest latency | Deepgram Nova-3 |
| Lowest cost | AssemblyAI Universal |
| Highest accuracy on difficult audio | OpenAI Whisper |
| Multi-language support | Gladia or Deepgram |
| Self-hosting required | Whisper (open weights) |
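Encoded as data, this framework makes the provider a one-line configuration change (the keys here are illustrative):

```python
# The selection table above as a lookup -- the seed of a stack config.
STT_BY_PRIORITY = {
    "lowest_latency": "Deepgram Nova-3",
    "lowest_cost": "AssemblyAI Universal",
    "difficult_audio": "OpenAI Whisper",
    "multi_language": "Gladia",              # Deepgram is the runner-up
    "self_hosting": "Whisper (open weights)",
}

print(STT_BY_PRIORITY["lowest_latency"])     # -> Deepgram Nova-3
```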

Text-to-Speech (TTS) Providers

TTS gives your agent a voice. The choice shapes user perception more than any other component. A robotic voice destroys trust. A natural voice builds connection.

Provider Comparison

| Provider | Model Latency | Voice Quality | Cloning | Price |
| --- | --- | --- | --- | --- |
| Cartesia Sonic-3 | 40-90ms | High | No | ~$0.05/1K chars |
| ElevenLabs Flash v2.5 | ~75ms | Premium | Yes | Subscription |
| Deepgram Aura | Low | Good | No | $4.50/hr |
| PlayHT 3.0-mini | <300ms | Good | Limited | Per-character |

Cartesia Sonic-3: The Speed-Quality Sweet Spot

Cartesia emerged as the production favorite for cascaded pipelines:

  • 40-90ms latency: Fastest in the market
  • High voice quality: Natural prosody, appropriate emotional tone
  • Streaming-optimized: Purpose-built for real-time conversation
  • Reasonable cost: ~$0.024 per minute of speech

When to choose Cartesia: Default choice for production cascaded pipelines. Delivers the best combination of speed, quality, and cost for most voice agent use cases.

ElevenLabs: The Quality Premium

ElevenLabs sets the bar for voice quality and offers unique capabilities:

  • ~75ms latency: Fast, though not the fastest
  • Premium quality: Industry-leading naturalness and expressiveness
  • Voice cloning: Create custom voices from audio samples
  • Extensive voice library: Wide selection of pre-made voices
  • Higher cost: Subscription-based, more expensive per minute

When to choose ElevenLabs: When voice quality is a competitive differentiator. Brand voices, celebrity-style personas, or applications where premium audio justifies premium pricing. Also essential if you need voice cloning.

Deepgram Aura: The Unified Stack

Deepgram Aura offers TTS alongside their industry-leading STT:

  • Low latency: Competitive streaming performance
  • Good quality: Natural-sounding voices
  • Unified billing: Single vendor for STT + TTS simplifies operations
  • $4.50 per hour: Predictable pricing

When to choose Deepgram Aura: When you already use Deepgram for STT and want operational simplicity. One vendor, one bill, one support relationship.

PlayHT: The Voice Library

PlayHT offers the widest selection of voices:

  • 800+ voices: Extensive multilingual library
  • <300ms latency: Acceptable for many use cases
  • Voice cloning: Available on higher tiers
  • Per-character pricing: Scales with usage

When to choose PlayHT: Multilingual applications requiring diverse voice options. Also useful for experimentation when you need to test many voice styles.

TTS Selection Framework

| Priority | Recommended Provider |
| --- | --- |
| Lowest latency | Cartesia Sonic-3 |
| Highest quality | ElevenLabs |
| Voice cloning needed | ElevenLabs or PlayHT |
| Unified with Deepgram STT | Deepgram Aura |
| Maximum voice variety | PlayHT |

Voice Activity Detection (VAD)

VAD solves a problem you might not have considered: How does the agent know when to listen and when to speak?

The Turn-Taking Challenge

Conversation is a dance. Humans coordinate speaking turns through subtle cues: pauses, intonation changes, breath patterns. Without VAD, your agent either:

  • Interrupts users mid-sentence (frustrating)
  • Waits too long after users finish (awkward pauses)
  • Processes silence as speech (wasted compute)

VAD detects speech versus non-speech in real-time, enabling smooth turn-taking.

Silero VAD: The Industry Standard

Silero VAD dominates the voice AI ecosystem for good reasons:

| Characteristic | Silero VAD |
| --- | --- |
| Latency | <1ms per 30ms audio chunk |
| Model size | ~2MB (runs on edge devices) |
| Accuracy | 99%+ on standard benchmarks |
| Cost | Free and open-source |
| Integration | Built into LiveKit Agents, Pipecat |

How it works: Silero VAD processes 30ms audio chunks and classifies each as speech or non-speech. At <1ms per chunk, it adds negligible latency to your pipeline.

Why it won: The combination of speed, accuracy, and zero cost made alternatives unnecessary for most use cases. LiveKit and Pipecat both integrate Silero by default.
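Using Silero directly takes only a few lines. This sketch loads it via torch.hub as the project's README documents; the 512-sample chunk size (~32ms at 16kHz) is what recent versions expect, so check the README for your version.

```python
import torch  # pip install torch

# Load per the silero-vad README (downloads the ~2MB model on first run).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

SAMPLE_RATE = 16_000
CHUNK = 512  # samples (~32ms at 16kHz); recent Silero versions require this size

def speech_probability(chunk: torch.Tensor) -> float:
    """chunk: 1-D float32 tensor of CHUNK samples in [-1, 1]."""
    return model(chunk, SAMPLE_RATE).item()

print(speech_probability(torch.zeros(CHUNK)))  # silence -> probability near 0
```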

Beyond Acoustic VAD: Semantic Turn Detection

Acoustic VAD has a limitation: it detects sound, not meaning. A pause might be:

  • The user thinking (don't interrupt)
  • The end of a complete thought (respond now)
  • A dramatic pause before continuing (wait)

Semantic turn detection solves this by analyzing content, not just audio:

LiveKit Agents introduced transformer-based turn detection that considers:

  • What the user said (semantic completeness)
  • How they said it (prosody, intonation)
  • Conversation context (is a response expected?)

This reduces false interruptions and awkward pauses compared to pure acoustic VAD.

When you need semantic turn detection: Customer support where users give complex explanations. Sales calls where interruptions kill deals. Any application where getting turn-taking wrong has business consequences.
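Conceptually, semantic turn detection combines the acoustic pause signal with a completeness score. The sketch below is purely illustrative: `classify` stands in for a dedicated model, and LiveKit's detector is a purpose-built transformer, not a prompt-based check like this.

```python
from typing import Callable

PAUSE_THRESHOLD_S = 0.5  # illustrative; real systems tune this per use case

def should_respond(pause_s: float, transcript: str,
                   classify: Callable[[str], float]) -> bool:
    """Respond only when the user paused AND the utterance looks complete."""
    if pause_s < PAUSE_THRESHOLD_S:
        return False  # acoustic signal: the user is (probably) still speaking
    # Semantic signal: a small model scores completeness of the transcript.
    return classify(transcript) >= 0.5

# A pause after "I'd like to change my" should score low (keep waiting),
# while the same pause after "I'd like to change my flight." should score high.
```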


Transport Protocols: WebRTC vs WebSocket

The transport layer moves audio between the user's device and your server. The choice affects latency, reliability, and deployment complexity.

WebRTC: Built for Real-Time

WebRTC was designed for voice and video communication. It excels at low-latency, bidirectional audio:

| Advantage | Description |
| --- | --- |
| Low latency | 60-120ms peer-to-peer, UDP-based |
| NAT traversal | Works through firewalls with STUN/TURN |
| Echo cancellation | Built-in audio processing |
| Browser-native | No plugins required |
| Optimized codecs | Opus codec for quality + compression |

| Disadvantage | Description |
| --- | --- |
| Complex setup | Requires STUN/TURN server infrastructure |
| Debugging difficulty | Network issues are hard to diagnose |
| Learning curve | Team needs WebRTC expertise |

When to choose WebRTC: Production voice agents, especially browser-based. Any application where latency matters and you have (or will build) the infrastructure expertise.

WebSocket: Simpler, Slower

WebSocket upgrades an HTTP connection into a bidirectional channel. It works everywhere but was not designed for audio:

| Advantage | Description |
| --- | --- |
| Simple setup | Standard web server, no special infrastructure |
| Firewall-friendly | Uses port 443, passes through most firewalls |
| Easy debugging | Standard HTTP tools work |
| Quick prototyping | Works immediately with minimal code |

| Disadvantage | Description |
| --- | --- |
| Higher latency | TCP adds buffering, retransmission delays |
| No audio optimization | You handle echo cancellation, codecs manually |
| Server-mediated | All traffic routes through server |

When to choose WebSocket: Prototypes, internal tools, controlled environments. When you need to ship something quickly and can accept latency tradeoffs. Also useful when WebRTC is blocked by network policies.
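"Works immediately with minimal code" is nearly literal. A complete WebSocket audio endpoint, using the third-party websockets package, fits in a dozen lines; a real agent would hand each frame to VAD/STT instead of echoing it back.

```python
import asyncio
import websockets  # pip install websockets

async def handle(ws):
    async for frame in ws:            # one message per binary audio frame
        if isinstance(frame, bytes):
            await ws.send(frame)      # placeholder: feed into VAD/STT here

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()        # run forever

if __name__ == "__main__":
    asyncio.run(main())
```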

Protocol Selection Framework

| Scenario | Recommended Protocol |
| --- | --- |
| Production customer-facing voice | WebRTC |
| MVP/prototype | WebSocket |
| Corporate network deployment | WebSocket (often passes firewalls better) |
| Phone integration (SIP) | WebRTC (via LiveKit/Jambonz) |
| Maximum control over audio | WebRTC |
| Fastest path to working demo | WebSocket |

Migration Path

Start with WebSocket if you need to ship quickly. Migrate to WebRTC when:

  • Users complain about latency
  • You have infrastructure budget
  • Scale demands optimization

Both LiveKit Agents and Pipecat support both protocols, making migration straightforward.


Putting It Together: The Economy Stack

You now understand each component. Here is how they combine into a production-ready, cost-effective stack:

The Economy Stack (~$0.033/minute)

| Component | Provider | Latency | Cost/min |
| --- | --- | --- | --- |
| Transport | WebRTC (LiveKit) | 60-120ms | Infrastructure |
| VAD | Silero VAD | <1ms | Free |
| STT | Deepgram Nova-3 | ~90ms | $0.0077 |
| LLM | GPT-4o-mini | 200-400ms | $0.0015 |
| TTS | Cartesia Sonic-3 | 40-90ms | $0.024 |

Total estimated cost: ~$0.033 per minute of conversation

Latency budget: 390-700ms (acceptable for most conversations)
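Both totals are simple sums over the table rows:

```python
# Figures from the economy-stack table above (transport/VAD contribute ~$0).
cost_per_min = {"stt": 0.0077, "llm": 0.0015, "tts": 0.024}
latency_ms = {"transport": (60, 120), "vad": (0, 0),  # VAD: <1ms, negligible
              "stt": (90, 90), "llm": (200, 400), "tts": (40, 90)}

total_cost = sum(cost_per_min.values())
lo = sum(a for a, _ in latency_ms.values())
hi = sum(b for _, b in latency_ms.values())
print(f"${total_cost:.4f}/min, {lo}-{hi}ms per turn")  # $0.0332/min, 390-700ms
```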

When to Upgrade Components

| Symptom | Component to Upgrade | Alternative |
| --- | --- | --- |
| Voice sounds robotic | TTS | ElevenLabs |
| Poor accent recognition | STT | Whisper |
| Awkward interruptions | VAD | Semantic turn detection |
| Latency complaints | Transport | Dedicated WebRTC infra |
| Need voice cloning | TTS | ElevenLabs |

Premium Stack Alternative (~$0.11/minute)

When cost is less important than quality:

| Component | Provider | Why |
| --- | --- | --- |
| STT + TTS | OpenAI Realtime | Native S2S, best quality |
| Transport | WebRTC | Lowest latency |
| Fallback | Cascaded pipeline | When S2S unavailable |

The premium stack costs ~3x more but delivers noticeably better conversational feel.


Try With AI

Test your understanding of the voice technology stack by designing solutions with your AI companion.

Prompt 1: Build Your Stack

I want to build a voice agent with these requirements:

- Use case: [describe: customer support, sales, appointment scheduling, etc.]
- Priority: [cost / quality / latency - pick one primary]
- Volume: [expected calls per day]
- Budget: [max cost per minute you can afford]

Help me design the optimal stack. For each component (transport, VAD, STT,
TTS), recommend a specific provider and explain why. Calculate the total
cost per minute. Then tell me: what tradeoffs am I making with this stack?

What you're learning: Stack design methodology. You translate requirements into component selections and understand the tradeoffs each choice implies. This skill transfers to any modular architecture decision.

Prompt 2: Provider Deep Dive

I'm evaluating TTS providers for a voice Digital FTE. Compare these three
for me in depth:

1. Cartesia Sonic-3
2. ElevenLabs Flash v2.5
3. Deepgram Aura

For each provider:
- What's the unique strength that makes it the right choice?
- What specific use case would make it the obvious winner?
- What limitation would make you avoid it?

Give me real examples from production voice agents, not just abstract
comparisons. I want to understand when I'd regret choosing the wrong one.

What you're learning: Vendor evaluation beyond marketing claims. You develop intuition for when provider differences actually matter versus when they are marketing noise. This skill applies to any technology selection.

Prompt 3: WebRTC vs WebSocket Decision

I'm building a voice interface and need to decide on transport protocol.

My context:
- Platform: [Browser-based / Phone-based / Both]
- Team experience: [No WebRTC experience / Some / Expert]
- Timeline: [MVP in 2 weeks / Production in 3 months / Long-term product]
- Scale: [Demo / Hundreds of users / Thousands concurrent]
- Network environment: [Open internet / Corporate networks / Both]

Help me decide: WebRTC or WebSocket? Walk me through your reasoning.

Then tell me: If I start with one and need to switch, what's the migration
path? What will I regret if I choose wrong for my specific situation?

What you're learning: Technology selection under constraints. Real decisions involve incomplete information, team limitations, and business timelines. You practice making pragmatic choices rather than theoretically optimal ones.

Safety Note

When evaluating providers, request trials or demos before committing. Latency numbers vary based on your location and use case. Cost structures change; verify current pricing before making budget decisions. The landscape evolves rapidly; the best choice today may not be the best choice in six months.