
Gemini Live Fundamentals

In Chapter 82, you connected directly to OpenAI's Realtime API. You mastered WebRTC negotiation, PCM16 audio at 24kHz, and native speech-to-speech processing.

Now you explore Google's approach. The Gemini Live API takes a different path: WebSocket instead of WebRTC, multimodal by design, and capabilities OpenAI does not offer. This lesson teaches you to establish your first Gemini Live connection and understand what makes it distinctive.


Gemini 2.5 Flash Native Audio: What Changes

Google announced Gemini 2.5 Flash Native Audio in May 2025. Unlike cascaded pipelines, this model processes audio directly:

Traditional Pipeline:
User Speaks → STT → Text → LLM → Text → TTS → Agent Speaks
4 conversion steps

Gemini Native Audio:
User Speaks → gemini-2.5-flash-native-audio-preview → Agent Speaks
Direct processing

The model "hears" your voice, including tone, pace, and emotion. It generates speech with matching qualities.

What Makes Gemini Live Distinctive

| Capability | Traditional Pipeline | OpenAI Realtime | Gemini Live |
| --- | --- | --- | --- |
| Native audio | No | Yes | Yes |
| Affective dialog | No | Limited | Advanced |
| Proactive audio | No | No | Yes |
| Voice + vision | No | Limited | Native |
| Voices | Varies by TTS | 10+ | 30 HD |
| Languages | Varies by STT | 10+ | 24+ |
| Live translation | External | No | 70+ language pairs |

Affective dialog: The model detects frustration, sadness, or excitement in your voice and adapts its response tone accordingly.

Proactive audio: The model decides when to speak. In smart device scenarios, it stays silent for background conversation and responds only to device-directed queries.

These capabilities transform what voice agents can do.
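
Both behaviors are opt-in at session setup. The sketch below shows roughly how they are enabled with the google-genai SDK; the field names enable_affective_dialog and proactivity are assumptions based on the preview API and may differ in your SDK version.

import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Assumed preview-API fields; verify the names against your google-genai version
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    enable_affective_dialog=True,            # adapt response tone to detected emotion
    proactivity={"proactive_audio": True},   # let the model decide whether to respond at all
)

async def start_session():
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview",
        config=config,
    ) as session:
        ...  # stream audio as shown later in this lesson

asyncio.run(start_session())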


The Live API Protocol

Gemini Live API uses WebSocket for bidirectional streaming:

Connection Flow

┌─────────────┐                          ┌─────────────────────┐
│   Client    │                          │   Gemini Live API   │
│ (Your Code) │                          │       Server        │
└──────┬──────┘                          └──────────┬──────────┘
       │                                            │
       │ 1. WebSocket connect (with API key)        │
       │ ──────────────────────────────────────────>│
       │                                            │
       │ 2. Connection accepted                     │
       │ <──────────────────────────────────────────│
       │                                            │
       │ 3. Send session config                     │
       │ ──────────────────────────────────────────>│
       │                                            │
       │ 4. Session established                     │
       │ <──────────────────────────────────────────│
       │                                            │
       │ 5. Stream audio chunks                     │
       │ ══════════════════════════════════════════>│
       │                                            │
       │ 6. Receive audio responses                 │
       │ <══════════════════════════════════════════│
       │                                            │
       │ 7. Receive transcripts, events             │
       │ <══════════════════════════════════════════│
       │                                            │

Key Differences from OpenAI Realtime

| Aspect | OpenAI Realtime | Gemini Live API |
| --- | --- | --- |
| Transport | WebRTC | WebSocket |
| Audio format | 24kHz PCM16 | 16kHz PCM16 |
| Control plane | DataChannel | Same WebSocket |
| SDP/ICE negotiation | Required | Not needed |
| Browser native | Yes (via RTCPeerConnection) | Requires audio worklet |

WebSocket is simpler to implement server-side. No SDP negotiation. No ICE candidate exchange. Connect, authenticate, stream.
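
To make the flow concrete, here is a minimal sketch of steps 1 through 4 using the websockets library directly, without the SDK. The endpoint URL and the shape of the setup payload are assumptions drawn from the public preview documentation; the google-genai SDK used later in this lesson performs this handshake for you.

import asyncio
import json
import os
import websockets  # pip install websockets

# Assumed v1beta endpoint; verify against the current Live API documentation
API_KEY = os.environ["GOOGLE_API_KEY"]
URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1beta.GenerativeService."
    f"BidiGenerateContent?key={API_KEY}"
)

async def handshake():
    async with websockets.connect(URI) as ws:        # steps 1-2: connect, connection accepted
        setup = {
            "setup": {
                "model": "models/gemini-2.5-flash-native-audio-preview",
                "generation_config": {"response_modalities": ["AUDIO"]},
            }
        }
        await ws.send(json.dumps(setup))             # step 3: send session config
        reply = await ws.recv()                      # step 4: session established message
        print("First server message:", reply[:120])

asyncio.run(handshake())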


Audio Format Requirements

Gemini Live API has different audio requirements than OpenAI:

| Parameter | Requirement |
| --- | --- |
| Sample Rate | 16,000 Hz (16kHz) |
| Bit Depth | 16-bit signed integer (PCM16) |
| Channels | Mono (1 channel) |
| Encoding | Little-endian, base64 for transport |

Why 16kHz Instead of 24kHz?

Google's speech models are optimized for 16kHz input. This is sufficient for voice intelligibility while reducing bandwidth: by the Nyquist theorem, 16kHz sampling captures frequencies up to 8kHz, which covers the range where human speech energy concentrates. At PCM16 mono, that is 32 kB/s of raw audio instead of 48 kB/s at 24kHz. Note that these requirements apply to audio you send; the model's audio responses arrive at a higher sample rate (24kHz PCM16), so play them back accordingly.

Audio Conversion

If your source uses different specifications:

import numpy as np
from scipy import signal

def convert_to_gemini_format(
    audio: np.ndarray,
    source_rate: int,
    source_channels: int
) -> bytes:
    """Convert audio to Gemini Live API format (16kHz, mono, PCM16)."""

    # Convert to mono if stereo
    if source_channels == 2:
        audio = audio.mean(axis=1)

    # Resample to 16kHz
    if source_rate != 16000:
        num_samples = int(len(audio) * 16000 / source_rate)
        audio = signal.resample(audio, num_samples)

    # Normalize to [-1, 1] range (guard against silent input)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak

    # Convert to 16-bit PCM
    audio_int16 = (audio * 32767).astype(np.int16)

    # Return as little-endian bytes
    return audio_int16.tobytes()

Output:

>>> import numpy as np
>>> # Simulate 48kHz stereo audio
>>> audio = np.random.randn(48000, 2) # 1 second at 48kHz stereo
>>> converted = convert_to_gemini_format(audio, 48000, 2)
>>> print(f"Output bytes: {len(converted)}")
Output bytes: 32000
>>> # 16000 samples * 2 bytes/sample = 32000 bytes for 1 second

Voice Selection: 30 HD Voices

Gemini Live API offers 30 HD voices across 24 languages. Each voice has distinct characteristics:

English Voices

VoiceCharacterBest For
PuckYouthful, energeticConsumer apps, gaming
CharonDeep, authoritativeBusiness, finance
KoreWarm, nurturingHealthcare, education
FenrirDynamic, expressiveCreative, entertainment
AoedeClear, professionalInformation delivery
OrbitNeutral, balancedGeneral purpose
OrusCalm, measuredSupport, guidance
VestaFriendly, approachableCustomer service

Language Coverage

| Category | Languages |
| --- | --- |
| European | English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian |
| Asian | Mandarin, Japanese, Korean, Hindi, Vietnamese, Thai, Indonesian |
| Middle Eastern | Arabic, Turkish, Hebrew |
| Other | Brazilian Portuguese, Ukrainian, Czech, Romanian |

Voice Selection Strategy

Choose voice based on:

  1. Target audience: Match formality and energy to user expectations
  2. Use case: Support calls need warmth; financial advice needs authority
  3. Language: Native voices for each language, not translated
  4. Brand alignment: Voice becomes your agent's personality

Session Configuration

Configure your session before streaming audio:

Configuration Message

session_config = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview",
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {
                        "voice_name": "Puck"  # From 30 HD voices
                    }
                }
            }
        },
        "system_instruction": {
            "parts": [{
                "text": """
                You are a helpful voice assistant for Task Manager.
                Keep responses concise - under 30 words.
                Confirm actions you take.
                """
            }]
        }
    }
}

Configuration Options

| Option | Values | Purpose |
| --- | --- | --- |
| model | gemini-2.5-flash-native-audio-preview | Model version |
| response_modalities | ["AUDIO"], ["TEXT"], ["AUDIO", "TEXT"] | Output types |
| voice_name | Puck, Charon, Kore, etc. | Voice personality |
| system_instruction | Text | Agent personality and constraints |

Turn Detection

Gemini Live handles turn detection automatically. The model uses voice activity detection to determine when you have finished speaking:

# Optional: Customize turn detection
"realtime_input_config": {
    "turn_detection": {
        "create_response": True,       # Auto-respond when user stops
        "threshold": 0.5,              # VAD sensitivity
        "silence_duration_ms": 700     # Silence before turn ends
    }
}

Complete Connection Implementation

Here is a complete implementation connecting to Gemini Live API:

import asyncio
import json
import base64
import os
from google import genai
from google.genai import types

class GeminiLiveConnection:
    """Direct connection to Gemini Live API."""

    def __init__(self):
        self.client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))
        self.session = None

    async def connect(
        self,
        voice: str = "Puck",
        instructions: str = "You are a helpful voice assistant."
    ):
        """Establish connection to Gemini Live API."""

        # Configure the model
        config = types.LiveConnectConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            ),
            system_instruction=types.Content(
                parts=[types.Part(text=instructions)]
            )
        )

        # Connect via the Live API
        async with self.client.aio.live.connect(
            model="gemini-2.5-flash-native-audio-preview",
            config=config
        ) as session:
            self.session = session
            print(f"[gemini] Connected with voice: {voice}")

            # Start the conversation loop
            await self._conversation_loop()

    async def _conversation_loop(self):
        """Handle bidirectional audio streaming."""

        # Start microphone input task
        mic_task = asyncio.create_task(self._stream_microphone())

        # Process responses from Gemini
        try:
            async for response in self.session.receive():
                await self._handle_response(response)
        except asyncio.CancelledError:
            pass
        finally:
            mic_task.cancel()

    async def _stream_microphone(self):
        """Stream microphone audio to Gemini."""
        import sounddevice as sd

        # Audio capture settings
        sample_rate = 16000
        channels = 1
        chunk_duration = 0.1  # 100ms chunks

        # Capture the running event loop so the audio callback, which runs on a
        # separate PortAudio thread, can schedule coroutines safely
        loop = asyncio.get_running_loop()

        def audio_callback(indata, frames, time_info, status):
            if status:
                print(f"[audio] {status}")

            # Convert to bytes and hand off to the event loop thread
            audio_bytes = (indata * 32767).astype('int16').tobytes()
            asyncio.run_coroutine_threadsafe(self._send_audio(audio_bytes), loop)

        # Start streaming
        with sd.InputStream(
            samplerate=sample_rate,
            channels=channels,
            dtype='float32',
            callback=audio_callback,
            blocksize=int(sample_rate * chunk_duration)
        ):
            print("[gemini] Microphone streaming started")
            while True:
                await asyncio.sleep(0.1)

    async def _send_audio(self, audio_bytes: bytes):
        """Send audio chunk to Gemini."""
        if self.session:
            await self.session.send(
                input=types.LiveClientRealtimeInput(
                    media_chunks=[
                        types.Blob(
                            mime_type="audio/pcm",
                            data=base64.b64encode(audio_bytes).decode()
                        )
                    ]
                )
            )

    async def _handle_response(self, response):
        """Handle responses from Gemini."""

        # Check for audio data
        if hasattr(response, 'data') and response.data:
            # Play audio response
            audio_data = base64.b64decode(response.data)
            await self._play_audio(audio_data)

        # Check for text transcript
        if hasattr(response, 'text') and response.text:
            print(f"[agent] {response.text}")

        # Check for turn completion (server_content may be absent or None)
        if getattr(response, 'server_content', None):
            if response.server_content.turn_complete:
                print("[gemini] Turn complete")

    async def _play_audio(self, audio_bytes: bytes):
        """Play received audio through speakers."""
        import sounddevice as sd
        import numpy as np

        # Convert from PCM16 bytes to numpy array
        audio = np.frombuffer(audio_bytes, dtype=np.int16)
        audio = audio.astype(np.float32) / 32767

        # Play through speakers (Gemini Live returns output audio at 24kHz)
        sd.play(audio, samplerate=24000)


async def main():
    """Run a voice conversation with Gemini."""
    conn = GeminiLiveConnection()

    try:
        await conn.connect(
            voice="Puck",
            instructions="""
            You are a voice assistant for Task Manager.
            Help users check, create, and complete tasks.
            Keep responses under 30 words.
            Be friendly and conversational.
            """
        )
    except KeyboardInterrupt:
        print("\n[exit] Shutting down...")


if __name__ == "__main__":
    asyncio.run(main())

Output:

[gemini] Connected with voice: Puck
[gemini] Microphone streaming started

User: "What tasks do I have today?"
[agent] You have three tasks today: review the proposal, send invoices, and prepare for the team meeting.
[gemini] Turn complete

User: "Mark the invoices as done"
[agent] Done! I've marked 'send invoices' as complete. Two tasks remaining.
[gemini] Turn complete

Event Types Reference

Gemini Live API sends various event types:

Session Events

| Event | Description |
| --- | --- |
| setup_complete | Connection established, ready for audio |
| tool_call | Function calling requested |
| tool_call_cancellation | Function call cancelled |

Audio Events

| Event | Description |
| --- | --- |
| server_content.model_turn | Model started responding |
| server_content.turn_complete | Model finished responding |
| server_content.interrupted | User interrupted model |

Content Types

| Type | Description |
| --- | --- |
| audio | Base64-encoded PCM16 audio |
| text | Transcript of audio content |
| inline_data | Image or other binary data |
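
In practice, a response handler branches on these message types. The sketch below follows the attribute names used elsewhere in this lesson (data, text, tool_call, server_content.turn_complete, server_content.interrupted); handle_tool_call, stop_playback, and enqueue_audio are hypothetical helpers you would supply.

async def dispatch(response):
    """Route one Live API server message to the appropriate handler (sketch).

    handle_tool_call, stop_playback, and enqueue_audio are placeholder helpers.
    """
    server_content = getattr(response, "server_content", None)

    if getattr(response, "tool_call", None):
        await handle_tool_call(response.tool_call)       # function calling requested
    elif server_content is not None and getattr(server_content, "interrupted", False):
        stop_playback()                                  # user barged in: drop queued audio
    elif server_content is not None and getattr(server_content, "turn_complete", False):
        print("[gemini] Turn complete")
    elif getattr(response, "data", None):
        enqueue_audio(response.data)                     # PCM16 audio chunk
    elif getattr(response, "text", None):
        print(f"[agent] {response.text}")                # transcript text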

Comparison: OpenAI vs Gemini

Now that you have seen both APIs, compare the approaches:

| Aspect | OpenAI Realtime | Gemini Live |
| --- | --- | --- |
| Transport | WebRTC (complex) | WebSocket (simpler) |
| Audio rate | 24kHz | 16kHz |
| Browser support | Native RTCPeerConnection | Requires audio worklet |
| Server implementation | Harder (ICE, DTLS) | Easier (plain WebSocket) |
| Multimodal | Audio + limited vision | Audio + full vision |
| Affective dialog | Basic | Advanced |
| Proactive audio | No | Yes |
| Voice selection | 10+ voices | 30 HD voices |
| Language support | 10+ | 24+ with translation |

Choose OpenAI when: browser-native WebRTC is critical, you are already invested in the OpenAI ecosystem, or you need mature function calling.

Choose Gemini when: Multimodal is essential, you need affective dialog, proactive audio matters, or you prefer simpler server implementation.


Try With AI

Prompt 1: Architecture Comparison

I've implemented voice agents with both OpenAI Realtime and Gemini Live API.
Help me understand the architectural differences:

1. Why does OpenAI use WebRTC while Gemini uses WebSocket?
2. What are the latency implications of each approach?
3. For a browser-based voice agent, which requires less client code?
4. My agent needs to see the user's screen while conversing. Which API?

Use diagrams to show the connection flow differences.

What you are learning: Protocol trade-offs. Understanding why each company made different architectural choices helps you select the right tool.

Prompt 2: Voice Selection Strategy

I'm building a voice agent for three scenarios:

1. Customer support for a luxury hotel (English, Spanish)
2. Financial advisor for retirees (needs to feel trustworthy)
3. Gaming companion for teenagers (needs energy)

From Gemini's 30 HD voices, help me:
1. Select appropriate voices for each scenario
2. Explain why each voice fits the audience
3. Consider language requirements
4. Suggest fallback voices if primary is unavailable

What you are learning: Voice personality design. The voice is your agent's first impression; matching it to audience expectations is critical for adoption.

Prompt 3: Migration Planning

I have an existing voice agent built with Pipecat using Deepgram STT + GPT-4 + Cartesia TTS.

I want to evaluate migrating to Gemini Live API native audio. Help me:

1. What functionality would I gain?
2. What would I lose (Pipecat features, provider flexibility)?
3. Estimate the latency difference
4. Estimate the cost difference for 10,000 4-minute calls/month
5. What's your recommendation?

My agent does customer support with occasional screen sharing requests.

What you are learning: Migration decision-making. Real projects must weigh gains against losses when adopting new technologies.


Safety Note

Direct API access means direct responsibility:

  • Content filtering: Gemini has built-in safety filters, but review your use case against Google's acceptable use policies. Some content categories may be restricted.

  • Audio persistence: Understand data retention. Google processes audio for the session; review your GDPR, HIPAA, or industry-specific obligations.

  • Rate limiting: The Live API has concurrent connection limits. Implement connection pooling for high-volume production use.

  • Cost monitoring: Native audio processing is billed by duration and token count. Implement usage tracking to avoid surprise bills; a minimal tracking sketch follows this list.
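
As a starting point for that tracking, here is a minimal sketch that accumulates streamed audio duration per session. The per-minute rate is a placeholder argument, not a published price.

import time
from dataclasses import dataclass, field

@dataclass
class SessionUsage:
    """Accumulate streamed audio duration for one Live session (sketch)."""
    started_at: float = field(default_factory=time.monotonic)
    input_seconds: float = 0.0
    output_seconds: float = 0.0

    def add_input_chunk(self, byte_count: int, sample_rate: int = 16000) -> None:
        # PCM16 mono: 2 bytes per sample
        self.input_seconds += byte_count / (2 * sample_rate)

    def add_output_chunk(self, byte_count: int, sample_rate: int = 24000) -> None:
        self.output_seconds += byte_count / (2 * sample_rate)

    def estimated_cost(self, rate_per_minute: float) -> float:
        # rate_per_minute is a placeholder; substitute your actual billed rate
        return (self.input_seconds + self.output_seconds) / 60 * rate_per_minute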

Test thoroughly with representative conversations before exposing to real users.