
Give It a Voice

James scrolled through the heartbeat log from Lesson 8. His agent had checked three items on the morning checklist, sent a summary to WhatsApp, and gone quiet with HEARTBEAT_OK. Everything worked. But the message sat there as text, and he was thinking about his old operations team.

"When I managed the warehouse crew," he said, "half of them never read the group chat messages. They listened to voice notes while driving forklifts."

Emma looked up. "Two config lines and one plugin enable. Sixty seconds."

"Sixty seconds? I spent three hours on heartbeats yesterday."

"Voice is easier than scheduling. The hard part is knowing when NOT to speak." She stood up. "Enable it. Make it say something. When I get back, tell me what annoyed you."


You are doing exactly what James is doing. Enable voice, hear every reply as audio, and discover for yourself what annoyed Emma's previous students enough to switch modes.

Your agent acts on its own schedule now, checking tasks every 30 minutes and delivering messages on cron. It does not wait for you. But every response is text. Now it learns to speak. By the end of this lesson, your agent will send voice notes on WhatsApp, and you will understand why letting the agent choose when to speak produces better results than forcing every reply into audio.

Four Providers, One Interface

OpenClaw bundles four text-to-speech providers behind a unified interface. All produce Opus-encoded OGG audio (48kHz, 64kbps), the exact format WhatsApp uses for push-to-talk voice messages. Your agent's replies appear as playable voice notes, not file attachments.

| Provider | Key Required | Quality | Cost | Voices |
| --- | --- | --- | --- | --- |
| Microsoft Edge | No | Good | Free | 300+ neural |
| OpenAI TTS | Yes (OPENAI_API_KEY) | Excellent | ~$15/M chars | 6 |
| ElevenLabs | Yes (ELEVENLABS_API_KEY) | Premium | Free tier available | Thousands |
| MiniMax | Yes (MINIMAX_API_KEY) | Good | Free tier available | Multiple |

Microsoft Edge TTS uses the same backend as the Edge browser's "Read Aloud" feature. No API key. No signup. No cost. Start here.
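If you ever want to sanity-check a generated audio file, the OGG container format WhatsApp expects always begins with the four-byte ASCII magic OggS. A minimal check (my own helper, not part of OpenClaw):

```python
# Every OGG container (including Opus-in-OGG voice notes) starts with the
# four-byte magic b"OggS"; a quick sanity check for a generated file.

def looks_like_ogg(data: bytes) -> bool:
    return data[:4] == b"OggS"

print(looks_like_ogg(b"OggS" + b"\x00" * 24))  # True for a real OGG header
print(looks_like_ogg(b"RIFF....WAVE"))         # False: that's a WAV file
```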

One important limit: replies longer than 1,500 characters are either auto-summarized before synthesis or skipped entirely. If your agent writes long replies and you hear silence, the text exceeded this limit. You can adjust it with /tts limit 3000 on WhatsApp or check the current setting with /tts status.
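The gate behaves roughly like this sketch. The function name, arguments, and summarize hook are hypothetical illustrations, not OpenClaw's actual internals:

```python
# Sketch of the length gate described above (hypothetical helper names).

TTS_LIMIT = 1500  # the default; adjustable with `/tts limit <n>` on WhatsApp

def gate_for_tts(text, limit=TTS_LIMIT, summarize=None):
    """Return the text to synthesize, or None to skip audio entirely."""
    if len(text) <= limit:
        return text               # short enough: synthesize as-is
    if summarize is not None:
        return summarize(text)    # auto-summarize before synthesis
    return None                   # no summarizer: skip audio, you hear silence

print(gate_for_tts("Booking confirmed for 3 pm."))  # passes through unchanged
print(gate_for_tts("x" * 2000))                     # None: over the limit
```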

Enable Voice Output

Three commands:

openclaw config set messages.tts.auto always
openclaw config set messages.tts.provider microsoft
openclaw config set plugins.entries.microsoft.enabled true

Restart the gateway:

openclaw gateway restart

Before sending a test message, verify the setup on WhatsApp:

/tts status

You should see: State: enabled, Provider: microsoft (configured). If the provider shows (not configured), the plugin did not load. Run openclaw gateway restart again and recheck.

Now send a message on WhatsApp. Your agent's reply arrives as a playable voice note.

If No Voice Note Arrives

With always mode, every reply must go through TTS conversion. If the TTS pipeline is not ready (provider not configured, gateway still starting up, WhatsApp reconnecting), replies are silently dropped. Check /tts status on WhatsApp first. If it shows the provider as configured but replies still do not arrive, check the gateway log at ~/.openclaw/logs/gateway.log for errors.
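A quick way to triage that log is to pull out only the TTS-related problem lines. The log path is the one given above; the matching keywords are my own illustrative guesses, since real log wording will vary:

```python
# Surface TTS-related error/warning lines from the gateway log (sketch).
from pathlib import Path

def tts_problem_lines(log_text):
    """Return lines that mention TTS together with an error or warning."""
    hits = []
    for line in log_text.splitlines():
        lower = line.lower()
        if "tts" in lower and ("error" in lower or "warn" in lower):
            hits.append(line)
    return hits

log_path = Path.home() / ".openclaw" / "logs" / "gateway.log"
if log_path.exists():
    for line in tts_problem_lines(log_path.read_text())[-20:]:
        print(line)
```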

James stared at the voice note playing on his screen. "That's it? After the crash loop, the auth cache, the tool profiles, three hours on heartbeat configs... voice is three commands and a restart?"

"Sixty seconds," Emma said from the doorway. "I told you."

The Activation Dance

Every OpenClaw capability follows the same four steps:

  1. Bundled plugin exists (check: openclaw plugins list)
  2. Disabled by default (security: nothing auto-activates)
  3. Enable: openclaw config set plugins.entries.<id>.enabled true
  4. Configure the feature-specific settings

You first saw this in Lesson 2 (installation) and again in Lesson 6 (skills). The speech plugin follows the same dance. By Lesson 13, you will write a plugin that other people activate through this same pattern.

Verify the plugin loaded:

openclaw plugins list --verbose

Look for microsoft in the list with status loaded. If it shows disabled, the config entry was not picked up. Restart the gateway and check again.

Four TTS Modes

The messages.tts.auto setting controls who decides when the agent speaks:

| Mode | Who Decides | Behavior |
| --- | --- | --- |
| off | Nobody | Text only (default) |
| always | Config | Every reply becomes a voice note |
| inbound | Customer | Voice reply only when the customer sends a voice message |
| tagged | The agent | TTS fires only when the model includes [[tts]] in its reply |
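The decision table can be collapsed into one function. This is my own pseudo-dispatcher for illustration, not OpenClaw's implementation:

```python
# Who decides whether a reply becomes a voice note, per mode (sketch).

def should_speak(mode, inbound_was_voice, reply):
    """Return True if the reply should be synthesized to audio."""
    if mode == "off":
        return False                  # nobody: text only
    if mode == "always":
        return True                   # config: every reply is audio
    if mode == "inbound":
        return inbound_was_voice      # customer: match their modality
    if mode == "tagged":
        return "[[tts]]" in reply     # the agent: opts in via the tag
    raise ValueError(f"unknown tts mode: {mode}")

print(should_speak("inbound", True, "Done."))                   # True
print(should_speak("tagged", False, "Sea view villa [[tts]]"))  # True
```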

Why always Gets Annoying Fast

With always mode, every single reply is audio. A one-word confirmation ("Done.") becomes a voice note. A list of five items becomes a voice note. A booking confirmation with a reference number the customer needs to copy becomes a voice note. The customer cannot copy text from audio.

always mode proves the pipeline. It is not a production setting.

Why inbound Is the Smart Production Default

In inbound mode, the agent matches the customer's modality. If the customer sends text, the agent replies with text. If the customer sends a voice note, the agent replies with a voice note. No SOUL.md configuration needed. The gateway handles it automatically.

Switch to Inbound Mode

openclaw config set messages.tts.auto inbound

No gateway restart needed for this change. The gateway applies messages.tts config dynamically.

Send a text message. You get text back. Send a voice note. You get a voice note back. The agent adapts to whatever the customer prefers.

About Tagged Mode

A fourth mode, tagged, lets the agent decide when to speak by including [[tts]] tags in its replies. In theory, this is the most flexible option: voice for descriptions, text for confirmations. In practice, the [[tts]] tags often appear as literal text in the chat instead of triggering synthesis. Until this is resolved, inbound is the reliable production choice.
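The literal-tag bug is easiest to see from what a correct pipeline must do: strip the marker and keep only its signal. A minimal sketch with a hypothetical helper, not the actual fix:

```python
# Split a tagged reply into clean text plus a voice flag; the [[tts]]
# marker should never reach the customer as literal text.

def split_tts_tag(reply):
    """Return (clean_text, wants_voice)."""
    wants_voice = "[[tts]]" in reply
    clean = reply.replace("[[tts]]", "").strip()
    return clean, wants_voice

print(split_tts_tag("The villa has a private pool and sea view. [[tts]]"))
# → ('The villa has a private pool and sea view.', True)
```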

Upgrading to OpenAI TTS

Microsoft Edge proves the pipeline. For production voice quality, switch to OpenAI with one config change:

openclaw config set messages.tts.provider openai

The OpenAI provider supports an instructions field for voice character:

{
  messages: {
    tts: {
      auto: "tagged",
      provider: "openai",
      providers: {
        openai: {
          model: "gpt-4o-mini-tts",
          voice: "coral",
          instructions: "Speak in a warm, professional tone",
        },
      },
    },
  },
}

At roughly $0.015 per 1,000 characters, a short reply costs a fraction of a cent to voice.
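That estimate is easy to check yourself. At the quoted rate of about $15 per million characters, the per-character price is $0.000015:

```python
# Back-of-envelope cost for voicing a reply at ~$15 per million characters
# (the OpenAI TTS rate quoted above).
RATE_PER_CHAR = 15 / 1_000_000  # dollars per character

def tts_cost_dollars(text):
    return len(text) * RATE_PER_CHAR

print(f"${tts_cost_dollars('x' * 60):.6f}")    # a 60-character reply
print(f"${tts_cost_dollars('x' * 1000):.6f}")  # 1,000 characters: $0.015000
```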

The Modality Design Principle

Voice and text are not interchangeable. Each has strengths:

| Voice Works Best For | Text Works Best For |
| --- | --- |
| Descriptions, summaries | Reference numbers, links, code |
| Emotional, persuasive content | Lists the customer needs to copy |
| Hands-busy users (driving) | Search-friendly content |
| Long-form explanations | Short confirmations |

The inbound mode handles this automatically by matching the customer's modality: if they send a voice note, reply with a voice note. If they type, reply with text. This is the safe production default.

tagged mode goes further. The agent evaluates its own response and decides whether voice or text serves the content better. The agent becomes the UX designer.

Try With AI

Exercise 1: Hear Your Agent Speak

Enable Microsoft Edge TTS with always mode if you have not already:

openclaw config set messages.tts.auto always
openclaw config set messages.tts.provider microsoft
openclaw config set plugins.entries.microsoft.enabled true
openclaw gateway restart

Send any message on WhatsApp. Your agent's reply should arrive as a playable voice note.

What you are learning: The TTS pipeline converts text to Opus-encoded OGG and delivers it as a WhatsApp push-to-talk message. The channel adapter handles codec selection automatically. You configured three settings; the platform handled encoding, formatting, and delivery.

Exercise 2: Experience the Annoyance

With always mode still active, send these three messages in sequence:

1. Tell me about the benefits of AI agents for small businesses
2. OK
3. What is 2 + 2?

All three replies come as voice notes. Message 1 makes sense as audio. Messages 2 and 3 do not.

What you are learning: Blanket voice output degrades the user experience for short, functional replies. The right question is not "should the agent speak?" but "when should it speak?"

Exercise 3: Let the Agent Decide

Switch to tagged mode and add voice instructions to SOUL.md:

openclaw config set messages.tts.auto tagged

Add to your SOUL.md:

## Voice Output Rules

Use [[tts]] at the end of your reply when giving descriptions,
explanations, or detailed answers. Use text only for confirmations,
short answers, and anything containing numbers or links the user
might need to copy.

Send the same three messages again. Does the agent choose voice for the description and text for the short answers?

What you are learning: tagged mode delegates modality decisions to the agent. The quality of those decisions depends on the instructions in SOUL.md and the capability of the underlying model. You are designing the agent's communication style, not just its knowledge.


When Emma came back, James had his phone playing a voice note. The agent was describing a property listing in a warm, measured tone. The previous three messages in the chat were text: "Done.", "Confirmed.", and a booking reference number.

"You switched to tagged mode," Emma said. It was not a question.

"I lasted about four messages on always mode before I wanted to throw the phone." He held it up. "Every reply was audio. Even 'OK.' That is not useful."

"So what did you change?"

"Added rules to SOUL.md. Descriptions get voice. Confirmations get text. The agent picks." He paused. "It is basically the same thing I did with the warehouse crew. Voice notes for updates, text messages for part numbers."

Emma nodded slowly. "The agent is the UX designer now. Not the config file." She glanced at the caution block earlier in the lesson. "Tagged mode on secondary agents is where I am least confident. The caution in the docs is real."

James looked at the WhatsApp thread. Voice for descriptions, text for confirmations. One agent handling both. "What happens when the workload splits? Right now one agent does everything. Customer questions and my internal operations go through the same queue."

"Same problem as one receptionist handling walk-ins and phone calls at the same time," Emma said. "Lesson 10. You add a second agent."
