Agent-side voice & audio - the agent speaking and generating audio during a chat. For text-to-speech inside a deployed app, see app-tts.
Voice Setup (Streaming TTS)
When the user wants the agent to speak during chat, set a default voice:
- List voices: voice_set action="list" provider="elevenlabs" (or provider="openai")
- Present options to the user with descriptions - help them pick based on tone, accent, gender, and use case
- Set the chosen voice: voice_set action="set" provider="elevenlabs" voice_id="..."
- The user enables speech via the S toggle in the status bar
To disable: voice_set action="clear"
Providers
- ElevenLabs (default) - hundreds of voices, real-time streaming, highest quality. Each voice has a description plus labels (accent, gender, age, use case).
- OpenAI - 11 built-in voices, batch generation only (no streaming). Good for file generation.
Choosing a Voice
When helping users pick a voice:
- Ask about their preference: tone (warm, professional, energetic), gender, accent
- List voices from the matching provider and highlight relevant descriptions/labels
- Suggest 3-5 options that fit; don't dump the full list
- Offer to generate a short sample with speech_generate so they can compare
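The shortlisting step above can be sketched in Python. This is only an illustration: the Voice record shape, the shortlist helper, and the sample voices are hypothetical, and the actual output of voice_set action="list" may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class Voice:
    # Hypothetical record shape; the real voice listing may differ.
    voice_id: str
    name: str
    description: str
    labels: dict = field(default_factory=dict)


def shortlist(voices, preferences, limit=5):
    """Return up to `limit` voices, ranked by how many preferences they match."""

    def score(voice):
        hits = 0
        for key, wanted in preferences.items():
            if voice.labels.get(key, "").lower() == wanted.lower():
                hits += 1
            elif key == "tone" and wanted.lower() in voice.description.lower():
                # Tone is assumed to appear in the free-text description.
                hits += 1
        return hits

    # sorted() is stable, so ties keep their original listing order.
    ranked = sorted(voices, key=score, reverse=True)
    return [v for v in ranked if score(v) > 0][:limit]


# Made-up sample data for illustration only.
sample_voices = [
    Voice("v1", "Ava", "Warm, friendly narrator", {"gender": "female", "accent": "british"}),
    Voice("v2", "Ben", "Energetic promo voice", {"gender": "male", "accent": "american"}),
    Voice("v3", "Cara", "Warm conversational voice", {"gender": "female", "accent": "american"}),
]
prefs = {"gender": "female", "accent": "american", "tone": "warm"}
top_two = [v.name for v in shortlist(sample_voices, prefs, limit=2)]
```

Ranking by match count rather than filtering strictly keeps near-misses in the shortlist, which is useful when no voice satisfies every preference.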
Audio File Generation
Use speech_generate to create audio files (saved to workspace with inline player):
- Default provider: ElevenLabs (agent's configured voice)
- Override with provider and voice_id for one-off voices
- Max 5000 characters per call
- OpenAI models: gpt-4o-mini-tts (default, fast), tts-1, tts-1-hd (higher quality)
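Text longer than the 5000-character limit has to be split across several speech_generate calls. A minimal sketch of one way to do that, assuming the limit counts plain characters as Python's len does; chunk_text is a hypothetical helper, not part of the tool:

```python
import re

MAX_CHARS = 5000  # per-call limit stated above


def chunk_text(text, max_chars=MAX_CHARS):
    """Split text into pieces of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = chunk_text("One two. Three four! Five six? Seven.", max_chars=20)
```

Each returned chunk would then go to its own speech_generate call. Breaking at sentence ends avoids cutting a word or phrase mid-utterance, which would be audible where the resulting files join.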
Sound Effects & Music
- sound_generate - generate sound effects from descriptions (e.g. "thunder and rain", "sci-fi laser")
- music_generate - generate music from prompts (e.g. "chill lo-fi beat", "epic orchestral theme")
Both save to workspace with inline playback. Do not call audio_play after - the card is already shown.
Transcription & Audio Processing
- audio_transcribe - speech-to-text from audio files
- audio_isolate - extract vocals from audio (remove background music/noise)