# Voice & Audio Guide

> Agent-side voice & audio - the agent speaking and generating audio during a chat. For text-to-speech *inside a deployed app*, see [app-tts](app-tts.md).

## Voice Setup (Streaming TTS)
When the user wants the agent to speak during chat, set a default voice:
1. List voices: `voice_set action="list" provider="elevenlabs"` (or `provider="openai"`)
2. Present options to the user with descriptions - help them pick based on tone, accent, gender, and use case
3. Set the chosen voice: `voice_set action="set" provider="elevenlabs" voice_id="..."`
4. The user enables speech via the S toggle in the status bar

To disable: `voice_set action="clear"`
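
As a rough sketch, the setup sequence can be written out as the literal calls the agent would issue. The helpers `render_call` and `voice_setup_calls` are hypothetical, as is the placeholder voice ID; they only format strings in the `tool key="value"` notation used in this guide and do not invoke any tool.

```python
from typing import List


def render_call(tool: str, **params: str) -> str:
    """Format a call in the `tool key="value"` notation used in this guide."""
    args = " ".join(f'{key}="{value}"' for key, value in params.items())
    return f"{tool} {args}".strip()


def voice_setup_calls(provider: str, voice_id: str) -> List[str]:
    """Return the list-then-set calls from steps 1 and 3, in order."""
    return [
        render_call("voice_set", action="list", provider=provider),
        render_call("voice_set", action="set", provider=provider, voice_id=voice_id),
    ]


# The voice ID below is a placeholder, not a real ElevenLabs voice.
for call in voice_setup_calls("elevenlabs", "HYPOTHETICAL_VOICE_ID"):
    print(call)
print(render_call("voice_set", action="clear"))  # disables the default voice
```

Step 2 (presenting options) and step 4 (the user's S toggle) happen between and after these calls and are not something the agent can script.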

### Providers
- **ElevenLabs** (default) - hundreds of voices, real-time streaming, highest quality. Each voice has a description plus labels (accent, gender, age, use case).
- **OpenAI** - 11 built-in voices, batch generation only (no streaming). Good for file generation.

### Choosing a Voice
When helping users pick a voice:
- Ask about their preference: tone (warm, professional, energetic), gender, accent
- List voices from the matching provider and highlight relevant descriptions/labels
- Suggest 3-5 options that fit; don't dump the full list
- Offer to generate a short sample with `speech_generate` so they can compare
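
One possible way to narrow a long voice list to a handful of suggestions is to rank voices by how many of the user's stated preferences their labels match. This is a minimal sketch: the record shape (`name`, `labels`) and the sample voices are assumptions for illustration, and the real output of `voice_set action="list"` may look different.

```python
from typing import Dict, List

# Assumed record shape; the actual voice list format is not specified in this guide.
Voice = Dict[str, object]


def shortlist(voices: List[Voice], preferences: Dict[str, str], limit: int = 5) -> List[Voice]:
    """Keep up to `limit` voices, ranked by the number of matching preference labels."""

    def score(voice: Voice) -> int:
        labels = voice.get("labels", {})
        return sum(
            1
            for key, wanted in preferences.items()
            if str(labels.get(key, "")).lower() == wanted.lower()
        )

    ranked = sorted(voices, key=score, reverse=True)  # stable sort: ties keep list order
    return [voice for voice in ranked if score(voice) > 0][:limit]


# Made-up sample data, not real provider voices.
sample_voices = [
    {"name": "Voice A", "labels": {"accent": "british", "gender": "female"}},
    {"name": "Voice B", "labels": {"accent": "american", "gender": "male"}},
    {"name": "Voice C", "labels": {"accent": "british", "gender": "male"}},
]
picks = shortlist(sample_voices, {"accent": "british", "gender": "female"})
print([voice["name"] for voice in picks])
```

Voices that match nothing are dropped rather than padded in, so the user sees only relevant options even when fewer than three qualify.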

## Audio File Generation
Use `speech_generate` to create audio files (saved to workspace with inline player):
- Default provider: ElevenLabs (agent's configured voice)
- Override with `provider` and `voice_id` for one-off voices
- Max 5000 characters per call
- OpenAI models: `gpt-4o-mini-tts` (default, fast), `tts-1`, `tts-1-hd` (higher quality)
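
Text longer than the 5000-character limit has to be split across several `speech_generate` calls. The sketch below shows one way to do that, preferring sentence boundaries so each audio file ends at a natural pause. The function name `split_for_speech` is hypothetical, and the sentence detection is a simple punctuation heuristic rather than anything the tool itself provides.

```python
import re
from typing import List

MAX_CHARS = 5000  # per-call limit stated in this guide


def split_for_speech(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `max_chars`, breaking at sentence ends where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split as a last resort.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


long_text = "Hello world. " * 1000
chunks = split_for_speech(long_text)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk would then go to its own `speech_generate` call, producing one audio file per chunk.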

## Sound Effects & Music
- `sound_generate` - generate sound effects from descriptions (e.g. "thunder and rain", "sci-fi laser")
- `music_generate` - generate music from prompts (e.g. "chill lo-fi beat", "epic orchestral theme")

Both save to the workspace with inline playback. Do not call `audio_play` afterward - the card is already shown.

## Transcription & Audio Processing
- `audio_transcribe` - speech-to-text from audio files
- `audio_isolate` - extract vocals from audio (remove background music/noise)