Voice & Audio Guide — docs.gipity.ai

Agent-side voice & audio - the agent speaking and generating audio during a chat. For text-to-speech inside a deployed app, see app-tts.

Voice Setup (Streaming TTS)

When the user wants the agent to speak during chat, set a default voice:

List voices: voice_set action="list" provider="elevenlabs" (or provider="openai")
Present options to the user with descriptions - help them pick based on tone, accent, gender, and use case
Set the chosen voice: voice_set action="set" provider="elevenlabs" voice_id="..."
The user enables speech via the S toggle in the status bar

To disable: voice_set action="clear"
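
The setup flow above can be sketched as a sequence of tool-call payloads. The tool name, actions, and parameter names come from this guide. The `tool`/`args` dict envelope, the `voice_set_call` helper, and the placeholder voice ID are assumptions made for illustration, not the actual wire format.

```python
from typing import Any


def voice_set_call(action: str, **params: Any) -> dict[str, Any]:
    """Build a hypothetical payload for the voice_set tool (envelope is assumed)."""
    allowed = {"list", "set", "clear"}
    if action not in allowed:
        raise ValueError(f"unknown voice_set action: {action!r}")
    if action == "set" and not params.get("voice_id"):
        raise ValueError("action='set' needs a voice_id")
    return {"tool": "voice_set", "args": {"action": action, **params}}


# 1. List voices for a provider.
list_call = voice_set_call("list", provider="elevenlabs")
# 2. After the user picks one, set it ("VOICE_ID_FROM_LIST" is a placeholder).
set_call = voice_set_call("set", provider="elevenlabs", voice_id="VOICE_ID_FROM_LIST")
# 3. Later, to disable the default voice.
clear_call = voice_set_call("clear")
```

The helper only validates the three actions named in this guide; the user still enables speech through the S toggle, which no tool call replaces.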

Providers

ElevenLabs (default) - hundreds of voices, real-time streaming, highest quality. Each voice has a description plus labels (accent, gender, age, use case).
OpenAI - 11 built-in voices, batch generation only (no streaming). Good for file generation.

Choosing a Voice

Ask about their preference: tone (warm, professional, energetic), gender, accent
List voices from the matching provider and highlight relevant descriptions/labels
Suggest 3-5 options that fit; don't dump the full list
Offer to generate a short sample with speech_generate so they can compare
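
One way to narrow a long voice list to a handful of suggestions is to rank voices by how many of the user's preferences their labels match. This is a minimal sketch: the `Voice` record shape, the label keys, and the `shortlist_voices` helper are assumptions, since this guide does not specify what the list action returns.

```python
from typing import TypedDict


class Voice(TypedDict):
    # Assumed record shape; the real list output may differ.
    voice_id: str
    name: str
    description: str
    labels: dict[str, str]  # e.g. {"accent": "british", "gender": "female"}


def shortlist_voices(
    voices: list[Voice], prefs: dict[str, str], limit: int = 5
) -> list[Voice]:
    """Rank voices by how many preferred labels they match; keep the top few."""

    def score(voice: Voice) -> int:
        labels = {key: value.lower() for key, value in voice["labels"].items()}
        return sum(
            1 for key, wanted in prefs.items() if labels.get(key) == wanted.lower()
        )

    matching = [voice for voice in voices if score(voice) > 0]
    # sorted() is stable, so ties keep the provider's original order.
    return sorted(matching, key=score, reverse=True)[:limit]
```

Voices that match no preference are dropped instead of padding the list, so the agent can tell the user when nothing fits and ask a follow-up question.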

Audio File Generation

Use speech_generate to create audio files (saved to workspace with inline player):

Default provider: ElevenLabs (agent's configured voice)
Override with provider and voice_id for one-off voices
Max 5000 characters per call
OpenAI models: gpt-4o-mini-tts (default, fast), tts-1, tts-1-hd (higher quality)
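
Because each call accepts at most 5000 characters, longer scripts need to be split before they are sent to speech_generate. The sketch below shows one way to do that, preferring sentence boundaries; the `split_for_speech` helper is hypothetical, and how each chunk is then passed to the tool is left out because this guide does not specify it.

```python
import re

MAX_CHARS = 5000  # per-call limit stated in this guide


def split_for_speech(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow: flush and start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence ends keeps pauses natural at chunk boundaries; the hard split is only a fallback for text with no punctuation.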

Sound Effects & Music

sound_generate - generate sound effects from descriptions (e.g. "thunder and rain", "sci-fi laser")
music_generate - generate music from prompts (e.g. "chill lo-fi beat", "epic orchestral theme")

Both save to workspace with inline playback. Do not call audio_play after - the card is already shown.

Transcription & Audio Processing

audio_transcribe - speech-to-text from audio files
audio_isolate - extract vocals from audio (remove background music/noise)