App API - Text-to-Speech Service

This is the app service a deployed app calls over HTTPS. For the agent generating speech during a chat, see tts.

TTS is available for every project with no setup needed (user_pays billing by default).

Plan availability: Speech and sound-effect generation share one 3-uses-per-month allowance on the Free plan (combined), then unlimited on Pro. When a Free-plan owner exceeds the cap, this endpoint returns 403 FORBIDDEN with an upgrade message - handle that response in the app's UI.
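
A minimal sketch of surfacing the plan cap in the UI. The 403 and 400 statuses come from this document; the helper name `ttsFailureMessage` and the message strings are illustrative assumptions. The shape of the error body is not specified here, so the sketch looks only at the status code.

```javascript
// Hypothetical helper: map a TTS response status to a user-facing message.
// Only 403 (Free-plan cap reached) and 400 (e.g. text over the provider cap)
// are described in this document; anything else is treated as a generic failure.
function ttsFailureMessage(status) {
  if (status >= 200 && status < 300) return null; // success, nothing to show
  if (status === 403) return 'Speech is unavailable: this app has reached its monthly limit.';
  if (status === 400) return 'That text could not be read aloud. Try a shorter passage.';
  return 'Speech is unavailable right now. Please try again.';
}
```

Call it with `res.status` after the POST and render the returned string next to the play button when it is not `null`.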

Use project_settings to customize (optional):

Switch billing mode (owner_pays ↔ user_pays)
Set default provider and voice

To make the billing choice ship with the app (reproduced on every deploy instead of living as out-of-band server state), declare it in a services deploy phase in gipity.yaml instead of only via project_settings - e.g. { service: tts, billing_mode: owner_pays }. See deploy and app-llm.

Billing modes only govern the deployed app's runtime calls. Direct generation during development - gipity generate speech, gipity service call tts, or the agent's own speech tools - always bills the caller (you), whatever the service's billing_mode says. Never flip a service to owner_pays just to generate assets: it's unnecessary, and while flipped the live app accepts anonymous generation on your credits.

Providers

elevenlabs: ElevenLabs (many voices - use voice_set list to discover)
openai: OpenAI (alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse)
gemini: Gemini (30 voices: Kore, Puck, Zephyr, Charon, Fenrir, Leda, Orus, Aoede, and 22 more). Multi-speaker (up to 2) and 60+ languages

Endpoints

GET /api/<PROJECT_GUID>/services/tts/voices - list available voices
POST /api/<PROJECT_GUID>/services/tts - generate speech audio

Listing Voices

GET /api/<PROJECT_GUID>/services/tts/voices?provider=elevenlabs

Returns { data: { voices: [...], provider, available_providers } }
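
As an illustration, the voices call could be wrapped like this. The base URL is taken from the client example in this document; sending `X-App-Token` on this GET is an assumption (the document only shows it on the POST), and the helper names `voicesUrl` and `listVoices` are hypothetical.

```javascript
// Base URL as used by the client example in this document.
const TTS_BASE = 'https://a.gipity.ai';

// Build the voices URL; the provider filter is optional.
function voicesUrl(projectGuid, provider) {
  const url = new URL(`/api/${encodeURIComponent(projectGuid)}/services/tts/voices`, TTS_BASE);
  if (provider) url.searchParams.set('provider', provider);
  return url.toString();
}

// Assumption: this endpoint accepts the same X-App-Token header as POST /tts.
async function listVoices(projectGuid, token, provider) {
  const res = await fetch(voicesUrl(projectGuid, provider), {
    headers: { 'X-App-Token': token },
  });
  if (!res.ok) throw new Error(`Voice list failed: HTTP ${res.status}`);
  const { data } = await res.json();
  return data.voices; // the shape of each voice entry is not specified here
}
```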

Request Format (POST /tts)

{
  "text": "Hello, welcome to our app!",
  "voice_id": "JBFqnCBsd6RMkjVDRZzb",
  "provider": "elevenlabs",
  "model": "eleven_flash_v2_5"
}

Fields:

text (required): Text to speak. Per-provider cap (see Limits): OpenAI 4,096 chars, ElevenLabs/Gemini 5,000 chars. Over the cap the request fails with a clear 400. For long text, chunk by sentence and synthesize each chunk (see "Reading long text" below) rather than sending one giant request.
voice_id: Voice to use (default depends on provider - OpenAI alloy, ElevenLabs dfeOmy6Uay63tNhyO99j / Kristen)
provider: "openai" (default - cheapest), "elevenlabs", or "gemini"
model: Provider-specific model ID (optional)
language: BCP-47 language code (Gemini only, e.g. "ja-JP", "es-ES"). 60+ languages
speakers: Multi-speaker config (Gemini only, up to 2). Array of { name, voice }. Text must use "Name: dialogue" format per line
include_timestamps: true to also get word-level timing for read-along / word highlighting (see "Read-along" below). Routes to ElevenLabs automatically (only it returns timing); pinning provider:"openai" + include_timestamps is a 400.
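
The field rules above can be checked on the client before a request is sent, which avoids a round trip that would end in a 400. This is a sketch only: `checkTtsRequest` is a hypothetical helper, and it assumes the server counts characters the same way as JavaScript's `String.length`, which this document does not confirm.

```javascript
// Per-request text caps from the Limits section.
const TEXT_CAPS = { openai: 4096, elevenlabs: 5000, gemini: 5000 };

// Hypothetical pre-flight check mirroring the field list above.
// Returns a list of problems; an empty list means the body looks sendable.
function checkTtsRequest(body) {
  const problems = [];
  // With no provider pinned, include_timestamps routes to ElevenLabs;
  // otherwise OpenAI is the default.
  const provider = body.provider || (body.include_timestamps ? 'elevenlabs' : 'openai');
  const cap = TEXT_CAPS[provider];
  if (!cap) problems.push(`unknown provider: ${provider}`);
  if (typeof body.text !== 'string' || body.text.length === 0) {
    problems.push('text is required');
  } else if (cap && body.text.length > cap) {
    problems.push(`text exceeds ${cap} chars for ${provider}`);
  }
  if (body.language && provider !== 'gemini') problems.push('language is Gemini only');
  if (body.speakers) {
    if (provider !== 'gemini') problems.push('speakers is Gemini only');
    if (!Array.isArray(body.speakers) || body.speakers.length > 2) {
      problems.push('speakers must be an array of at most 2 entries');
    }
  }
  if (body.include_timestamps && body.provider === 'openai') {
    problems.push('include_timestamps cannot be combined with provider "openai"');
  }
  return problems;
}
```

For example, `checkTtsRequest({ text: 'hi', language: 'ja-JP' })` reports one problem, because the default provider is OpenAI and `language` applies to Gemini only.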

Gemini TTS Details

30 voices: Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Algenib, Rasalgethi, Laomedeia, Achernar, Alnilam, Schedar, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat

Multi-speaker example (up to 2 speakers):

{
  "text": "Joe: Hey, how are you?\nJane: Great, thanks!",
  "provider": "gemini",
  "speakers": [
    { "name": "Joe", "voice": "Charon" },
    { "name": "Jane", "voice": "Leda" }
  ]
}
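
When the dialogue lives in app state rather than in a pre-formatted string, the request body can be assembled from it. The sketch below is one way to do that; `buildDialogueRequest` and its input shapes (`turns`, `voices`) are hypothetical, while the output follows the multi-speaker format shown above.

```javascript
// Hypothetical helper: turn structured dialogue into a Gemini multi-speaker body.
// turns:  [{ name, line }] in speaking order
// voices: { [name]: geminiVoiceName }
function buildDialogueRequest(turns, voices) {
  const names = [...new Set(turns.map(t => t.name))];
  if (names.length > 2) throw new Error('Gemini multi-speaker supports at most 2 speakers');
  for (const name of names) {
    if (!voices[name]) throw new Error(`No voice assigned for speaker "${name}"`);
  }
  return {
    // One "Name: dialogue" line per turn; newlines inside a turn are flattened
    // so they cannot be mistaken for a new speaker line.
    text: turns.map(t => `${t.name}: ${t.line.replace(/\s*\n\s*/g, ' ')}`).join('\n'),
    provider: 'gemini',
    speakers: names.map(name => ({ name, voice: voices[name] })),
  };
}
```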

Language example (Japanese):

{
  "text": "こんにちは世界",
  "provider": "gemini",
  "voice_id": "Kore",
  "language": "ja-JP"
}

Gemini returns raw PCM audio (audio/L16, 24 kHz). The platform converts it and serves it as MP3, so the response url still points to an MP3 file.

Response Format

{
  "url": "https://media.gipity.ai/med_abc12345.mp3",
  "voice_id": "JBFqnCBsd6RMkjVDRZzb",
  "provider": "elevenlabs",
  "credits_used": 5
}

The url is a permanent public CDN URL to an MP3 file.

CLI

For one-off speech during development (downloads the result to a local file), skip the HTTP call and use gipity generate speech. It writes to ./speech.mp3 by default - pass -o <path> to land the file in your source tree so it deploys.

gipity generate speech "Welcome to Gipity!" -o src/assets/sounds/intro.mp3
gipity generate speech "こんにちは世界" --provider gemini --voice Kore --language ja-JP -o src/assets/sounds/greeting.mp3

Client Code Example

const tokenRes = await fetch('https://a.gipity.ai/api/token', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ app: '<PROJECT_GUID>' })
});
const { data: { token } } = await tokenRes.json();

const res = await fetch('https://a.gipity.ai/api/<PROJECT_GUID>/services/tts', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-App-Token': token },
  body: JSON.stringify({ text: 'Welcome to the future of AI!' })
});
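if (!res.ok) {
  // e.g. 403 when a Free-plan owner is over the cap, 400 when the text is too long
  throw new Error(`TTS request failed: HTTP ${res.status}`);
}
const data = await res.json();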

// Play audio
const audio = new Audio(data.url);
audio.play();

Always show a loading state

Synthesis takes a second or more, and reading long text means several requests. Never leave the button dead - flip it to a spinner / "Loading…" the moment it's tapped and back when audio starts. A silent delay reads as "broken." (Same goes for the LLM call that produced the text.)
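
One possible shape for that behavior is a small wrapper around whatever starts the speech. Everything here is illustrative: `speakWithLoading` is a hypothetical name, `button` only needs `disabled` and `textContent` properties, and `startSpeech` stands for any async function that resolves once playback has begun.

```javascript
// Hypothetical wrapper: show a loading label from the tap until audio starts.
async function speakWithLoading(button, startSpeech) {
  const idleLabel = button.textContent;
  button.disabled = true;           // flip immediately, before any network call
  button.textContent = 'Loading…';
  try {
    await startSpeech();            // resolves when playback has started
  } finally {
    button.disabled = false;        // restore on success and on failure
    button.textContent = idleLabel;
  }
}
```

In a browser, `startSpeech` could fetch the URL and then `await audio.play()`, since `play()` returns a promise that settles once playback begins.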

Reading long text (chunk by sentence)

A single request is capped per provider (OpenAI 4,096 chars, ElevenLabs/Gemini 5,000 chars). Anything longer, such as a story, an article, or a multi-paragraph reply, must be split into sentence-sized chunks, each synthesized and played in order. This removes the ceiling entirely and lets playback start as soon as the first chunk is ready, instead of after the whole text has been synthesized:

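// Split into chunks of at most ~max chars on sentence boundaries (never mid-word).
// A single sentence longer than max is emitted whole as its own chunk.
function chunkForTts(text, max = 600) {
  // Sentences ending in . ! or ?, plus any trailing text that has no terminator.
  const sentences = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [text];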
  const chunks = []; let cur = '';
  for (const s of sentences) {
    if ((cur + s).length > max) { if (cur) chunks.push(cur.trim()); cur = s; }
    else cur += s;
  }
  if (cur.trim()) chunks.push(cur.trim());
  return chunks;
}

async function readAloud(text, appGuid, token) {
  for (const chunk of chunkForTts(text)) {
    const res = await fetch(`https://a.gipity.ai/api/${appGuid}/services/tts`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'X-App-Token': token },
      body: JSON.stringify({ text: chunk }),
    });
    if (!res.ok) throw new Error(`TTS request failed: HTTP ${res.status}`);
    const { url } = await res.json();
    const audio = new Audio(url);
    await new Promise((resolve, reject) => { audio.onended = resolve; audio.onerror = reject; audio.play().catch(reject); });
  }
}
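
The loop above requests each chunk only after the previous one has finished playing, so there can be a short gap between chunks. A possible refinement is to request the next chunk while the current one plays. The sketch below shows only that scheduling logic: `playPipelined` is a hypothetical helper, and `synthesize` and `play` are caller-supplied functions (for instance the fetch call and the `Audio` promise from the loop above).

```javascript
// Hypothetical pipeline. `synthesize(chunk)` resolves to an audio URL and
// `play(url)` resolves when that clip has finished playing.
async function playPipelined(chunks, synthesize, play) {
  if (chunks.length === 0) return;
  let pending = synthesize(chunks[0]);
  for (let i = 0; i < chunks.length; i++) {
    const url = await pending;
    // Start the next request before playback so it runs during this clip.
    pending = i + 1 < chunks.length ? synthesize(chunks[i + 1]) : null;
    // Keep a failed prefetch from surfacing as an unhandled rejection during
    // playback; the error is still thrown by `await pending` on the next pass.
    if (pending) pending.catch(() => {});
    await play(url);
  }
}
```

At most one request is in flight ahead of playback, which keeps the request rate low and avoids synthesizing chunks the listener never reaches.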

Read-along (word highlighting)

For karaoke-style highlighting of the word currently being spoken, pass include_timestamps: true. The response adds a words array of { word, start, end } (seconds). Highlight by matching the audio's currentTime against the ranges - no extra processing, no GPU job:

{
  "url": "https://media.gipity.ai/med_abc12345.mp3",
  "words": [ { "word": "Once", "start": 0.0, "end": 0.31 }, { "word": "upon", "start": 0.31, "end": 0.52 } ],
  "voice_id": "dfeOmy6Uay63tNhyO99j",
  "provider": "elevenlabs",
  "credits_used": 5
}

const { url, words } = await (await fetch(ttsUrl, { /* ...include_timestamps: true... */ })).json();
const audio = new Audio(url);
audio.ontimeupdate = () => {
  const t = audio.currentTime;
  const i = words.findIndex(w => t >= w.start && t < w.end); // -1 between words
  highlightWord(i); // toggle a class on your per-word <span>s; clear it when i is -1
};
audio.play();

Combine with chunking for long text: request each chunk with include_timestamps: true and shift each chunk's word times by the total duration of the chunks that come before it.
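
A sketch of that offset step, assuming each chunk's audio duration in seconds is known (for example from `audio.duration` once metadata has loaded). The helper name `mergeChunkTimings` and the `{ words, duration }` input shape are hypothetical.

```javascript
// Hypothetical merge step: shift each chunk's word times onto one timeline.
// chunks: [{ words: [{ word, start, end }], duration }] in playback order
function mergeChunkTimings(chunks) {
  const merged = [];
  let offset = 0; // start time of the current chunk on the combined timeline
  for (const { words, duration } of chunks) {
    for (const w of words) {
      merged.push({ word: w.word, start: w.start + offset, end: w.end + offset });
    }
    offset += duration;
  }
  return merged;
}
```

When each chunk plays in its own `Audio` element, compare the merged times against that chunk's offset plus `audio.currentTime` rather than `currentTime` alone.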

Word highlighting: which tool? Use include_timestamps (above) for on-demand TTS you generate in the app - it's instant, free, and exact for synthesized speech. The separate audio-align kit is different: it's a Gipity Jobs GPU batch job (~tens of seconds + cost per run) that force-aligns lyrics to a pre-existing audio file (e.g. a song you uploaded). Reach for the kit only when you have audio you didn't synthesize here; for read-aloud of generated text, include_timestamps is the right call.

Limits

Rate limit: 600 requests per 5-minute window (per IP)
Max text length (per request): OpenAI 4,096 chars, ElevenLabs/Gemini 5,000 chars. Over the cap → 400 with the exact limit; chunk by sentence for longer content.
Timeout: 60s
Standard RateLimit-* headers included in responses
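
The rate limit works out to an average of 2 requests per second per IP (600 requests / 300 seconds). After a rate-limited response, a client can use the headers to decide how long to wait. The sketch below assumes `RateLimit-Remaining` and `RateLimit-Reset` (seconds until the window resets) are among the headers sent; this document does not list the exact header set, and `retryDelayMs` is a hypothetical helper.

```javascript
// Read a numeric header; NaN when it is missing or empty.
function readNumberHeader(headers, name) {
  const raw = headers.get(name);
  if (raw === null || raw === undefined || raw === '') return NaN;
  return Number(raw);
}

// Hypothetical helper: how long to wait before retrying, in milliseconds.
// Assumes RateLimit-Reset is expressed as seconds until the window resets.
function retryDelayMs(headers, fallbackMs = 1000) {
  const remaining = readNumberHeader(headers, 'RateLimit-Remaining');
  const resetSeconds = readNumberHeader(headers, 'RateLimit-Reset');
  if (remaining > 0) return 0; // budget left, no need to wait
  if (Number.isFinite(resetSeconds) && resetSeconds >= 0) return resetSeconds * 1000;
  return fallbackMs; // headers missing or unparseable
}
```

Pass it `res.headers` from a fetch response; any object with a `get(name)` method works.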