App API - LLM Service — docs.gipity.ai

This is the app service a deployed app calls. For the agent making one-off cross-model queries during a chat, see query-llm.

Building a chatbot/assistant UI? Don't hand-roll the message thread, streaming, and input box on top of this service - add the chatbot kit. It drops in a persona, scope guardrails, static knowledge, and a streaming bubble/headless engine wired to this LLM service, with no API keys needed. Use this raw service only when you need a non-chat LLM call.

The LLM service is available for every project with no setup (user_pays billing by default) - apps can call AI models immediately after deploy.

Use project_settings to change settings (optional):

Switch billing mode (user_pays ↔ owner_pays)
Restrict allowed models
Set max token cap
Disable the service entirely

Billing Modes

user_pays (default): Each app user pays from their own Gipity credits, so publishing an app never silently bills you for what your users consume - it's safe to share. Requires the caller to be signed in (Sign in with Gipity + LLM consent), so calls must come from the browser with a logged-in user. Load app-auth for the auth flow.
owner_pays: Your credits are consumed when anyone (including anonymous/logged-out users and serverless functions) calls the LLM. Simpler - no user login needed. Switch to this to subsidize usage or to call the LLM from a serverless function (a function has no browser session, so a user_pays call from it throws - see below).

Set it in source, not just via the tool. project_settings flips billing as a one-off, but that lives as server state that does not travel with your app - rebuild into a fresh project or fork it and it reverts to user_pays. To make the choice reproducible, declare it in a services deploy phase in gipity.yaml so every deploy reproduces it (the same applies to the image, tts, transcribe, etc. services):

- name: services
  type: services
  service_definitions:
    - service: llm
      billing_mode: owner_pays

See deploy for the full services phase. Billing is project-level (one setting for dev and prod). The phase owns only the billing mode; model/token restrictions set via project_settings are preserved.

Billing modes only govern the deployed app's runtime calls. Direct calls during development - gipity service call, gipity generate <image|video|speech|sound|music>, gipity chat, or the agent's own tools - always bill the caller (you): a logged-in owner is an identified payer even under user_pays. Never flip a service to owner_pays just to generate assets or test an endpoint - it's unnecessary, and while flipped the live app accepts anonymous usage on your credits.

Configuration Options

allowed_models: Restrict which models apps can use (e.g. only allow cheap models)
max_tokens: Cap output tokens (default 4096)
default_model: Model used when app doesn't specify one (default: gpt-5.4-mini)

Choosing a Model

Recommended: use an alias, not a concrete id. Set model to a capability tier and we keep it pointed at the current best model in that tier for you — when the model catalog is refreshed, your app rides along automatically with no code change:

small → claude-haiku-4-5
fast → claude-haiku-4-5
medium → claude-sonnet-5
large → claude-opus-4-8
xlarge → claude-opus-4-8
thinking → claude-opus-4-8
haiku → claude-haiku-4-5
sonnet → claude-sonnet-5
opus → claude-opus-4-8
fable → claude-fable-5
sol → gpt-5.6-sol
terra → gpt-5.6-terra
luna → gpt-5.6-luna
gpt → gpt-5.6-sol
codex → gpt-5.6-sol

small/fast are the cheapest, lowest-latency tier; medium is the balanced default; large/xlarge/thinking are the strongest. Most apps should just pick fast or medium and forget about it.

Or pin a concrete id (from the list below) if you need a specific model and want to freeze its behavior. The trade-off: a pinned id is yours to maintain — if that model is later retired from the catalog, calls will return 400 Unknown model until you update it. Aliases never have this problem.
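
If you do pin a concrete id, one way to soften the trade-off is to retry once on an alias when the pinned id is rejected. The following is a minimal sketch, not part of the service: askWithFallback is a hypothetical helper, and it assumes the 400 response body contains the phrase "Unknown model" (the exact error shape is not specified here).

```javascript
// Hypothetical helper: try the pinned model id first, then retry once on an
// alias if the service answers 400 "Unknown model".
// Assumption: the 400 body contains the phrase "Unknown model" somewhere.
async function askWithFallback(url, token, body, fallbackAlias = 'fast', fetchImpl = fetch) {
  const send = async (model) => {
    const res = await fetchImpl(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'X-App-Token': token },
      body: JSON.stringify({ ...body, model })
    });
    return { ok: res.ok, status: res.status, text: await res.text() };
  };

  let out = await send(body.model);
  if (out.status === 400 && /unknown model/i.test(out.text)) {
    out = await send(fallbackAlias); // pinned id looks retired; retry once on the alias
  }
  if (!out.ok) throw new Error(`LLM request failed (${out.status}): ${out.text}`);
  return JSON.parse(out.text);
}
```

The fetchImpl parameter exists only so the helper can be exercised without a network; in an app you would omit it and let it default to fetch.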

Available Models

claude-fable-5, claude-opus-4-8, claude-opus-4-7, claude-opus-4-6, claude-opus-4-5, claude-sonnet-5, claude-sonnet-4-6, claude-sonnet-4-5, claude-haiku-4-5, gpt-5.6-sol, gpt-5.6-terra, gpt-5.6-luna, gpt-5.5, gpt-5.5-pro, gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, gpt-5.4-nano, gemini-3.5-flash, gemini-3.1-pro-preview, gemini-3.1-flash-lite, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite

Endpoints

GET /api/<PROJECT_GUID>/services/llm/models - list available models
POST /api/<PROJECT_GUID>/services/llm - call the LLM

Request Format

The endpoint accepts OpenAI-compatible messages:

messages: Array of {role, content}. Roles: system, user, assistant. Max 20 messages.
prompt: Shorthand for a single user message (alternative to messages). Max 32,000 chars.
image: Base64 image with prompt ({ data, media_type }). Max 5 MB.
model: Override default model. Prefer an alias (fast, medium, large, …) over a concrete id — see Choosing a Model.
system_prompt: Custom system instructions (top-level field, takes precedence over system role messages).
temperature: 0-2 (creativity).
max_tokens: Output limit (capped at 4096).
stream: true for SSE streaming, false (default) for JSON response.
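
The limits above can be checked on the client before a request is sent. This is a rough sketch: validateLlmBody is a hypothetical helper, and how the service itself responds to out-of-range values (reject or clamp) is not specified here.

```javascript
// Limits copied from the request-format list above.
const LIMITS = { maxMessages: 20, maxPromptChars: 32000, maxTokens: 4096 };
const ROLES = ['system', 'user', 'assistant'];

// Hypothetical helper: returns a list of problems; an empty list means the
// body is within the documented limits.
function validateLlmBody(body) {
  const problems = [];
  const hasPrompt = typeof body.prompt === 'string';
  const hasMessages = Array.isArray(body.messages);

  if (!hasPrompt && !hasMessages) problems.push('provide either prompt or messages');
  if (hasPrompt && body.prompt.length > LIMITS.maxPromptChars) {
    problems.push('prompt exceeds 32,000 chars');
  }
  if (hasMessages) {
    if (body.messages.length > LIMITS.maxMessages) problems.push('more than 20 messages');
    for (const m of body.messages) {
      if (!ROLES.includes(m.role)) problems.push(`unknown role: ${m.role}`);
    }
  }
  if (body.temperature !== undefined && (body.temperature < 0 || body.temperature > 2)) {
    problems.push('temperature must be between 0 and 2');
  }
  if (body.max_tokens !== undefined && body.max_tokens > LIMITS.maxTokens) {
    problems.push('max_tokens above 4096 is capped by the service');
  }
  return problems;
}
```

Returning a list rather than throwing on the first problem lets a form show every issue at once.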

Image Support

Two image formats are accepted in message content arrays:

// OpenAI format (image_url with data URI)
{ type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } }

// Native format
{ type: 'image', data: 'iVBOR...', media_type: 'image/png' }

Only data: URIs are supported - external image URLs will return a 400 error.
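
Since external URLs are rejected, an app usually has to encode image bytes itself. The sketch below shows one way to build either content part from a Uint8Array. Both helper names are hypothetical, and it assumes the 5 MB limit is measured on the base64 string, which is the more conservative reading.

```javascript
// Assumption: the 5 MB limit applies to the base64-encoded string.
const MAX_IMAGE_CHARS = 5 * 1024 * 1024;

// Hypothetical helper: encode raw bytes as a native-format image part.
function toImagePart(bytes, mediaType) {
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b); // one char per byte, as btoa expects
  const data = btoa(binary);
  if (data.length > MAX_IMAGE_CHARS) {
    throw new Error('image exceeds 5 MB after base64 encoding');
  }
  return { type: 'image', data, media_type: mediaType };
}

// Same bytes in the OpenAI-style shape, wrapped in a data: URI.
function toImageUrlPart(bytes, mediaType) {
  const { data } = toImagePart(bytes, mediaType);
  return { type: 'image_url', image_url: { url: `data:${mediaType};base64,${data}` } };
}
```

In a browser, the bytes could come from new Uint8Array(await file.arrayBuffer()) for a File picked by the user.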

Response Format (OpenAI-compatible)

Non-streaming:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-5.4-mini",
  "choices": [{ "index": 0, "message": { "role": "assistant", "content": "..." }, "finish_reason": "stop" }],
  "usage": { "prompt_tokens": 100, "completion_tokens": 50, "total_tokens": 150 },
  "provider": "openai",
  "credits_used": 5
}
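
Most apps only need a few of these fields. As an illustration, a hypothetical summarizeCompletion helper could flatten the response like this (field names are taken from the example above; the helper itself is not part of the service):

```javascript
// Hypothetical helper: pick out the fields an app typically displays or logs.
function summarizeCompletion(data) {
  const choice = data.choices?.[0];
  return {
    text: choice?.message?.content ?? '',
    finishReason: choice?.finish_reason ?? null,
    model: data.model ?? null,
    totalTokens: data.usage?.total_tokens ?? null,
    creditsUsed: data.credits_used ?? null
  };
}
```

The optional chaining keeps the helper from throwing if a field is missing, for example when it is handed an error body by mistake.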

Streaming (SSE):

Content chunks: data: {"choices":[{"delta":{"content":"..."}}]}
Final chunk: finish_reason: "stop" with usage and credits_used
Terminator: data: [DONE]
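
The three event kinds can be told apart with a small pure function, which is easier to unit test than a read loop. This is a sketch under assumptions: drainSse is a hypothetical helper, events are taken to be separated by a blank line, and usage and credits_used are assumed to sit at the top level of the final chunk.

```javascript
// Hypothetical helper: consume every complete SSE event in the buffer.
// Returns the text produced, usage info from the final chunk (if seen),
// whether [DONE] was reached, and the unconsumed remainder, which should be
// prepended to the next network read.
function drainSse(buffer) {
  const events = buffer.split('\n\n');
  const rest = events.pop(); // last piece may be an incomplete event
  let text = '';
  let usage = null;
  let creditsUsed = null;
  let done = false;

  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const raw = event.slice(6);
    if (raw === '[DONE]') { done = true; break; }
    const chunk = JSON.parse(raw);
    text += chunk.choices?.[0]?.delta?.content ?? '';
    if (chunk.choices?.[0]?.finish_reason === 'stop') {
      usage = chunk.usage ?? null;             // assumed top-level on the final chunk
      creditsUsed = chunk.credits_used ?? null; // assumed top-level on the final chunk
    }
  }
  return { text, usage, creditsUsed, done, rest };
}
```

A caller would keep a running buffer, append each decoded network chunk to the previous rest, and stop reading once done is true.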

Client Code Example (Non-Streaming)

IMPORTANT: The token endpoint is on the API server, NOT the app host. You MUST use the absolute URL https://a.gipity.ai/api/token - never a relative path like /api/token. It is a POST request and the token is nested under data.

// 1. Get app token - MUST be absolute URL to API server, POST with app GUID
const tokenRes = await fetch('https://a.gipity.ai/api/token', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ app: '<PROJECT_GUID>' })
});
const { data: { token } } = await tokenRes.json();
// ✗ WRONG: fetch('/api/token')           - relative URL hits app host, not API
// ✗ WRONG: const { token } = await ...   - token is inside data: { data: { token } }

// 2. Call the LLM
const res = await fetch('https://a.gipity.ai/api/<PROJECT_GUID>/services/llm', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-App-Token': token },
  body: JSON.stringify({
    messages: [
      { role: 'system', content: 'Answer concisely.' },
      { role: 'user', content: 'What is the capital of France?' }
    ],
    model: 'fast'   // alias — rides the catalog refresh. Or pin a concrete id.
  })
});
const data = await res.json();
const answer = data.choices[0].message.content; // "The capital of France is Paris."

Handling out of credits (402) - don't fail silently

When the paying user has no credits left, the call returns HTTP 402 with { error: { code: "INSUFFICIENT_CREDITS", message } } - the message carries a top-up link. Always branch on res.ok and show the user something; an LLM feature that silently does nothing on a 402 is the #1 cause of confusing "it broke and I don't know why" reports.

const res = await fetch('https://a.gipity.ai/api/<PROJECT_GUID>/services/llm', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-App-Token': token },
  body: JSON.stringify({ prompt: 'Summarize this note', model: 'fast' })
});

if (!res.ok) {
  const { error } = await res.json();
  if (res.status === 402) {
    // Out of credits. Under user_pays (the default) it's THIS signed-in user who's
    // out - show them the message (it includes the top-up link) instead of a blank.
    showNotice(error.message); // e.g. "Insufficient credits. Buy more at https://prompt.gipity.ai/pricing"
    return;
  }
  if (res.status === 401) { /* not signed in - send them through Sign in with Gipity (app-auth) */ }
  throw new Error(error?.message || 'LLM request failed');
}
const data = await res.json();

Streaming requests get a clean 402 too. The credit check runs before the stream opens, so an out-of-credits streaming call comes back as a normal 402 JSON body - not a mid-stream error chunk. Check res.ok before you start reading the SSE body.
Who's out of credits depends on billing mode. Under user_pays (default) it's the signed-in end user, who tops up their own Gipity credits at the link. Under owner_pays it's you, the app owner - run gipity credits buy (or top up at the pricing page) to refill.
What still works at zero credits: location and notify are not credit-gated, so those keep working - only the AI/media services (llm, image, video, tts, audio, transcribe) stop. So "out of credits" degrades one feature, it doesn't break the whole app.

Calling from a serverless function (server-side)

Inside a function, don't fetch this endpoint - a.gipity.ai is blocked from the sandbox. The LLM is injected as a capability on the function's second argument, called server-to-server with no token and no fetch:

export default async function ask(ctx, { llm }) {
  const { text, usage, credits_used } = await llm({
    model: 'fast',   // alias — rides the catalog refresh. Or pin a concrete id.
    messages: [{ role: 'user', content: ctx.body.question }],
    // same fields as the HTTP body: prompt | messages, system_prompt, temperature, max_tokens, image
  });
  return { answer: text };
}

- name: ask
  auth: public
  services: ['llm']      # required - without it, llm() throws

Returns { text, model, provider, usage: { input_tokens, output_tokens }, credits_used } (non-streaming only - functions return JSON, so there's no stream option here).
Calling the LLM from a function requires owner_pays. The default is user_pays, and a function has no browser session through which a user could sign in and consent, so an LLM call from a function throws until the llm service is switched to owner_pays (via project_settings, or a services phase in gipity.yaml so the setting survives a rebuild or fork). Alternatively, make the LLM call from the browser with a signed-in user.
Capped at max_llm_calls per invocation (default 5). See app-development for the full injected-services list.
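
Because llm() can throw (wrong billing mode, missing services entry) and is capped per invocation, a function that makes several calls may want to guard both. The sketch below is illustrative only: askMany and the ctx.body.questions field are hypothetical, and it assumes the thrown value is an Error with a message. In a deployed function this would be the default export; it is written as a plain function here so it can be run with a stand-in llm.

```javascript
// Mirrors the documented default for max_llm_calls; adjust if your project differs.
const MAX_LLM_CALLS = 5;

// Hypothetical function body: answers up to MAX_LLM_CALLS questions and
// reports a failure instead of letting the whole invocation crash.
async function askMany(ctx, { llm }) {
  const questions = (ctx.body.questions ?? []).slice(0, MAX_LLM_CALLS); // stay within the cap
  const answers = [];
  for (const question of questions) {
    try {
      const { text } = await llm({ model: 'fast', prompt: question });
      answers.push({ question, answer: text });
    } catch (err) {
      // For example: the service is still on user_pays, or 'llm' is not listed in services.
      return { error: `LLM call failed: ${err.message}`, answers };
    }
  }
  return { answers };
}
```

Returning the answers collected so far alongside the error lets the caller show partial results.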

Client Code Example (Streaming)

const res = await fetch('https://a.gipity.ai/api/<PROJECT_GUID>/services/llm', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-App-Token': token },
  body: JSON.stringify({
    messages: [{ role: 'user', content: 'Write a story' }],
    stream: true
  })
});
// Errors (401, 402, ...) arrive as a normal JSON body before the stream opens.
if (!res.ok) {
  const { error } = await res.json();
  throw new Error(error?.message || 'LLM request failed');
}
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let text = '';
let finished = false;
while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n'); // SSE events are separated by a blank line
  buffer = events.pop();               // keep any trailing partial event for the next read
  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const raw = event.slice(6);
    if (raw === '[DONE]') { finished = true; break; } // leave both loops
    const chunk = JSON.parse(raw);
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) text += content; // render incrementally here, e.g. update an element's textContent
    if (chunk.choices?.[0]?.finish_reason === 'stop') {
      console.log('Usage:', chunk.usage);
    }
  }
}

Image Description Example

const res = await fetch('https://a.gipity.ai/api/<PROJECT_GUID>/services/llm', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-App-Token': token },
  body: JSON.stringify({
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image' },
        { type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } }
      ]
    }]
  })
});
const data = await res.json();
const description = data.choices[0].message.content;

Limits

Rate limit: 600 requests per 5-minute window (per IP)
Max messages: 20 per request
Max prompt length: 32,000 chars
Max output tokens: 4096
Max image size: 5 MB (base64)
Timeout: 60s
Standard RateLimit-* headers included in responses
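
The rate-limit headers make it possible to back off before hitting a rejection. The sketch below is a guess at how that could look: msUntilNextRequest is a hypothetical helper, and it assumes the IETF-draft header names RateLimit-Remaining and RateLimit-Reset, with Reset given in seconds until the window resets. Check the actual response headers before relying on it.

```javascript
// Hypothetical helper: how long to wait before sending the next request.
// Assumes RateLimit-Remaining (requests left) and RateLimit-Reset (seconds
// until the window resets); returns 0 when either header is missing or malformed.
function msUntilNextRequest(headers) {
  const remainingRaw = headers.get('RateLimit-Remaining');
  const resetRaw = headers.get('RateLimit-Reset');
  if (remainingRaw == null || resetRaw == null) return 0;

  const remaining = Number(remainingRaw);
  const resetSeconds = Number(resetRaw);
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSeconds)) return 0;

  return remaining > 0 ? 0 : resetSeconds * 1000;
}
```

It accepts anything with a get method, so res.headers from a fetch response can be passed directly.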

Testing

The LLM service is tested end-to-end: an E2E test asks the agent to build an app that calls the LLM, deploys it, then verifies the page renders the correct AI response in a headless browser.