GammaInfra API documentation
Last updated: 20 May 2026
Overview
GammaInfra is an intelligent LLM routing engine. The router classifies every prompt by task and dispatches to the best-fit model across every major LLM provider — delivered through one API.
If you already use OpenAI or any OpenAI-compatible SDK, switching to GammaInfra is a one-line change — update the base_url, keep everything else the same.
https://gammainfra.com · Dashboard: dashboard.gammainfra.com · Status: status.gammainfra.com · Sign up: gammainfra.com/#signup
Quickstart
1. Get an API key
Sign up at gammainfra.com — email + password, one-time verification link, no credit card. New accounts come with $3.00 of free credit, enough to try the router end-to-end. Your API key is shown once after you click the verification link — store it somewhere safe. Need more keys later? Issue and revoke them from the dashboard.
2. Make your first call
curl -s -X POST https://api.gammainfra.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-gammainfra-..." \
-d '{
"model": "gammainfra/auto",
"messages": [{"role": "user", "content": "Explain transformers in one paragraph."}]
}'
The response is identical to OpenAI’s format. gammainfra/auto lets the router pick the best model for your prompt.
3. Drop-in replacement
from openai import OpenAI

# Point the stock OpenAI client at GammaInfra; nothing else changes.
client = OpenAI(
    api_key="sk-gammainfra-...",
    base_url="https://api.gammainfra.com/v1",
)
response = client.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
The same pattern works with LangChain, LlamaIndex, and any other OpenAI-compatible library.
Ecosystem compatibility
If your code targets an OpenAI-compatible router, the migration is usually two strings — base_url and api_key. Both /v1/* and /api/v1/* prefixes are mounted with identical responses, so SDKs that hard-code either path keep working unchanged.
from openai import OpenAI

client = OpenAI(
    api_key="sk-gammainfra-...",
    base_url="https://gammainfra.com/api/v1",
)
Body fields
GammaInfra accepts the common ecosystem-extension request fields. The ones below change behavior; everything else is accepted silently for forward compatibility.
| Field | Behavior |
|---|---|
models: [str, ...] | Honored. Becomes the authoritative fallback chain — tried in order, no auto-router. Fails loud (503) on exhaustion rather than silently picking another model. |
provider.sort | Honored. "price" → cost-optimized routing, "throughput"/"latency" → latency-optimized. |
provider.only / .ignore | Honored. Filter the candidate provider set. |
provider.order | Honored. Listed providers tried first; rest keep their relative order. |
provider.allow_fallbacks: false | Honored. Returns 503 on the first provider failure instead of trying the next candidate. |
provider.max_price | Honored. {prompt, completion} in USD per 1M tokens. Endpoints exceeding either cap are skipped. |
reasoning: {effort, ...} | Honored. effort translates to reasoning_effort for the GPT-5 family. Other providers drop it harmlessly. |
stream_options.include_usage | Honored on OpenAI; forwarded best-effort elsewhere. |
tool_choice, response_format, parallel_tool_calls, seed, top_k, min_p, top_a, repetition_penalty, logprobs, top_logprobs, user | Forwarded to providers that support them. |
transforms, route | Accepted silently (we always cascade through the fallback chain on failure). |
plugins: [{id: "web"}] | 501 web_plugin_unsupported. The :online model variant is also rejected (400). |
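As a rough illustration of the fields in the table above, the sketch below assembles a request body with an explicit fallback chain and provider preferences. The helper name `build_chat_body` and its parameters are hypothetical, and how `model` interacts with `models` is not specified on this page, so the sketch simply sends both.

```python
import json


def build_chat_body(prompt, model="gammainfra/auto", models=None,
                    sort=None, max_price=None, allow_fallbacks=None):
    """Assemble a chat-completions body (hypothetical helper, not part of any SDK)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if models:
        # An explicit list becomes the fallback chain, tried in order.
        body["models"] = list(models)
    provider = {}
    if sort is not None:
        provider["sort"] = sort  # "price", "throughput", or "latency"
    if max_price is not None:
        prompt_cap, completion_cap = max_price  # USD per 1M tokens
        provider["max_price"] = {"prompt": prompt_cap, "completion": completion_cap}
    if allow_fallbacks is not None:
        provider["allow_fallbacks"] = allow_fallbacks
    if provider:
        body["provider"] = provider
    return body


body = build_chat_body(
    "Summarize this changelog.",
    models=["openai/gpt-5.4-mini", "anthropic/claude-haiku-4-5"],
    sort="price",
    max_price=(1.0, 5.0),
)
print(json.dumps(body, indent=2))
```

With the OpenAI Python SDK, non-standard fields such as `models` and `provider` would typically be passed through the `extra_body` argument rather than as keyword arguments.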
Model name aliases
GammaInfra uses the conventional vendor/slug format and accepts the common ecosystem aliases directly:
| Input | Behavior |
|---|---|
Third-party router vendor/auto aliases | Mapped to gammainfra/auto for migration compatibility. |
vendor/model:nitro | Suffix stripped, preference forced to latency. |
vendor/model:floor | Suffix stripped, preference forced to cost. |
vendor/model:exacto | Suffix stripped (GammaInfra quality-sorts by default). |
vendor/model:online | 400 web_plugin_unsupported. |
vendor/model:free | 400 free_tier_unavailable. New accounts get $3.00 of free balance on signup — see Balance & pricing. |
meta-llama/llama-3.3-70b-instruct | Routed via Groq (groq/llama-3.3-70b-versatile). |
meta-llama/llama-3.1-8b-instruct | Routed via Groq (groq/llama-3.1-8b-instant). |
Unknown :suffix variants are stripped silently for forward-compat. For the full catalogue of native model IDs, see Model names below.
Compatibility endpoints
The auxiliary endpoints SDKs commonly call are mounted under both prefixes and return ecosystem-compatible JSON.
| Endpoint | Returns |
|---|---|
GET /api/v1/credits | {data: {total_credits, total_usage}} — lifetime top-ups and lifetime spend in USD. |
GET /api/v1/generation?id=<request_id> | Post-hoc stats for a previous request: tokens, cost, latency. Use the X-GammaInfra-Request-Id response header value as the id. Customer-scoped (no cross-account lookups). |
GET /api/v1/key | Info about the calling key — label, name, lifetime usage. Per-key spend limits return null (not yet supported). |
GET /api/v1/models/{author}/{slug}/endpoints | Endpoint listing for a single model. GammaInfra routes each model through one provider, so the array has one entry. |
POST /api/v1/completions | Legacy text completion. Internally wraps your prompt as a chat message and rewrites the response to the text_completion shape (choices[*].text instead of choices[*].message.content). Streaming supported. |
GET /api/v1/models | Catalogue with both GammaInfra-native fields (input_cost_per_1k, etc.) and ecosystem-shaped fields (pricing.{prompt,completion} per-token strings, context_length, architecture, supported_parameters, top_provider). |
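A minimal sketch of the `generation` lookup from the table above, using only the standard library. It assumes the `/api/v1` prefix is served from `api.gammainfra.com`; the helper name and the sample request id are made up for illustration.

```python
import urllib.parse
import urllib.request

# Assumed host; this page states that both /v1/* and /api/v1/* are mounted.
BASE_URL = "https://api.gammainfra.com/api/v1"


def generation_request(request_id, api_key):
    """Build (but do not send) the GET request for post-hoc stats."""
    query = urllib.parse.urlencode({"id": request_id})
    return urllib.request.Request(
        f"{BASE_URL}/generation?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# The id is the X-GammaInfra-Request-Id value from an earlier response.
req = generation_request("req_123", "sk-gammainfra-test")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then json.load() the response body.
```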
Headers
You can send HTTP-Referer and X-Title with your requests; GammaInfra stores them with each request log so per-app analytics work consistently. Both are best-effort and entirely optional.
What's not supported
- Web search (plugins: [{id: "web"}] and the :online variant): not implemented. We reject explicitly so your prompts don't silently degrade.
- Free tier (:free): we don't offer one. Every account starts with $3.00 of free balance; top up from /billing/checkout.
- Per-key spend limits (limit, limit_remaining, limit_reset): not yet implemented; /key returns null for these fields.
- OAuth/PKCE key exchange: sign in via the dashboard and copy a key.
Authentication
Every authenticated request carries a Bearer token. The public endpoints are /v1/models, /v1/status, /health, /ready, and the signup/login routes; everything else needs a valid key.
Authorization: Bearer sk-gammainfra-...
Keys are prefixed sk-gammainfra-. The plaintext is only returned on creation — GammaInfra stores a bcrypt hash. Create additional keys or revoke old ones from the dashboard.
| Status | Meaning |
|---|---|
401 | Missing or invalid API key |
402 | Insufficient credits — top up and retry |
Smart routing
Send model: "gammainfra/auto" and the router classifies your prompt into one of 8 task labels (plus 2 deterministic capability flags), then dispatches to the best-fit model for that type. If the primary model is unavailable, GammaInfra falls back through a chain of 3–4 models automatically.
Capability flags are decided up-front from the request body, never from prompt text:
| Flag | When it fires |
|---|---|
tool_use | Request body has a non-empty tools array |
multimodal | Any message contains an image_url part |
For everything else the prompt text is classified into one of these 8 labels:
| Task label | When it fires |
|---|---|
reasoning | Multi-step analysis, math, root-cause questions (e.g. analyse, compare, evaluate, prove, probability, root cause, step-by-step) |
code | Code generation, debugging, refactoring (e.g. function, implement, debug, regex, unit test) |
creative | Original generative writing (e.g. poem, story, essay, brainstorm, lyrics) |
rewrite | Edit or transform existing text while preserving meaning |
extraction | Pull structured fields out of text (e.g. sentence-start extract, parse, list all, classify, format as json) |
summarize | Compress text (e.g. sentence-start summarize, tldr, key points, brief, condense) |
translation | Cross-language conversion (e.g. sentence-start translate, in/to spanish/french/japanese/…) |
chat | Default when nothing else matches |
To bias the router, send X-GammaInfra-Preference: quality (default), cost, or latency.
Want finer control? Send a continuous X-GammaInfra-Cost-Quality: 0.0 (pure quality) … 1.0 (pure cost) header and GammaInfra will place you on that axis. The server echoes X-GammaInfra-Cost-Quality-Applied on the response so you can log exactly what landed. An explicit X-GammaInfra-Preference: latency always wins over the cost-quality dial.
Want to opt out? Send X-GammaInfra-Routing: off and GammaInfra will route straight to the exact model you named in model.
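The routing headers described above can be assembled client-side before each call. The sketch below is one way to do that; `routing_headers` is a hypothetical helper, and it validates locally because this page says the server ignores malformed dial values rather than rejecting them.

```python
def routing_headers(preference=None, cost_quality=None, routing_off=False):
    """Build the optional routing headers (hypothetical helper)."""
    headers = {}
    if preference is not None:
        if preference not in {"quality", "cost", "latency"}:
            raise ValueError(f"unknown preference: {preference!r}")
        headers["X-GammaInfra-Preference"] = preference
    if cost_quality is not None:
        # 0.0 = pure quality, 1.0 = pure cost.
        if not 0.0 <= cost_quality <= 1.0:
            raise ValueError("cost_quality must be within 0.0 and 1.0")
        headers["X-GammaInfra-Cost-Quality"] = f"{cost_quality:.2f}"
    if routing_off:
        # Skip smart routing and use the exact model named in the body.
        headers["X-GammaInfra-Routing"] = "off"
    return headers


print(routing_headers(cost_quality=0.8))
```

With the OpenAI Python SDK these would usually be passed per call via `extra_headers`, or once via `default_headers` when constructing the client.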
Model names
Smart aliases (recommended)
| Model name | Behaviour |
|---|---|
gammainfra/auto | Picks the best-fit model for your prompt type |
gammainfra/fast | Optimises for lowest latency (equivalent to X-GammaInfra-Preference: latency) |
gammainfra/cheap | Optimises for lowest cost (equivalent to X-GammaInfra-Preference: cost) |
Bare model names (logical)
Type a bare model name and GammaInfra's router picks the best endpoint that serves it. Useful when the same model is hosted by more than one provider (e.g. Claude Opus is reachable via the native Anthropic API and Amazon Bedrock).
claude-opus-4-7
claude-opus-4-6
claude-sonnet-4-6
claude-haiku-4-5
nova-pro
nova-2-lite
gpt-5-mini
gpt-5.4-mini
deepseek-v4-pro
mistral-large-2512
llama-3.3-70b-versatile
gemini-3.1-pro-preview
grok-4-1-fast-non-reasoning
Bare names that aren't in the registry return 404 model_not_found (no silent fallback). Use X-GammaInfra-Routing: literal to disable cross-host routing and pin the first registered endpoint instead.
Pin a specific model
Prefix any model with its provider slug:
openai/gpt-5.4
openai/gpt-5.4-mini
openai/gpt-5.4-nano
openai/gpt-5-mini
anthropic/claude-opus-4-6
anthropic/claude-sonnet-4-6
anthropic/claude-haiku-4-5
google/gemini-3.1-pro-preview
google/gemini-3-flash-preview
google/gemini-2.5-pro
google/gemini-2.5-flash
mistral/mistral-large-2512
mistral/mistral-small-2603
mistral/codestral-2508
mistral/devstral-2512
groq/llama-3.3-70b-versatile
groq/llama-3.1-8b-instant
groq/qwen/qwen3-32b
deepseek/deepseek-v4-pro
deepseek/deepseek-v4-flash
# Legacy V3 slugs — still routable via direct pin, retire 2026-07-24:
# deepseek/deepseek-chat, deepseek/deepseek-reasoner
grok/grok-4.20-0309-reasoning
grok/grok-4-1-fast-reasoning
grok/grok-4-1-fast-non-reasoning
bedrock/us.anthropic.claude-opus-4-7
bedrock/us.anthropic.claude-opus-4-6-v1
bedrock/us.anthropic.claude-sonnet-4-6
bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
bedrock/meta.llama3-70b-instruct-v1:0
bedrock/mistral.mistral-large-2402-v1:0
bedrock/us.amazon.nova-pro-v1:0
bedrock/us.amazon.nova-2-lite-v1:0
Note on Bedrock IDs: Most Bedrock models require the us. cross-region inference profile prefix (Anthropic Claude, Amazon Nova). A few older models (Meta Llama 3, Mistral Large 24.02) use the bare ID without the prefix. The exact strings above are what AWS Bedrock accepts; copy verbatim. Bedrock's catalog of newer Meta and Mistral models lags behind these providers' direct APIs — for the latest Llama and Mistral, use groq/llama-3.3-70b-versatile or mistral/mistral-large-2512 respectively.
For the full, authoritative list:
curl -s https://api.gammainfra.com/v1/models | jq .
Streaming
Streaming works exactly like OpenAI — set stream: true and read Server-Sent Events. All providers are normalised to the OpenAI SSE format, so your existing code works unchanged.
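# Reuses the client created in the Quickstart.
stream = client.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
    stream=True,
)
# Print each content delta as it arrives; skip chunks that carry no choices.
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)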
Headers
Request headers
| Header | Value | Purpose |
|---|---|---|
Authorization | Bearer sk-gammainfra-… | Required |
Content-Type | application/json | Required |
X-GammaInfra-Routing | off | Disable smart routing; use the exact model you named |
X-GammaInfra-Routing | literal | Disable logical resolution. Bare names take the first registered endpoint instead of the router's preference-based pick. Independent of off. |
X-GammaInfra-Region | us / eu / apac or exact AWS region (e.g. us-east-1) | Constrain endpoint selection to a region group or exact region. Native APIs (region-agnostic) always pass. Combine with provider.only: ["bedrock"] for strict-residency mode. |
X-GammaInfra-Preference | quality (default) / cost / latency | Bias the router when using gammainfra/auto |
X-GammaInfra-Cost-Quality | Decimal in 0.0…1.0 | Continuous cost/quality dial. 0.0 = pure quality, 1.0 = pure cost. Overrides X-GammaInfra-Preference: quality/cost. An explicit latency preference still wins. Malformed values are ignored and the legacy preset applies. |
X-GammaInfra-Max-Latency-Ms | Integer ms in 60…600000 (10 minutes) | Bound total wall time across the fallback chain. Set to your hard deadline (e.g. 5000 for a 5-second SLA). On exceedance GammaInfra cancels any in-flight upstream call and returns 504 max_latency_exceeded. Strictly opt-in — absent header preserves prior behavior (per-provider 30 s default). Malformed or out-of-range values are ignored. |
Response headers
| Header | Meaning |
|---|---|
X-GammaInfra-Request-Id | Correlation ID — include when filing a support request |
X-GammaInfra-Provider | Which provider served the response (e.g. openai, anthropic) |
X-GammaInfra-Router-Version | Which routing path served the request. Values: v2 (default smart router), v2_keyword (sentence-start keyword shortcut), v2_flag (capability short-circuit for multimodal or tools), v2_short_prompt (length-based fast-path for trivial prompts), v2_hedged (parallel top-2 race for gammainfra/fast), v2_logical (cross-host logical-name routing), v1_fallback (low-confidence fall-through to the keyword router), direct (you pinned a specific provider/model), logical_literal (you opted into X-GammaInfra-Routing: literal), models_override (you supplied a models[] fallback list). |
X-GammaInfra-Logical-Model | The router's label for the prompt (e.g. reasoning, code, chat) or the logical model name (e.g. claude-opus-4-7) when bare-name / vendor-prefix routing fired. Use it to correlate your cost analytics with the type of work. |
X-GammaInfra-Endpoint | The actual physical endpoint that served the request, formatted as provider/model (e.g. bedrock/us.anthropic.claude-opus-4-7). Always present on successful chat completions. |
X-GammaInfra-Region-Used | The region the request was served from (e.g. us-east-1). Present on routes that went through a regional endpoint; absent for native APIs which are region-agnostic. |
X-GammaInfra-Flags | Capability flags fired up-front, comma-separated (e.g. tool_use, multimodal). Absent when no flag fired. |
X-GammaInfra-Cost-USD | Per-request cost in USD with 6 decimals (e.g. 0.000087). Present on every successful chat completion. Sum it across calls to get exact spend without parsing usage × per-model price tables. Reflects provider list price — GammaInfra adds 0% token markup. |
X-GammaInfra-Input-Cost-USD / X-GammaInfra-Output-Cost-USD | Per-direction cost split for the same request, both in USD with 6 decimals. input + output = X-GammaInfra-Cost-USD within rounding. Useful for chargeback / per-team attribution against provider-side dashboards (Bedrock CloudWatch, OpenAI usage page) without a client-side recompute. |
X-GammaInfra-Cost-Quality-Applied | Present whenever the cost/quality dial drove the routing decision. Value is the parsed float (e.g. 0.800) so you can log and replay the decision. |
X-GammaInfra-Fallback-Chain | Comma-separated provider/model list actually attempted on this request, in order. A single entry means one leg was tried; multiple entries mean GammaInfra cascaded after a failure. Useful for post-mortems. |
X-GammaInfra-Attempted-Count | Integer count of legs attempted — matches len(X-GammaInfra-Fallback-Chain.split(",")). Use when you need to detect whether a fallback occurred without parsing the chain header. |
X-GammaInfra-Fallback-Reason | Why the chain walked past the first pick (e.g. provider_error, low_confidence, flag_chain, short_prompt_chat, v2_keyword, models_override). |
X-GammaInfra-Rust-Version | Version of the gateway's native fast-path components (e.g. 0.1.0). Useful for support correlation; treat as opaque. |
X-RateLimit-Limit / X-RateLimit-Remaining / X-RateLimit-Reset | Standard sliding-window rate-limit signals, keyed per API key. Default limit is 240 requests/minute. |
X-GammaInfra-Cache-Mode | Which cache mode was applied to this request. Values: auto (gateway injected breakpoints on the system+tools prefix), aggressive (also cached conversation history turns), manual (you supplied your own cache_control markers; gateway did not add more), off (caching disabled for this request). Always present on successful chat completions. |
X-GammaInfra-Cache-Read-Tokens | Number of tokens served from cache on this request. Only emitted when non-zero. Combined with X-GammaInfra-Cache-Write-Tokens and X-GammaInfra-Cost-USD, lets you verify caching is working and compute your actual effective rate without parsing provider-specific usage fields. |
X-GammaInfra-Cache-Write-Tokens | Number of tokens written to cache on this request (cache-priming cost). Only emitted when non-zero. Cache writes carry a small premium over standard input pricing for Anthropic and Bedrock; your X-GammaInfra-Cost-USD header reflects the all-in cost including that premium. |
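For logging, the response headers in the table above can be reduced to a small record. This is a sketch only: `summarize_response_headers` is a hypothetical helper and the sample values are invented for illustration.

```python
def summarize_response_headers(headers):
    """Pull routing and cost details out of a response-header mapping."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    raw_chain = h.get("x-gammainfra-fallback-chain", "")
    chain = [leg.strip() for leg in raw_chain.split(",") if leg.strip()]
    cost = h.get("x-gammainfra-cost-usd")
    return {
        "request_id": h.get("x-gammainfra-request-id"),
        "endpoint": h.get("x-gammainfra-endpoint"),
        "cost_usd": float(cost) if cost is not None else None,
        "fallback_chain": chain,
        "fell_back": len(chain) > 1,  # more than one leg means a cascade happened
    }


sample = {
    "X-GammaInfra-Request-Id": "example-request-id",
    "X-GammaInfra-Endpoint": "bedrock/us.anthropic.claude-opus-4-7",
    "X-GammaInfra-Cost-USD": "0.000087",
    "X-GammaInfra-Fallback-Chain": "anthropic/claude-opus-4-7,bedrock/us.anthropic.claude-opus-4-7",
}
summary = summarize_response_headers(sample)
print(summary)
```

With the OpenAI Python SDK, raw headers are reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers`.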
Preference precedence
You can express a routing preference through several channels: request headers, the request body, a model suffix, or a model shortcut. When more than one is set, the highest entry in this list wins:
1. X-GammaInfra-Preference: latency header. Always wins (latency is orthogonal to cost/quality).
2. Body provider.sort (ecosystem-compat). price → cost, throughput/latency → latency.
3. X-GammaInfra-Cost-Quality header. Continuous 0.0…1.0 dial.
4. X-GammaInfra-Preference: quality | cost header. Legacy preset.
5. Model-slug variant (:nitro → latency, :floor → cost) or the gammainfra/fast / gammainfra/cheap shortcuts.
6. Default: quality.
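The precedence rules above can be modelled client-side, for example to predict which preference a request will end up with before sending it. This is a sketch of the documented ordering, not the router's actual code; the function names and the `cost_quality=…` return format are invented for illustration.

```python
SORT_TO_PREF = {"price": "cost", "throughput": "latency", "latency": "latency"}
SUFFIX_TO_PREF = {":nitro": "latency", ":floor": "cost"}
SHORTCUT_TO_PREF = {"gammainfra/fast": "latency", "gammainfra/cheap": "cost"}


def parse_dial(raw):
    """Return the dial as a float in [0, 1], or None when missing or malformed."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return value if 0.0 <= value <= 1.0 else None


def effective_preference(model, headers=None, body=None):
    """Walk the documented precedence list from top to bottom."""
    h = {k.lower(): v for k, v in (headers or {}).items()}
    pref = h.get("x-gammainfra-preference")
    if pref == "latency":  # 1. explicit latency header always wins
        return "latency"
    sort = ((body or {}).get("provider") or {}).get("sort")
    if sort in SORT_TO_PREF:  # 2. body provider.sort
        return SORT_TO_PREF[sort]
    dial = parse_dial(h.get("x-gammainfra-cost-quality"))
    if dial is not None:  # 3. continuous dial (malformed values are skipped)
        return f"cost_quality={dial:.3f}"
    if pref in ("quality", "cost"):  # 4. legacy preset header
        return pref
    if model in SHORTCUT_TO_PREF:  # 5. shortcut or slug suffix
        return SHORTCUT_TO_PREF[model]
    for suffix, mapped in SUFFIX_TO_PREF.items():
        if model.endswith(suffix):
            return mapped
    return "quality"  # 6. default


print(effective_preference("openai/gpt-5.4:floor", {"X-GammaInfra-Cost-Quality": "0.8"}))
```

In the example the dial outranks the `:floor` suffix, so the request lands on the continuous axis rather than being forced to cost.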
Balance & pricing
- Your balance is denominated in USD: no in-platform credits unit, no conversion math. The dashboard, ledger, and GET /v1/billing/balance all return dollars (e.g. 2.4738 means $2.4738 of remaining balance).
- New accounts start with $3.00 of free balance.
- 0% markup on tokens — per-request provider cost is passed through at list prices.
- 5% fee on managed top-ups (3% launch-window rate during the first 60 days).
- Minimum top-up: $10. No subscription. Balance never expires.
Approximate cost per 1M tokens
| Model | Input | Output |
|---|---|---|
gammainfra/auto (routed to the best model for each prompt) | varies per task type — see rows below | varies per task type — see rows below |
openai/gpt-5.5 | $5.00 | $30.00 |
openai/gpt-5.4 | $2.00 | $8.00 |
openai/gpt-5.4-mini | $0.40 | $1.60 |
openai/gpt-5-mini | $0.25 | $2.00 |
anthropic/claude-opus-4-7 | $5.00 | $25.00 |
anthropic/claude-sonnet-4-6 | $3.00 | $15.00 |
google/gemini-3.1-pro-preview | $1.25 | $5.00 |
google/gemini-3-flash-preview | $0.30 | $2.50 |
deepseek/deepseek-v4-pro | $1.74 | $3.48 |
deepseek/deepseek-v4-flash | $0.14 | $0.28 |
groq/llama-3.1-8b-instant | $0.06 | $0.08 |
Costs above are provider list prices — GammaInfra passes them straight through. The only GammaInfra fee is the 5% we charge when you top up credits (3% during the launch window). For the full cost table and legal terms, see the Terms of Service.
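To make the per-1M-token prices concrete, here is a small worked estimate. The prices are copied from three rows of the table above; the helper name and the token counts are illustrative only, and the result ignores cache discounts and reasoning tokens.

```python
# (input, output) USD per 1M tokens, copied from the table above.
PRICES_PER_1M = {
    "openai/gpt-5.4-mini": (0.40, 1.60),
    "anthropic/claude-sonnet-4-6": (3.00, 15.00),
    "groq/llama-3.1-8b-instant": (0.06, 0.08),
}


def estimate_cost_usd(model, input_tokens, output_tokens):
    """Approximate list-price cost of one request."""
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# 10,000 input tokens and 2,000 output tokens on gpt-5.4-mini:
# 10,000 x $0.40/1M + 2,000 x $1.60/1M = $0.0040 + $0.0032
cost = estimate_cost_usd("openai/gpt-5.4-mini", 10_000, 2_000)
print(f"{cost:.6f}")  # prints 0.007200
```

For exact spend, prefer the X-GammaInfra-Cost-USD response header over a client-side estimate like this one.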
Reasoning tokens on gpt-5 and DeepSeek V4
OpenAI’s gpt-5 family and DeepSeek’s V4 reasoner (deepseek-v4-pro) bill hidden “reasoning tokens” in addition to the visible output. Reasoning tokens are the model’s chain-of-thought and are not returned in the response but are counted in usage.completion_tokens.
GammaInfra silently caps gpt-5 reasoning at max_completion_tokens=2048 when the caller omits the parameter, and picks a conservative reasoning_effort based on the router’s logical label (chat → low, code/summarize → medium, reasoning/math → high). This prevents the “‘hi’ burned 320 reasoning tokens” pathology from reaching your bill, but you should still budget 2–4× visible output tokens for gpt-5-family calls in batch sizing. Inspect usage.completion_tokens_details.reasoning_tokens in any response to see the split.
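A short sketch of reading the reasoning split from a response's usage object, treated here as a plain dict. The helper name is hypothetical and the sample numbers are invented (they echo the 320-token example above).

```python
def split_completion_tokens(usage):
    """Separate hidden reasoning tokens from visible output tokens."""
    total = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    # Reasoning tokens are counted inside completion_tokens, so subtract them.
    return {"reasoning": reasoning, "visible": total - reasoning}


sample_usage = {
    "prompt_tokens": 12,
    "completion_tokens": 350,
    "completion_tokens_details": {"reasoning_tokens": 320},
}
print(split_completion_tokens(sample_usage))  # {'reasoning': 320, 'visible': 30}
```

Providers that do not report a breakdown simply yield zero reasoning tokens here.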
Prompt caching and your bill
When a repeated prefix is served from cache, the provider charges a reduced rate for those tokens. GammaInfra passes the discount straight through — you are billed at the provider’s actual cache-read rate, never the full input rate. Your X-GammaInfra-Cost-USD header reflects the all-in cost including any cache-read discounts and cache-write premiums on the same call. For per-direction detail, see X-GammaInfra-Cache-Read-Tokens and X-GammaInfra-Cache-Write-Tokens in the response headers above.
Cache writes carry a small premium over standard input pricing for Anthropic and Bedrock (the provider charges extra to prime the cache). On first use the total cost is slightly higher; on repeated calls the cache-read savings more than offset the initial write. Set X-GammaInfra-Cache: off if you are sending one-off requests and don’t want the write overhead.
Check your balance
curl -s https://api.gammainfra.com/v1/billing/balance \
-H "Authorization: Bearer sk-gammainfra-..."
{"balance_usd": 0.97, "customer_id": "..."}
Top up
Top up your balance from dashboard.gammainfra.com → Top up. You’ll be redirected to Stripe’s hosted checkout and back to your dashboard once payment clears. Card data is handled by Stripe — GammaInfra never sees it. Amount range: $5 – $1000; your balance updates within seconds of Stripe’s confirmation.
Bring your own key (BYOK)
Optional. By default GammaInfra uses its own provider API keys on your behalf — one GammaInfra key, every model. If you already have a direct relationship with a provider, add your own key at dashboard.gammainfra.com → Provider Keys and GammaInfra will route requests to that provider through your key instead.
- Your key is stored encrypted at rest (Fernet).
- The provider bills you directly (on your account with them) for the tokens a BYOK-routed request consumes. GammaInfra only charges a small per-request routing fee from a separate prepaid BYOK balance — see BYOK pricing below.
- Revoking or deleting your key falls back to managed routing (if the provider offers managed access) or skips that provider.
- Supported providers: openai, anthropic, google, mistral, groq, deepseek, grok.
Add a key
curl -s -X POST https://api.gammainfra.com/v1/provider-keys \
-H "Authorization: Bearer sk-gammainfra-..." \
-H "Content-Type: application/json" \
-d '{"provider_name": "openai", "api_key": "sk-..."}'
List your keys
curl -s https://api.gammainfra.com/v1/provider-keys \
-H "Authorization: Bearer sk-gammainfra-..."
Delete a key
curl -s -X DELETE https://api.gammainfra.com/v1/provider-keys/openai \
-H "Authorization: Bearer sk-gammainfra-..."
Or manage all of this from dashboard.gammainfra.com → Provider Keys.
BYOK pricing — separate prepaid balance
BYOK traffic uses its own prepaid balance, distinct from your managed credits. Top it up from the dashboard's BYOK Balance tab or via POST /v1/billing/byok/checkout (minimum $5, no top-up fee). Each BYOK-routed request deducts a small per-request fee:
- 2% of the retail provider cost (standard rate).
- 1% during the launch window (first 60 days).
- When the BYOK balance hits $0, BYOK-routed requests return 402 byok_balance_empty; top up to resume. Managed credits are not touched.
- You'll receive one low-balance email when your BYOK balance drops below 20% of your last top-up (7-day cooldown between reminders).
curl -s -X POST https://api.gammainfra.com/v1/billing/byok/checkout \
-H "Authorization: Bearer sk-gammainfra-..." \
-H "Content-Type: application/json" \
-d '{"amount_usd": 25.0}'
Check your BYOK balance:
curl -s https://api.gammainfra.com/v1/billing/byok/balance \
-H "Authorization: Bearer sk-gammainfra-..."
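The BYOK fee arithmetic from this section can be sketched as follows. The rates come from the list above; the function name and the sample request cost are illustrative, and any rounding the billing system applies is not documented here, so none is modelled.

```python
def byok_fee_usd(provider_cost_usd, launch_window=False):
    """Routing fee deducted from the BYOK balance for one request."""
    rate = 0.01 if launch_window else 0.02  # 1% during launch window, 2% standard
    return provider_cost_usd * rate


# A request whose retail provider cost is $0.05:
standard_fee = byok_fee_usd(0.05)
launch_fee = byok_fee_usd(0.05, launch_window=True)
print(f"{standard_fee:.6f} {launch_fee:.6f}")  # prints 0.001000 0.000500
```

At the standard rate, a $25 BYOK top-up would cover roughly 25,000 such requests before the balance reaches $0.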
Prompt caching
Repeating the same system prompt, tool definitions, or conversation prefix across multiple calls is the most common source of avoidable spend. GammaInfra automatically caches these prefixes on providers that support it, so cache-hit tokens on follow-up calls cost a fraction of full input tokens.
How it works
GammaInfra detects cacheable prefixes and, where supported, injects cache_control breakpoints before the request leaves the gateway:
- Anthropic (Claude): The gateway auto-injects breakpoints on the system prompt and tools prefix when the combined token count meets the minimum threshold (1024 tokens for most models, 2048 for Claude Haiku). You do not need to add cache_control markers yourself; the gateway does it for you.
- OpenAI, DeepSeek, Google Gemini: Caching is automatic on the provider side. The gateway reads the discounted token counts from the response and bills at the provider’s actual cache-read rate.
- Amazon Bedrock: Cache read tokens are billed at the provider’s discounted rate. Bedrock auto-inject is a planned fast-follow.
- Groq, Mistral, xAI: Do not report cache tokens. Always billed at standard input rates.
Controlling cache behaviour
Send the X-GammaInfra-Cache request header to override the default:
| Value | Effect |
|---|---|
auto (default) | Gateway injects breakpoints on the system prompt and tools prefix when the prefix meets the minimum token threshold. |
aggressive | Extends caching to include recent conversation history turns, in addition to the system prompt and tools prefix. Useful for long multi-turn sessions where the conversation context is stable across many calls. |
off | Disables auto-injection for this request. Use for one-off requests where you do not want to pay the cache-write premium. Provider-side automatic caching (OpenAI, DeepSeek, Gemini) is not affected. |
| manual (implicit) | If GammaInfra detects that you have already added cache_control markers to your messages, it backs off and preserves your markers unchanged. The response will show X-GammaInfra-Cache-Mode: manual. |
Unknown values for X-GammaInfra-Cache are silently ignored and fall back to the default — the header never returns a 400 error.
Verifying cache hits
Check the response headers on any call:
X-GammaInfra-Cache-Mode: auto
X-GammaInfra-Cache-Read-Tokens: 4096
X-GammaInfra-Cache-Write-Tokens: 512
X-GammaInfra-Cost-USD: 0.000041
X-GammaInfra-Cache-Read-Tokens is the number of input tokens served from cache on this call. X-GammaInfra-Cache-Write-Tokens is the number written to cache (priming cost). Both are absent when zero. X-GammaInfra-Cost-USD is always the all-in cost, inclusive of any cache-read discounts and write premiums.
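Because the token headers are absent when zero, client code needs a default when reading them. A minimal sketch, with a hypothetical helper name and the sample header values shown above:

```python
def cache_stats(headers):
    """Read the cache headers, treating absent token counts as zero."""
    h = {k.lower(): v for k, v in headers.items()}
    return {
        "mode": h.get("x-gammainfra-cache-mode"),
        "read_tokens": int(h.get("x-gammainfra-cache-read-tokens", 0)),
        "write_tokens": int(h.get("x-gammainfra-cache-write-tokens", 0)),
    }


warm = cache_stats({
    "X-GammaInfra-Cache-Mode": "auto",
    "X-GammaInfra-Cache-Read-Tokens": "4096",
    "X-GammaInfra-Cache-Write-Tokens": "512",
})
cold = cache_stats({"X-GammaInfra-Cache-Mode": "off"})
print(warm, cold)
```

A non-zero `read_tokens` on a follow-up call is the simplest signal that the prefix was served from cache.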
When caching saves money
Caching wins when the same prefix is reused across at least two calls. On the first call GammaInfra writes the prefix to cache (small premium); on subsequent calls those tokens are served at the provider’s cache-read rate (significant discount). The break-even point is reached quickly for system prompts longer than ~1k tokens that are reused across many requests.
For truly one-off requests where the prefix will never repeat, set X-GammaInfra-Cache: off to skip the write cost.
Error codes
Error responses use a consistent JSON shape:
{
  "error": {
    "message": "Human-readable description",
    "type": "error_type",
    "code": "machine_readable_code",
    "request_id": "uuid"
  }
}
| Status | Code | Meaning |
|---|---|---|
400 | web_plugin_unsupported | Request used the :online model suffix. GammaInfra doesn't ship web search yet — drop the suffix. |
400 | free_tier_unavailable | Request used the :free model suffix. GammaInfra has no free tier; new accounts get $3.00 of free balance on signup. |
400 | provider_excluded | You pinned a specific provider/model while your provider.only/provider.ignore filter excluded that provider. Drop the pin or adjust the filter. |
401 | invalid_api_key | Missing or invalid API key. |
402 | insufficient_credits | Managed balance can’t cover the request. Top up from the dashboard. |
402 | byok_balance_empty | BYOK prepaid balance exhausted — top up to resume. |
404 | model_not_found | Unknown provider/model. Check GET /v1/models for the live catalogue. |
404 | generation_not_found | GET /v1/generation?id= couldn't find a record for that request_id, or it belongs to a different customer. |
422 | — | Invalid request body (pydantic validation). |
429 | rate_limit_exceeded | You hit the 240 req/min per-API-key cap. Respect Retry-After. |
429 | — | Provider-side rate limit passed through — respect Retry-After. |
501 | web_plugin_unsupported | Request body carried plugins:[{id:"web"}]. Remove the plugin. |
503 | providers_down | All providers in the fallback chain failed. |
Include the X-GammaInfra-Request-Id from the response headers if you file a support ticket.
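One way to handle the error shape above in client code is sketched below. The helper name is hypothetical, and treating 429 and 503 as retryable is a client-side policy assumed here, not something this page prescribes.

```python
import json

# Assumed retry policy: rate limits and provider outages may succeed later.
RETRYABLE_STATUSES = {429, 503}


def parse_error(status, body_text):
    """Extract code, message, and request_id from an error response body."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = {}
    err = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(err, dict):
        err = {}
    return {
        "status": status,
        "code": err.get("code"),  # may be None, e.g. for 422 validation errors
        "message": err.get("message"),
        "request_id": err.get("request_id"),
        "retryable": status in RETRYABLE_STATUSES,
    }


sample_body = json.dumps({"error": {
    "message": "All providers in the fallback chain failed.",
    "type": "error_type",
    "code": "providers_down",
    "request_id": "uuid",
}})
info = parse_error(503, sample_body)
print(info["code"], info["retryable"])  # prints providers_down True
```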
Rate limits
- Per-API-key cap: 240 requests per minute on /v1/chat/completions and /v1/completions (and their /api/v1/* aliases). Sliding 60-second window. Static endpoints (/v1/models, /v1/status, /health, /ready) don't count.
- Every response carries X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset (Unix seconds). When you hit the cap you get 429 with a Retry-After header in seconds.
- Provider-side rate limits also pass through as 429 with any Retry-After the provider returned.
- If you expect sustained traffic above the per-key cap, email hello@gammainfra.com; we raise caps for steady-state volume on request.
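The rate-limit headers described above are enough to compute a back-off delay after a 429. A minimal sketch: the helper name and the one-second fallback are assumptions, and Retry-After is read as a number of seconds, as this page describes.

```python
import time


def seconds_until_retry(headers, now=None, default=1.0):
    """Pick a wait time from Retry-After, else X-RateLimit-Reset, else a default."""
    h = {k.lower(): v for k, v in headers.items()}
    retry_after = h.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # fall through to the reset timestamp
    reset = h.get("x-ratelimit-reset")
    if reset is not None:
        try:
            current = time.time() if now is None else now
            return max(0.0, float(reset) - current)  # reset is in Unix seconds
        except ValueError:
            pass
    return default


print(seconds_until_retry({"Retry-After": "7"}))  # prints 7.0
```

A caller would sleep for the returned number of seconds before retrying the request.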
Status
Live per-provider uptime, latency, and error counts are published:
- status.gammainfra.com: human-readable HTML dashboard, auto-refreshes every 30 s
- GET /v1/status: same data as JSON, safe to poll from your own monitoring
Both endpoints are public (no auth). Each provider is marked operational, degraded, or outage based on the rolling 24 h request log plus a live health-check ping.
Support
Email support@gammainfra.com, or join our Discord and open a ticket in #support. Include the X-GammaInfra-Request-Id response header from any failing request — it lets us trace the exact path the request took through the router.
For policy and billing terms, see Terms and Privacy.
FAQ
Common developer questions about the API. For the conceptual overview see the FAQ on the landing page.
What is the GammaInfra API base URL?
https://api.gammainfra.com/v1. The legacy apex https://gammainfra.com/v1/* is preserved as a back-compat alias. Authentication is Authorization: Bearer sk-gammainfra-<your-key>. Both /v1/* and /api/v1/* prefixes are mounted with identical responses.
How do I see the cost of each request?
Read the cost headers on the response: X-GammaInfra-Cost-USD (total), X-GammaInfra-Input-Cost-USD, and X-GammaInfra-Output-Cost-USD. Sum the totals across a session to know exactly what your workload cost. The X-GammaInfra-Endpoint header tells you which provider/model served the request.
What happens when an upstream provider rate-limits a request?
The router cascades to the next leg of the fallback chain, and every leg it attempted is listed in the X-GammaInfra-Fallback-Chain response header. For strict-provider behavior, set provider.only in your request body or pass X-GammaInfra-Routing: literal to constrain the chain.
Can I pin a specific model instead of using smart routing?
Yes. Prefix the model with its provider slug, e.g. openai/gpt-5-mini, anthropic/claude-opus-4-7, or bedrock/us.anthropic.claude-sonnet-4-6. Bare logical names (e.g., claude-opus-4-7) resolve through the registry; the router picks native vs Bedrock based on live p50 latency.
How do I enforce a max latency budget per request?
Send X-GammaInfra-Max-Latency-Ms: <ms> on the request (range 60 to 600 000). On timeout, the upstream call is cancelled and a 504 max_latency_exceeded response is returned. Malformed values are silently dropped; the header never causes a 400 error.
How does BYOK pricing differ from managed?
BYOK charges 1% of cost_usd per request during the launch window (2% standard) against a separate prepaid BYOK balance. Managed uses GammaInfra's negotiated provider rates with 0% token markup plus a 3% (launch) / 5% (standard) fee at top-up time. When the BYOK balance hits $0, requests return 402 byok_balance_empty and never silently fall back to the managed balance.