GammaInfra API documentation

Q: What is the GammaInfra API base URL?

https://api.gammainfra.com/v1. The legacy apex https://gammainfra.com/v1/* is preserved as a back-compat alias. Authentication is Authorization: Bearer sk-gammainfra- . Both /v1/* and /api/v1/* prefixes are mounted with identical responses.

Q: How do I see the cost of each request?

Every successful response carries three USD cost headers: X-GammaInfra-Cost-USD (total), X-GammaInfra-Input-Cost-USD, and X-GammaInfra-Output-Cost-USD. Sum the totals across a session to know exactly what your workload cost. The X-GammaInfra-Endpoint header tells you which provider/model served the request.

Q: Can I pin a specific model instead of using smart routing?

Yes — use a host-prefixed model name like openai/gpt-5-mini, anthropic/claude-opus-4-7, or bedrock/us.anthropic.claude-sonnet-4-6. Bare logical names (e.g., claude-opus-4-7) resolve through the registry — the router picks native vs Bedrock based on live p50 latency.

Q: How do I enforce a max latency budget per request?

Set X-GammaInfra-Max-Latency-Ms: on the request (range 60 to 600000). On timeout, the upstream call is cancelled and a 504 max_latency_exceeded response is returned. Malformed values are silently dropped — the header never causes a 400 error.

Last updated: 20 May 2026

On this page

Overview
Quickstart
Authentication
Smart routing
Model names
Streaming
Headers
Balance & pricing
Top up
BYOK
Error codes
Rate limits
Status
Support
Prompt caching
FAQ

Overview

GammaInfra is an intelligent LLM routing engine. The router classifies every prompt by task and dispatches to the best-fit model across every major LLM provider — delivered through one API.

If you already use OpenAI or any OpenAI-compatible SDK, switching to GammaInfra is a one-line change — update the base_url, keep everything else the same.

Base URL: https://gammainfra.com
Dashboard: dashboard.gammainfra.com · Status: status.gammainfra.com · Sign up: gammainfra.com/#signup

Quickstart

1. Get an API key

Sign up at gammainfra.com — email + password, one-time verification link, no credit card. New accounts come with $3.00 of free credit, enough to try the router end-to-end. Your API key is shown once after you click the verification link — store it somewhere safe. Need more keys later? Issue and revoke them from the dashboard.

2. Make your first call

curl -s -X POST https://api.gammainfra.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-gammainfra-..." \
  -d '{
    "model": "gammainfra/auto",
    "messages": [{"role": "user", "content": "Explain transformers in one paragraph."}]
  }'

The response is identical to OpenAI’s format. gammainfra/auto lets the router pick the best model for your prompt.

3. Drop-in replacement

from openai import OpenAI

client = OpenAI(
    api_key="sk-gammainfra-...",
    base_url="https://api.gammainfra.com/v1",
)

response = client.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

The same pattern works with LangChain, LlamaIndex, and any other OpenAI-compatible library.

Ecosystem compatibility

If your code targets an OpenAI-compatible router, the migration is usually two strings — base_url and api_key. Both /v1/* and /api/v1/* prefixes are mounted with identical responses, so SDKs that hard-code either path keep working unchanged.

from openai import OpenAI
client = OpenAI(
    api_key="sk-gammainfra-...",
    base_url="https://gammainfra.com/api/v1",
)

Body fields

GammaInfra accepts the common ecosystem-extension request fields. The ones below change behavior; everything else is accepted silently for forward compatibility.

Field	Behavior
`models: [str, ...]`	Honored. Becomes the authoritative fallback chain — tried in order, no auto-router. Fails loud (503) on exhaustion rather than silently picking another model.
`provider.sort`	Honored. `"price"` → cost-optimized routing, `"throughput"`/`"latency"` → latency-optimized.
`provider.only` / `.ignore`	Honored. Filter the candidate provider set.
`provider.order`	Honored. Listed providers tried first; rest keep their relative order.
`provider.allow_fallbacks: false`	Honored. Returns 503 on the first provider failure instead of trying the next candidate.
`provider.max_price`	Honored. `{prompt, completion}` in USD per 1M tokens. Endpoints exceeding either cap are skipped.
`reasoning: {effort, ...}`	Honored. `effort` translates to `reasoning_effort` for the GPT-5 family. Other providers drop it harmlessly.
`stream_options.include_usage`	Honored on OpenAI; forwarded best-effort elsewhere.
`tool_choice`, `response_format`, `parallel_tool_calls`, `seed`, `top_k`, `min_p`, `top_a`, `repetition_penalty`, `logprobs`, `top_logprobs`, `user`	Forwarded to providers that support them.
`transforms`, `route`	Accepted silently (we always cascade through the fallback chain on failure).
`plugins: [{id: "web"}]`	501 `web_plugin_unsupported`. The `:online` model variant is also rejected (400).

Model name aliases

GammaInfra uses the conventional vendor/slug format and accepts the common ecosystem aliases directly:

Input	Behavior
Third-party router `vendor/auto` aliases	Mapped to `gammainfra/auto` for migration compatibility.
`vendor/model:nitro`	Suffix stripped, preference forced to latency.
`vendor/model:floor`	Suffix stripped, preference forced to cost.
`vendor/model:exacto`	Suffix stripped (GammaInfra quality-sorts by default).
`vendor/model:online`	400 `web_plugin_unsupported`.
`vendor/model:free`	400 `free_tier_unavailable`. New accounts get $3.00 of free balance on signup — see Balance & pricing.
`meta-llama/llama-3.3-70b-instruct`	Routed via Groq (`groq/llama-3.3-70b-versatile`).
`meta-llama/llama-3.1-8b-instruct`	Routed via Groq (`groq/llama-3.1-8b-instant`).

Unknown :suffix variants are stripped silently for forward-compat. For the full catalogue of native model IDs, see Model names below.

Compatibility endpoints

The auxiliary endpoints SDKs commonly call are mounted under both prefixes and return ecosystem-compatible JSON.

Endpoint	Returns
`GET /api/v1/credits`	`{data: {total_credits, total_usage}}` — lifetime top-ups and lifetime spend in USD.
`GET /api/v1/generation?id=<request_id>`	Post-hoc stats for a previous request: tokens, cost, latency. Use the `X-GammaInfra-Request-Id` response header value as the `id`. Customer-scoped (no cross-account lookups).
`GET /api/v1/key`	Info about the calling key — label, name, lifetime usage. Per-key spend limits return `null` (not yet supported).
`GET /api/v1/models/{author}/{slug}/endpoints`	Endpoint listing for a single model. GammaInfra routes each model through one provider, so the array has one entry.
`POST /api/v1/completions`	Legacy text completion. Internally wraps your prompt as a chat message and rewrites the response to the `text_completion` shape (`choices[].text` instead of `choices[].message.content`). Streaming supported.
`GET /api/v1/models`	Catalogue with both GammaInfra-native fields (`input_cost_per_1k`, etc.) and ecosystem-shaped fields (`pricing.{prompt,completion}` per-token strings, `context_length`, `architecture`, `supported_parameters`, `top_provider`).

Headers

Send HTTP-Referer and X-Title on every request — GammaInfra stores them with each request log so per-app analytics work consistently. Both are best-effort and entirely optional.

What's not supported

Web search (plugins:[{id:"web"}] and the :online variant) — not implemented. We reject explicitly so your prompts don't silently degrade.
Free tier (:free) — we don't offer one. Every account starts with $3.00 of free balance; top up from /billing/checkout.
Per-key spend limits (limit, limit_remaining, limit_reset) — not yet implemented; /key returns null for these fields.
OAuth/PKCE key exchange — sign in via the dashboard and copy a key.

Authentication

Every authenticated request carries a Bearer token. The public endpoints are /v1/models, /v1/status, /health, /ready, and the signup/login routes; everything else needs a valid key.

Authorization: Bearer sk-gammainfra-...

Keys are prefixed sk-gammainfra-. The plaintext is only returned on creation — GammaInfra stores a bcrypt hash. Create additional keys or revoke old ones from the dashboard.

Status	Meaning
`401`	Missing or invalid API key
`402`	Insufficient credits — top up and retry

Smart routing

Send model: "gammainfra/auto" and the router classifies your prompt into one of 8 task labels (plus 2 deterministic capability flags), then dispatches to the best-fit model for that type. If the primary model is unavailable, GammaInfra falls back through a chain of 3–4 models automatically.

Capability flags are decided up-front from the request body, never from prompt text:

Flag	When it fires
`tool_use`	Request body has a non-empty `tools` array
`multimodal`	Any message contains an `image_url` part

For everything else the prompt text is classified into one of these 8 labels:

Task label	When it fires
`reasoning`	Multi-step analysis, math, root-cause questions (e.g. analyse, compare, evaluate, prove, probability, root cause, step-by-step)
`code`	Code generation, debugging, refactoring (e.g. function, implement, debug, regex, unit test)
`creative`	Original generative writing (e.g. poem, story, essay, brainstorm, lyrics)
`rewrite`	Edit or transform existing text while preserving meaning
`extraction`	Pull structured fields out of text (e.g. sentence-start extract, parse, list all, classify, format as json)
`summarize`	Compress text (e.g. sentence-start summarize, tldr, key points, brief, condense)
`translation`	Cross-language conversion (e.g. sentence-start translate, in/to spanish/french/japanese/…)
`chat`	Default when nothing else matches

Prefer a trade-off? Send X-GammaInfra-Preference: quality (default), cost, or latency to bias the router.
Want finer control? Send a continuous X-GammaInfra-Cost-Quality: 0.0 (pure quality) … 1.0 (pure cost) header and GammaInfra will place you on that axis. The server echoes X-GammaInfra-Cost-Quality-Applied on the response so you can log exactly what landed. An explicit X-GammaInfra-Preference: latency always wins over the cost-quality dial.
Want to opt out? Send X-GammaInfra-Routing: off and GammaInfra will route straight to the exact model you named in model.

Model names

Smart aliases (recommended)

Model name	Behaviour
`gammainfra/auto`	Picks the best-fit model for your prompt type
`gammainfra/fast`	Optimises for lowest latency (equivalent to `X-GammaInfra-Preference: latency`)
`gammainfra/cheap`	Optimises for lowest cost (equivalent to `X-GammaInfra-Preference: cost`)

Bare model names (logical)

Type a bare model name and GammaInfra's router picks the best endpoint that serves it. Useful when the same model is hosted by more than one provider (e.g. Claude Opus is reachable via the native Anthropic API and Amazon Bedrock).

claude-opus-4-7
claude-opus-4-6
claude-sonnet-4-6
claude-haiku-4-5
nova-pro
nova-2-lite
gpt-5-mini
gpt-5.4-mini
deepseek-v4-pro
mistral-large-2512
llama-3.3-70b-versatile
gemini-3.1-pro-preview
grok-4-1-fast-non-reasoning

Bare names that aren't in the registry return 404 model_not_found (no silent fallback). Use X-GammaInfra-Routing: literal to disable cross-host routing and pin the first registered endpoint instead.

Pin a specific model

Prefix any model with its provider slug:

openai/gpt-5.4
openai/gpt-5.4-mini
openai/gpt-5.4-nano
openai/gpt-5-mini
anthropic/claude-opus-4-6
anthropic/claude-sonnet-4-6
anthropic/claude-haiku-4-5
google/gemini-3.1-pro-preview
google/gemini-3-flash-preview
google/gemini-2.5-pro
google/gemini-2.5-flash
mistral/mistral-large-2512
mistral/mistral-small-2603
mistral/codestral-2508
mistral/devstral-2512
groq/llama-3.3-70b-versatile
groq/llama-3.1-8b-instant
groq/qwen/qwen3-32b
deepseek/deepseek-v4-pro
deepseek/deepseek-v4-flash
# Legacy V3 slugs — still routable via direct pin, retire 2026-07-24:
# deepseek/deepseek-chat, deepseek/deepseek-reasoner
grok/grok-4.20-0309-reasoning
grok/grok-4-1-fast-reasoning
grok/grok-4-1-fast-non-reasoning
bedrock/us.anthropic.claude-opus-4-7
bedrock/us.anthropic.claude-opus-4-6-v1
bedrock/us.anthropic.claude-sonnet-4-6
bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
bedrock/meta.llama3-70b-instruct-v1:0
bedrock/mistral.mistral-large-2402-v1:0
bedrock/us.amazon.nova-pro-v1:0
bedrock/us.amazon.nova-2-lite-v1:0

Note on Bedrock IDs: Most Bedrock models require the us. cross-region inference profile prefix (Anthropic Claude, Amazon Nova). A few older models (Meta Llama 3, Mistral Large 24.02) use the bare ID without the prefix. The exact strings above are what AWS Bedrock accepts; copy verbatim. Bedrock's catalog of newer Meta and Mistral models lags behind these providers' direct APIs — for the latest Llama and Mistral, use groq/llama-3.3-70b-versatile or mistral/mistral-large-2512 respectively.

For the full, authoritative list:

curl -s https://api.gammainfra.com/v1/models | jq .

Streaming

Streaming works exactly like OpenAI — set stream: true and read Server-Sent Events. All providers are normalised to the OpenAI SSE format, so your existing code works unchanged.

stream = client.chat.completions.create(
    model="gammainfra/auto",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Headers

Request headers

Header	Value	Purpose
`Authorization`	`Bearer sk-gammainfra-…`	Required
`Content-Type`	`application/json`	Required
`X-GammaInfra-Routing`	`off`	Disable smart routing; use the exact model you named
`X-GammaInfra-Routing`	`literal`	Disable logical resolution. Bare names take the first registered endpoint instead of the router's preference-based pick. Independent of `off`.
`X-GammaInfra-Region`	`us` / `eu` / `apac` or exact AWS region (e.g. `us-east-1`)	Constrain endpoint selection to a region group or exact region. Native APIs (region-agnostic) always pass. Combine with `provider.only: ["bedrock"]` for strict-residency mode.
`X-GammaInfra-Preference`	`quality` (default) / `cost` / `latency`	Bias the router when using `gammainfra/auto`
`X-GammaInfra-Cost-Quality`	Decimal in `0.0`…`1.0`	Continuous cost/quality dial. `0.0` = pure quality, `1.0` = pure cost. Overrides `X-GammaInfra-Preference: quality`/`cost`. An explicit `latency` preference still wins. Malformed values are ignored and the legacy preset applies.
`X-GammaInfra-Max-Latency-Ms`	Integer ms in `60`…`600000` (10 minutes)	Bound total wall time across the fallback chain. Set to your hard deadline (e.g. `5000` for a 5-second SLA). On exceedance GammaInfra cancels any in-flight upstream call and returns `504 max_latency_exceeded`. Strictly opt-in — absent header preserves prior behavior (per-provider 30 s default). Malformed or out-of-range values are ignored.

Response headers

Header	Meaning
`X-GammaInfra-Request-Id`	Correlation ID — include when filing a support request
`X-GammaInfra-Provider`	Which provider served the response (e.g. `openai`, `anthropic`)
`X-GammaInfra-Router-Version`	Which routing path served the request. Values: `v2` (default smart router), `v2_keyword` (sentence-start keyword shortcut), `v2_flag` (capability short-circuit for multimodal or tools), `v2_short_prompt` (length-based fast-path for trivial prompts), `v2_hedged` (parallel top-2 race for `gammainfra/fast`), `v2_logical` (cross-host logical-name routing), `v1_fallback` (low-confidence fall-through to the keyword router), `direct` (you pinned a specific `provider/model`), `logical_literal` (you opted into `X-GammaInfra-Routing: literal`), `models_override` (you supplied a `models[]` fallback list).
`X-GammaInfra-Logical-Model`	The router's label for the prompt (e.g. `reasoning`, `code`, `chat`) or the logical model name (e.g. `claude-opus-4-7`) when bare-name / vendor-prefix routing fired. Use it to correlate your cost analytics with the type of work.
`X-GammaInfra-Endpoint`	The actual physical endpoint that served the request, formatted as `provider/model` (e.g. `bedrock/us.anthropic.claude-opus-4-7`). Always present on successful chat completions.
`X-GammaInfra-Region-Used`	The region the request was served from (e.g. `us-east-1`). Present on routes that went through a regional endpoint; absent for native APIs which are region-agnostic.
`X-GammaInfra-Flags`	Capability flags fired up-front, comma-separated (e.g. `tool_use`, `multimodal`). Absent when no flag fired.
`X-GammaInfra-Cost-USD`	Per-request cost in USD with 6 decimals (e.g. `0.000087`). Present on every successful chat completion. Sum it across calls to get exact spend without parsing `usage` × per-model price tables. Reflects provider list price — GammaInfra adds 0% token markup.
`X-GammaInfra-Input-Cost-USD` / `X-GammaInfra-Output-Cost-USD`	Per-direction cost split for the same request, both in USD with 6 decimals. `input + output = X-GammaInfra-Cost-USD` within rounding. Useful for chargeback / per-team attribution against provider-side dashboards (Bedrock CloudWatch, OpenAI usage page) without a client-side recompute.
`X-GammaInfra-Cost-Quality-Applied`	Present whenever the cost/quality dial drove the routing decision. Value is the parsed float (e.g. `0.800`) so you can log and replay the decision.
`X-GammaInfra-Fallback-Chain`	Comma-separated `provider/model` list actually attempted on this request, in order. A single entry means one leg was tried; multiple entries mean GammaInfra cascaded after a failure. Useful for post-mortems.
`X-GammaInfra-Attempted-Count`	Integer count of legs attempted — matches `len(X-GammaInfra-Fallback-Chain.split(","))`. Use when you need to detect whether a fallback occurred without parsing the chain header.
`X-GammaInfra-Fallback-Reason`	Why the chain walked past the first pick (e.g. `provider_error`, `low_confidence`, `flag_chain`, `short_prompt_chat`, `v2_keyword`, `models_override`).
`X-GammaInfra-Rust-Version`	Version of the gateway's native fast-path components (e.g. `0.1.0`). Useful for support correlation; treat as opaque.
`X-RateLimit-Limit` / `X-RateLimit-Remaining` / `X-RateLimit-Reset`	Standard sliding-window rate-limit signals, keyed per API key. Default limit is 240 requests/minute.
`X-GammaInfra-Cache-Mode`	Which cache mode was applied to this request. Values: `auto` (gateway injected breakpoints on the system+tools prefix), `aggressive` (also cached conversation history turns), `manual` (you supplied your own `cache_control` markers; gateway did not add more), `off` (caching disabled for this request). Always present on successful chat completions.
`X-GammaInfra-Cache-Read-Tokens`	Number of tokens served from cache on this request. Only emitted when non-zero. Combined with `X-GammaInfra-Cache-Write-Tokens` and `X-GammaInfra-Cost-USD`, lets you verify caching is working and compute your actual effective rate without parsing provider-specific usage fields.
`X-GammaInfra-Cache-Write-Tokens`	Number of tokens written to cache on this request (cache-priming cost). Only emitted when non-zero. Cache writes carry a small premium over standard input pricing for Anthropic and Bedrock; your `X-GammaInfra-Cost-USD` header reflects the all-in cost including that premium.

Preference precedence

You can express routing preference through five different channels — header, body, model suffix, or model shortcut. When more than one is set, the most-specific wins:

X-GammaInfra-Preference: latency — always wins (latency is orthogonal to cost/quality).
Body provider.sort (ecosystem-compat) — price → cost, throughput/latency → latency.
X-GammaInfra-Cost-Quality header — continuous 0.0…1.0 dial.
X-GammaInfra-Preference: quality | cost header — legacy preset.
Model-slug variant (:nitro → latency, :floor → cost) or the gammainfra/fast / gammainfra/cheap shortcuts.
Default: quality.

Balance & pricing

Your balance is denominated in USD — no in-platform credits unit, no conversion math. The dashboard, ledger, and GET /v1/billing/balance all return dollars (e.g. 2.4738 means $2.4738 of remaining balance).
New accounts start with $3.00 of free balance.
0% markup on tokens — per-request provider cost is passed through at list prices.
5% fee on managed top-ups (3% launch-window rate during the first 60 days).
Minimum top-up: $10. No subscription. Balance never expires.

Approximate cost per 1M tokens

Model	Input	Output
`gammainfra/auto` (routed to the best model for each prompt)	varies per task type — see rows below	varies per task type — see rows below
`openai/gpt-5.5`	$5.00	$30.00
`openai/gpt-5.4`	$2.00	$8.00
`openai/gpt-5.4-mini`	$0.40	$1.60
`openai/gpt-5-mini`	$0.25	$2.00
`anthropic/claude-opus-4-7`	$5.00	$25.00
`anthropic/claude-sonnet-4-6`	$3.00	$15.00
`google/gemini-3.1-pro-preview`	$1.25	$5.00
`google/gemini-3-flash-preview`	$0.30	$2.50
`deepseek/deepseek-v4-pro`	$1.74	$3.48
`deepseek/deepseek-v4-flash`	$0.14	$0.28
`groq/llama-3.1-8b-instant`	$0.06	$0.08

Costs above are provider list prices — GammaInfra passes them straight through. The only GammaInfra fee is the 5% we charge when you top up credits (3% during the launch window). For the full cost table and legal terms, see the Terms of Service.

Reasoning tokens on gpt-5 and DeepSeek V4

OpenAI’s gpt-5 family and DeepSeek’s V4 reasoner (deepseek-v4-pro) bill hidden “reasoning tokens” in addition to the visible output. Reasoning tokens are the model’s chain-of-thought and are not returned in the response but are counted in usage.completion_tokens.

GammaInfra silently caps gpt-5 reasoning at max_completion_tokens=2048 when the caller omits the parameter, and picks a conservative reasoning_effort based on the router’s logical label (chat → low, code/summarize → medium, reasoning/math → high). This prevents the “‘hi’ burned 320 reasoning tokens” pathology from reaching your bill, but you should still budget 2–4× visible output tokens for gpt-5-family calls in batch sizing. Inspect usage.completion_tokens_details.reasoning_tokens in any response to see the split.

Prompt caching and your bill

When a repeated prefix is served from cache, the provider charges a reduced rate for those tokens. GammaInfra passes the discount straight through — you are billed at the provider’s actual cache-read rate, never the full input rate. Your X-GammaInfra-Cost-USD header reflects the all-in cost including any cache-read discounts and cache-write premiums on the same call. For per-direction detail, see X-GammaInfra-Cache-Read-Tokens and X-GammaInfra-Cache-Write-Tokens in the response headers above.

Cache writes carry a small premium over standard input pricing for Anthropic and Bedrock (the provider charges extra to prime the cache). On first use the total cost is slightly higher; on repeated calls the cache-read savings more than offset the initial write. Set X-GammaInfra-Cache: off if you are sending one-off requests and don’t want the write overhead.

Check your balance

curl -s https://api.gammainfra.com/v1/billing/balance \
  -H "Authorization: Bearer sk-gammainfra-..."

{"balance_usd": 0.97, "customer_id": "..."}

Top up

Top up your balance from dashboard.gammainfra.com → Top up. You’ll be redirected to Stripe’s hosted checkout and back to your dashboard once payment clears. Card data is handled by Stripe — GammaInfra never sees it. Amount range: $5 – $1000; your balance updates within seconds of Stripe’s confirmation.

Bring your own key (BYOK)

Optional. By default GammaInfra uses its own provider API keys on your behalf — one GammaInfra key, every model. If you already have a direct relationship with a provider, add your own key at dashboard.gammainfra.com → Provider Keys and GammaInfra will route requests to that provider through your key instead.

Your key is stored encrypted at rest (Fernet).
The provider bills you directly (on your account with them) for the tokens a BYOK-routed request consumes. GammaInfra only charges a small per-request routing fee from a separate prepaid BYOK balance — see BYOK pricing below.
Revoking or deleting your key falls back to managed routing (if the provider offers managed access) or skips that provider.
Supported providers: openai, anthropic, google, mistral, groq, deepseek, grok.

Add a key

curl -s -X POST https://api.gammainfra.com/v1/provider-keys \
  -H "Authorization: Bearer sk-gammainfra-..." \
  -H "Content-Type: application/json" \
  -d '{"provider_name": "openai", "api_key": "sk-..."}'

List your keys

curl -s https://api.gammainfra.com/v1/provider-keys \
  -H "Authorization: Bearer sk-gammainfra-..."

Delete a key

curl -s -X DELETE https://api.gammainfra.com/v1/provider-keys/openai \
  -H "Authorization: Bearer sk-gammainfra-..."

Or manage all of this from dashboard.gammainfra.com → Provider Keys.

BYOK pricing — separate prepaid balance

BYOK traffic uses its own prepaid balance, distinct from your managed credits. Top it up from the dashboard's BYOK Balance tab or via POST /v1/billing/byok/checkout (minimum $5, no top-up fee). Each BYOK-routed request deducts a small per-request fee:

2% of the retail provider cost (standard rate).
1% during the launch window (first 60 days).
When the BYOK balance hits $0, BYOK-routed requests return 402 byok_balance_empty — top up to resume. Managed credits are not touched.
You'll receive one low-balance email when your BYOK balance drops below 20% of your last top-up (7-day cooldown between reminders).

curl -s -X POST https://api.gammainfra.com/v1/billing/byok/checkout \
  -H "Authorization: Bearer sk-gammainfra-..." \
  -H "Content-Type: application/json" \
  -d '{"amount_usd": 25.0}'

Check your BYOK balance:

curl -s https://api.gammainfra.com/v1/billing/byok/balance \
  -H "Authorization: Bearer sk-gammainfra-..."

Prompt caching

Repeating the same system prompt, tool definitions, or conversation prefix across multiple calls is the most common source of avoidable spend. GammaInfra automatically caches these prefixes on providers that support it, so cache-hit tokens on follow-up calls cost a fraction of full input tokens.

How it works

GammaInfra detects cacheable prefixes and, where supported, injects cache_control breakpoints before the request leaves the gateway:

Anthropic (Claude): Gateway auto-injects breakpoints on the system prompt and tools prefix when the combined token count meets the minimum threshold (1024 tokens for most models, 2048 for Claude Haiku). You do not need to add cache_control markers yourself — the gateway does it for you.
OpenAI, DeepSeek, Google Gemini: Caching is automatic on the provider side. The gateway reads the discounted token counts from the response and bills at the provider’s actual cache-read rate.
Amazon Bedrock: Cache read tokens are billed at the provider’s discounted rate. Bedrock auto-inject is a planned fast-follow.
Groq, Mistral, xAI: Do not report cache tokens. Always billed at standard input rates.

Controlling cache behaviour

Send the X-GammaInfra-Cache request header to override the default:

Value	Effect
`auto` (default)	Gateway injects breakpoints on the system prompt and tools prefix when the prefix meets the minimum token threshold.
`aggressive`	Extends caching to include recent conversation history turns, in addition to the system prompt and tools prefix. Useful for long multi-turn sessions where the conversation context is stable across many calls.
`off`	Disables auto-injection for this request. Use for one-off requests where you do not want to pay the cache-write premium. Provider-side automatic caching (OpenAI, DeepSeek, Gemini) is not affected.
manual (implicit)	If GammaInfra detects that you have already added `cache_control` markers to your messages, it backs off and preserves your markers unchanged. The response will show `X-GammaInfra-Cache-Mode: manual`.

Unknown values for X-GammaInfra-Cache are silently ignored and fall back to the default — the header never returns a 400 error.

Verifying cache hits

Check the response headers on any call:

X-GammaInfra-Cache-Mode: auto
X-GammaInfra-Cache-Read-Tokens: 4096
X-GammaInfra-Cache-Write-Tokens: 512
X-GammaInfra-Cost-USD: 0.000041

X-GammaInfra-Cache-Read-Tokens is the number of input tokens served from cache on this call. X-GammaInfra-Cache-Write-Tokens is the number written to cache (priming cost). Both are absent when zero. X-GammaInfra-Cost-USD is always the all-in cost, inclusive of any cache-read discounts and write premiums.

When caching saves money

Caching wins when the same prefix is reused across at least two calls. On the first call GammaInfra writes the prefix to cache (small premium); on subsequent calls those tokens are served at the provider’s cache-read rate (significant discount). The break-even point is reached quickly for system prompts longer than ~1k tokens that are reused across many requests.

For truly one-off requests where the prefix will never repeat, set X-GammaInfra-Cache: off to skip the write cost.

Error codes

Error responses use a consistent JSON shape:

{
  "error": {
    "message": "Human-readable description",
    "type": "error_type",
    "code": "machine_readable_code",
    "request_id": "uuid"
  }
}

Status	Code	Meaning
`400`	`web_plugin_unsupported`	Request used the `:online` model suffix. GammaInfra doesn't ship web search yet — drop the suffix.
`400`	`free_tier_unavailable`	Request used the `:free` model suffix. GammaInfra has no free tier; new accounts get $3.00 of free balance on signup.
`400`	`provider_excluded`	You pinned a specific `provider/model` while your `provider.only`/`provider.ignore` filter excluded that provider. Drop the pin or adjust the filter.
`401`	`invalid_api_key`	Missing or invalid API key.
`402`	`insufficient_credits`	Managed balance can’t cover the request. Top up from the dashboard.
`402`	`byok_balance_empty`	BYOK prepaid balance exhausted — top up to resume.
`404`	`model_not_found`	Unknown `provider/model`. Check `GET /v1/models` for the live catalogue.
`404`	`generation_not_found`	`GET /v1/generation?id=` couldn't find a record for that `request_id`, or it belongs to a different customer.
`422`	—	Invalid request body (pydantic validation).
`429`	`rate_limit_exceeded`	You hit the 240 req/min per-API-key cap. Respect `Retry-After`.
`429`	—	Provider-side rate limit passed through — respect `Retry-After`.
`501`	`web_plugin_unsupported`	Request body carried `plugins:[{id:"web"}]`. Remove the plugin.
`503`	`providers_down`	All providers in the fallback chain failed.

Got a 503? It means every model in the fallback chain for that task type errored at the same time — usually transient. Retry with exponential backoff. Include X-GammaInfra-Request-Id from the response headers if you file a support ticket.

Rate limits

Per-API-key cap: 240 requests per minute on /v1/chat/completions and /v1/completions (and their /api/v1/* aliases). Sliding 60-second window. Static endpoints (/v1/models, /v1/status, /health, /ready) don't count.
Every response carries X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset (Unix seconds). When you hit the cap you get 429 with a Retry-After header in seconds.
Provider-side rate limits also pass through as 429 with any Retry-After the provider returned.
If you expect sustained traffic above the per-key cap, email hello@gammainfra.com — we raise caps for steady-state volume on request.

Status

Live per-provider uptime, latency, and error counts are published:

status.gammainfra.com — human-readable HTML dashboard, auto-refreshes every 30 s
GET /v1/status — same data as JSON, safe to poll from your own monitoring

Both endpoints are public (no auth). Each provider is marked operational, degraded, or outage based on the rolling 24 h request log plus a live health-check ping.

Support

Email support@gammainfra.com, or join our Discord and open a ticket in #support. Include the X-GammaInfra-Request-Id response header from any failing request — it lets us trace the exact path the request took through the router.

For policy and billing terms, see Terms and Privacy.

FAQ

Common developer questions about the API. For the conceptual overview see the FAQ on the landing page.

What is the GammaInfra API base URL?

https://api.gammainfra.com/v1. The legacy apex https://gammainfra.com/v1/* is preserved as a back-compat alias. Authentication is Authorization: Bearer sk-gammainfra-<your-key>. Both /v1/* and /api/v1/* prefixes are mounted with identical responses.

How do I see the cost of each request?

Every successful response carries three USD cost headers: X-GammaInfra-Cost-USD (total), X-GammaInfra-Input-Cost-USD, and X-GammaInfra-Output-Cost-USD. Sum the totals across a session to know exactly what your workload cost. The X-GammaInfra-Endpoint header tells you which provider/model served the request.

What happens when an upstream provider rate-limits a request?

The router cascades to the next endpoint in the task's fallback chain. The actual cascade is reported in the X-GammaInfra-Fallback-Chain response header. For strict-provider behavior, set provider.only in your request body or pass X-GammaInfra-Routing: literal to constrain the chain.

Can I pin a specific model instead of using smart routing?

Yes — use a host-prefixed model name like openai/gpt-5-mini, anthropic/claude-opus-4-7, or bedrock/us.anthropic.claude-sonnet-4-6. Bare logical names (e.g., claude-opus-4-7) resolve through the registry — the router picks native vs Bedrock based on live p50 latency.

How do I enforce a max latency budget per request?

Set X-GammaInfra-Max-Latency-Ms: <ms> on the request (range 60 to 600 000). On timeout, the upstream call is cancelled and a 504 max_latency_exceeded response is returned. Malformed values are silently dropped — the header never causes a 400 error.

How does BYOK pricing differ from managed?

BYOK uses your own provider API keys (configured in the dashboard) and bills 1% of retail cost_usd per request during the launch window (2% standard) against a separate prepaid BYOK balance. Managed uses GammaInfra's negotiated provider rates with 0% token markup plus a 3% (launch) / 5% (standard) fee at top-up time. When BYOK balance hits $0, requests return 402 byok_balance_empty — never silently fall back to the managed balance.