---
name: truefoundry-ai-gateway
description: Configures TrueFoundry AI Gateway for unified OpenAI-compatible LLM access. Covers auth (PAT/VAT), model routing, rate limiting, and budget controls.
license: MIT
compatibility: Requires Bash, curl, and access to a TrueFoundry instance
allowed-tools: Bash(*/tfy-api.sh *), Bash(curl*), Bash(python*)
---
> Routing note: For ambiguous user intents, use the shared clarification templates in [references/intent-clarification.md](references/intent-clarification.md).
# AI Gateway
Use TrueFoundry's AI Gateway to access 1000+ LLMs through a unified OpenAI-compatible API with rate limiting, budget controls, load balancing, routing, and observability.
## When to Use
Use when the user wants to access LLMs through TrueFoundry's unified OpenAI-compatible gateway, configure auth tokens (PAT/VAT), or set up rate limiting, budget controls, or load balancing across providers.
## When NOT to Use
- User wants to deploy a self-hosted model → deploying self-hosted models requires a TrueFoundry Enterprise account with a connected cluster. See https://truefoundry.com
- User wants to deploy tool servers → deploying workloads requires a TrueFoundry Enterprise account with a connected cluster. See https://truefoundry.com
- User wants to manage TrueFoundry platform credentials → prefer `status` skill; ask if the user wants another valid path
## Overview
The AI Gateway sits between your application and LLM providers:
```
Your App → AI Gateway → OpenAI / Anthropic / Azure / Self-hosted vLLM / etc.
               ↑
   Unified API + Auth + Rate Limiting + Routing + Logging
```
**Key benefits:**
- **Single endpoint** for all models (cloud + self-hosted)
- **One API key** (PAT or VAT) instead of managing per-provider keys
- **OpenAI-compatible** — works with any OpenAI SDK client
- **Rate limiting** per user, team, or application
- **Budget controls** to enforce cost limits
- **Load balancing** across model instances with fallback
- **Observability** — request logging, cost tracking, analytics
## Gateway Endpoint
The gateway base URL is your TrueFoundry platform URL + `/api/llm`:
```
{TFY_BASE_URL}/api/llm
```
Example: `https://your-org.truefoundry.cloud/api/llm`
## Authentication
### Personal Access Token (PAT)
For development and individual use:
1. Go to TrueFoundry dashboard → **Access** → **Personal Access Tokens**
2. Click **New Personal Access Token**
3. Copy the token
### Virtual Access Token (VAT)
For production applications (recommended):
1. Go to TrueFoundry dashboard → **Access** → **Virtual Account Tokens**
2. Click **New Virtual Account** (requires admin privileges)
3. Name it and **select which models** it can access
4. Copy the token
**VATs are recommended for production** because:
- Not tied to a specific user (survives team changes)
- Support granular model access control
- Better for tracking per-application usage
## Calling Models
### Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    api_key="<your-pat-or-vat>",
    base_url="https://your-org.truefoundry.cloud/api/llm",
)

# Chat completion
response = client.chat.completions.create(
    model="openai/gpt-4o",  # or any configured model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```
### Python (Streaming)
```python
stream = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about AI"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### cURL
```bash
curl "${TFY_BASE_URL}/api/llm/chat/completions" \
-H "Authorization: Bearer ${TFY_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 200
}'
```
### JavaScript / Node.js
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "<your-pat-or-vat>",
  baseURL: "https://your-org.truefoundry.cloud/api/llm",
});

const response = await client.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Hello!" }],
});
```
### Environment Variables
Set these to use with any OpenAI-compatible library:
```bash
export OPENAI_BASE_URL="${TFY_BASE_URL}/api/llm"
export OPENAI_API_KEY="<your-pat-or-vat>"
```
Then any code using `openai.OpenAI()` without explicit parameters will use the gateway automatically.
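A minimal sketch of that pattern (assumes the two variables above are exported and that `openai/gpt-4o` is configured in your gateway):
```python
from openai import OpenAI

# Picks up OPENAI_BASE_URL and OPENAI_API_KEY from the environment,
# so the same code works against OpenAI directly or the gateway.
client = OpenAI()

response = client.chat.completions.create(
    model="openai/gpt-4o",  # any model name configured in your gateway
    messages=[{"role": "user", "content": "Ping"}],
)
print(response.choices[0].message.content)
```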
## Supported APIs
| API | Endpoint | Description |
|-----|----------|-------------|
| **Chat Completions** | `/chat/completions` | Chat with any model (streaming + non-streaming) |
| **Completions** | `/completions` | Legacy text completions |
| **Embeddings** | `/embeddings` | Text embeddings (text + list inputs) |
| **Image Generation** | `/images/generations` | Generate images |
| **Image Editing** | `/images/edits` | Edit images |
| **Audio Transcription** | `/audio/transcriptions` | Speech-to-text |
| **Audio Translation** | `/audio/translations` | Translate audio |
| **Text-to-Speech** | `/audio/speech` | Generate speech |
| **Reranking** | `/rerank` | Rerank documents |
| **Batch Processing** | `/batches` | Batch predictions |
| **Moderations** | `/moderations` | Content safety |
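For instance, embeddings go through the same client as chat; a minimal sketch, assuming a hypothetical embedding model named `openai/text-embedding-3-small` is enabled in your gateway:
```python
from openai import OpenAI

client = OpenAI()  # env-var setup as shown above

# The embeddings endpoint accepts a single string or a list of strings.
result = client.embeddings.create(
    model="openai/text-embedding-3-small",  # hypothetical; check your gateway's model list
    input=["first document", "second document"],
)
print(len(result.data), len(result.data[0].embedding))
```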
## Supported Providers
The gateway supports 25+ providers including:
| Provider | Example Model Names |
|----------|-------------------|
| OpenAI | `openai/gpt-4o`, `openai/gpt-4o-mini` |
| Anthropic | `anthropic/claude-sonnet-4-5-20250929` |
| Google Vertex | `google/gemini-2.0-flash` |
| AWS Bedrock | `bedrock/anthropic.claude-3-5-sonnet` |
| Azure OpenAI | `azure/gpt-4o` |
| Mistral | `mistral/mistral-large-latest` |
| Groq | `groq/llama-3.1-70b-versatile` |
| Cohere | `cohere/command-r-plus` |
| Together AI | `together/meta-llama/Meta-Llama-3.1-70B` |
| Self-hosted (vLLM/TGI) | `my-custom-model-name` |
**Model names depend on how they're configured in your gateway.** Check the TrueFoundry dashboard → AI Gateway → Models for exact names.
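If your gateway also serves the OpenAI-compatible model-listing endpoint (an assumption; verify against your deployment), you can enumerate the exact names programmatically:
```python
from openai import OpenAI

client = OpenAI()  # env-var setup as shown above

# Assumes the gateway implements GET /models (OpenAI-compatible listing).
for model in client.models.list():
    print(model.id)
```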
## Adding Models & Providers
Currently done through the TrueFoundry dashboard UI:
1. Go to **AI Gateway → Models**
2. Click **Add Provider Account**
3. Select provider (OpenAI, Anthropic, etc.)
4. Enter API credentials
5. Select models to enable
### Adding Self-Hosted Models (Cluster-Internal)
After deploying a self-hosted model:
1. Go to **AI Gateway → Models → Add Provider Account**
2. Select **"Self Hosted"** as the provider type
3. Enter the internal endpoint: `http://{model-name}.{namespace}.svc.cluster.local:8000`
4. The model becomes accessible through the gateway alongside cloud models
> **Security:** Only register model endpoints that you control. External or untrusted model endpoints can return manipulated responses. Use internal cluster DNS (`svc.cluster.local`) for self-hosted models. Verify provider API credentials are stored securely in TrueFoundry secrets, not hardcoded.
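Once registered, a self-hosted model is called exactly like a cloud model; a minimal sketch, assuming it was registered under the hypothetical name `my-custom-model-name`:
```python
from openai import OpenAI

client = OpenAI()  # env-var setup as shown above

# Same endpoint and auth as cloud models; only the model name differs.
response = client.chat.completions.create(
    model="my-custom-model-name",  # hypothetical registered name
    messages=[{"role": "user", "content": "Hello from the cluster!"}],
)
print(response.choices[0].message.content)
```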
### Adding External OpenAI-Compatible APIs (NVIDIA, custom providers)
For externally hosted APIs that are OpenAI-compatible (e.g. NVIDIA Cloud APIs, custom inference endpoints), use `type: provider-account/self-hosted-model` with `auth_data`:
```yaml
# gateway.yaml — External hosted API (e.g. NVIDIA Cloud)
- name: nvidia-external
  type: provider-account/self-hosted-model
  integrations:
    - name: nemotron-nano
      type: integration/model/self-hosted-model
      hosted_model_name: nvidia/nemotron-3-nano-30b-a3b
      url: "https://integrate.api.nvidia.com/v1"
      model_server: "openai-compatible"
      model_types: ["chat"]
      auth_data:
        type: bearer-auth
        bearer_token: "tfy-secret://<owner>:<secret-group>:<secret-key>"
```
And in a virtual model routing target, reference it as `"<provider-account>/<model-name>"`:
```yaml
targets:
  - model: "nvidia-external/nemotron-nano"  # "<provider-account>/<model-name>"
```
Apply with:
```bash
tfy apply -f gateway.yaml
```
> **WARNING:** `provider-account/nvidia-nim` does **not** exist in the schema — do not use it. Use `provider-account/self-hosted-model` with `auth_data` for all external OpenAI-compatible APIs (as shown above).
> **Schema source of truth:** For authoritative field names and types, read `servicefoundry-server/src/autogen/models.ts` in the platform repo. Do not guess field names from documentation alone.
## Applying Gateway Config
Gateway YAML is applied directly with `tfy apply` — no service build or Docker image involved:
```bash
# Preview changes
tfy apply -f gateway.yaml --dry-run --show-diff
# Apply
tfy apply -f gateway.yaml
```
**Do NOT delegate gateway applies to a deployment skill.** Gateway configs (`type: gateway-*`, `type: provider-account/*`) are applied inline with `tfy apply`.
**Test after apply:**
```bash
# Quick smoke test via curl
curl "${TFY_BASE_URL}/api/llm/chat/completions" \
-H "Authorization: Bearer ${TFY_API_KEY}" \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia-external/nemotron-nano",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50
}'
```
Or via Python:
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TFY_API_KEY"],
    base_url=f"{os.environ['TFY_BASE_URL']}/api/llm",
)
resp = client.chat.completions.create(
    model="nvidia-external/nemotron-nano",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```
> **Note:** One-off gateway config applies should use `tfy apply` directly. For CI/CD pipelines, integrate `tfy apply` into your existing automation.
## Virtual Models & Load Balancing
Virtual models route requests across multiple model instances using a `gateway-load-balancing-config` manifest. Targets reference real catalog models as `"<provider-account>/<model-name>"`.
### Weight-Based Routing
```yaml
name: chat-routing
type: gateway-load-balancing-config
rules:
  - id: weighted-chat
    type: weight-based-routing
    when:
      subjects: ["*"]
      models: ["openai/gpt-4o"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        weight: 70
        fallback_candidate: true
        retry_config:
          delay: 100
          attempts: 1
          on_status_codes: ["429", "500", "502", "503"]
      - target: "azure-backup/gpt-4o"
        weight: 30
        fallback_candidate: true
        retry_config:
          delay: 100
          attempts: 1
          on_status_codes: ["429", "500", "502", "503"]
```
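Client code does not change when this rule is active: requests still name `openai/gpt-4o`, and the gateway resolves the rule to a concrete target. A minimal sketch:
```python
from openai import OpenAI

client = OpenAI()  # env-var setup as shown above

# The gateway routes this to openai-main/gpt-4o (70%) or
# azure-backup/gpt-4o (30%), with retries and fallback per the rule.
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```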
### Latency-Based Routing
Automatically routes to the lowest-latency target, measured as time per output token over the last 20 minutes:
```yaml
rules:
  - id: latency-chat
    type: latency-based-routing
    when:
      subjects: ["*"]
      models: ["openai/gpt-4o"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        fallback_candidate: true
      - target: "azure-backup/gpt-4o"
        fallback_candidate: true
```
### Priority-Based Routing
Routes to the highest-priority healthy target, with an SLA cutoff that marks a target unhealthy when its time per output token (TPOT) exceeds the threshold:
```yaml
rules:
  - id: priority-chat
    type: priority-based-routing
    when:
      subjects: ["team:premium"]
      models: ["*"]
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        priority: 0
        sla_cutoff:
          time_per_output_token_ms: 50
        fallback_candidate: true
      - target: "azure-backup/gpt-4o"
        priority: 1
        fallback_candidate: true
```
### Sticky Sessions
Pin users to the same target for a duration:
```yaml
rules:
  - id: sticky-chat
    type: weight-based-routing
    sticky_routing:
      ttl_seconds: 3600
      session_identifiers:
        - key: x-user-id
          source: headers
    load_balance_targets:
      - target: "openai-main/gpt-4o"
        weight: 50
      - target: "azure-backup/gpt-4o"
        weight: 50
```
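For stickiness to apply, each request must carry the configured session identifier; a minimal sketch using the `x-user-id` header from the rule above:
```python
from openai import OpenAI

client = OpenAI()  # env-var setup as shown above

# Requests carrying the same x-user-id stick to one target for ttl_seconds.
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"x-user-id": "user-123"},  # hypothetical session key
)
```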
### Header Overrides Per Target
```yaml
load_balance_targets:
  - target: "openai-main/gpt-4o"
    weight: 80
    headers_override:
      set:
        x-region: us-east-1
      remove:
        - x-internal-debug
```
### Fallback Behavior
Fallback is configured per-target inside `load_balance_targets`:
- `fallback_status_codes`: defaults to `["401", "403", "404", "429", "500", "502", "503"]`
- `fallback_candidate: true` marks a target as eligible for failover
- `retry_config.on_status_codes` controls which errors trigger retries
### Apply
```bash
tfy apply -f gateway-load-balancing-config.yaml --dry-run --show-diff
tfy apply -f gateway-load-balancing-config.yaml
```
> **Note:** Targets must be real catalog models, not nested virtual models.
## Rate Limiting
Configure rate limits per user, team, model, or custom metadata using a `gateway-rate-limiting-config` manifest. Only the first matching rule applies — place specific rules before generic ones.
```yaml
name: rate-limits
type: gateway-rate-limiting-config
rules:
  - id: "team-tpm-limit"
    when:
      subjects: ["team:backend"]
      models: ["openai-main/gpt-4o"]
    limit_to: 20000
    unit: tokens_per_minute
  - id: "user-daily-limit"
    when:
      subjects: ["user:bob@example.com"]
      models: ["openai-main/gpt-4o"]
    limit_to: 1000
    unit: requests_per_day
  - id: "per-project-hourly"
    when: {}
    limit_to: 50000
    unit: tokens_per_hour
    rate_limit_applies_per: ["metadata.project_id"]
  - id: "global-fallback"
    when: {}
    limit_to: 500
    unit: requests_per_minute
    rate_limit_applies_per: ["user"]
```
**Units:** `requests_per_minute`, `requests_per_hour`, `requests_per_day`, `tokens_per_minute`, `tokens_per_hour`, `tokens_per_day`
**`rate_limit_applies_per`:** Creates separate limits per entity (max 2 values). Options: `user`, `model`, `virtualaccount`, `metadata.<key>`.
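For metadata-scoped rules like `per-project-hourly` above, requests must carry that metadata; a minimal sketch, assuming metadata is passed via the `X-TFY-LOGGING-CONFIG` header described under Observability:
```python
from openai import OpenAI

client = OpenAI()  # env-var setup as shown above

# Tags the request with project_id so the per-project-hourly rule can
# bucket it (assumes metadata flows through X-TFY-LOGGING-CONFIG).
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"X-TFY-LOGGING-CONFIG": '{"project_id": "apollo"}'},
)
```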
```bash
tfy apply -f gateway-rate-limiting-config.yaml
```
## Budget Controls
Enforce cost limits per user, team, or metadata using a `gateway-budget-config` manifest. Costs are tracked automatically based on model pricing.
```yaml
name: budget-controls
type: gateway-budget-config
rules:
  - id: "team-monthly-budget"
    when:
      subjects: ["team:engineering"]
    limit_to: 5000
    unit: cost_per_month
    budget_applies_per: ["team"]
    alerts:
      thresholds: [75, 90, 100]
      notification_target:
        - type: email
          notification_channel: "budget-alerts"
          to_emails: ["lead@example.com"]
  - id: "user-daily-budget"
    when: {}
    limit_to: 100
    unit: cost_per_day
    budget_applies_per: ["user"]
  - id: "project-daily-budget"
    when:
      metadata:
        environment: "production"
    limit_to: 200
    unit: cost_per_day
    budget_applies_per: ["metadata.project_id"]
```
**Units:** `cost_per_day` (resets at UTC midnight), `cost_per_week` (resets on Monday), `cost_per_month` (resets on the 1st of the month)
**`budget_applies_per`:** Same options as rate limiting, plus `team`: `user`, `model`, `team`, `virtualaccount`, `metadata.<key>`.
**Alerts:** Configure threshold percentages with email, Slack webhook, or Slack bot notifications.
```bash
tfy apply -f gateway-budget-config.yaml
```
## Observability
### Request Logging
All gateway requests are logged with:
- Input/output tokens
- Latency (time to first token, total)
- Cost
- Model and provider
- User identity
- Custom metadata
### Custom Metadata
Tag requests with custom metadata for tracking:
```python
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={
        "X-TFY-LOGGING-CONFIG": '{"project": "my-app", "environment": "production"}'
    },
)
```
### Analytics
View usage analytics in TrueFoundry dashboard:
- Requests/minute per model
- Tokens/minute per model
- Failures/minute per model
- Cost breakdown by model, user, team
### OpenTelemetry Integration
Export traces to your observability stack:
- Prometheus + Grafana
- Datadog
- Custom OTEL collectors
## Guardrails
For content filtering, PII detection, prompt injection prevention, and custom safety rules, use the `guardrails` skill. It configures guardrail providers and rules that apply to this gateway's traffic.
## MCP Gateway Attachment Flow
If a user has already deployed a tool server and wants to attach it to MCP gateway:
1. Verify deployment status and endpoint URL via the TrueFoundry dashboard
2. Register the endpoint as an MCP server (`mcp-servers` skill)
3. Confirm registration ID/name and share how to reference it in policies
## Framework Integration
The gateway works with popular AI frameworks:
### LangChain
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="openai/gpt-4o",
    api_key="<your-pat-or-vat>",
    base_url="https://your-org.truefoundry.cloud/api/llm",
)
```
### LlamaIndex
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="openai/gpt-4o",
    api_key="<your-pat-or-vat>",
    api_base="https://your-org.truefoundry.cloud/api/llm",
)
```
### Cursor / Claude Code / Cline
Configure the gateway as a custom API endpoint in your coding assistant settings:
- Base URL: `{TFY_BASE_URL}/api/llm`
- API Key: Your PAT or VAT
## Presenting Gateway Info
When the user asks about gateway configuration:
```
AI Gateway:
  Endpoint: https://your-org.truefoundry.cloud/api/llm
  Auth: Personal Access Token (PAT) or Virtual Access Token (VAT)

  Available Models (check dashboard for current list):
  | Model Name        | Provider    | Type        |
  |-------------------|-------------|-------------|
  | openai/gpt-4o     | OpenAI      | Cloud       |
  | my-gemma-2b       | Self-hosted | vLLM (T4)   |
  | anthropic/claude  | Anthropic   | Cloud       |

  Usage:
  export OPENAI_BASE_URL="https://your-org.truefoundry.cloud/api/llm"
  export OPENAI_API_KEY="your-token"
  # Then use any OpenAI-compatible SDK
```
## Success Criteria
- The user can call LLMs through the gateway endpoint using an OpenAI-compatible SDK or cURL
- The user has a valid authentication token (PAT or VAT) configured for gateway access
- The agent has confirmed the target model name is available in the user's gateway configuration
- The user can verify successful responses from the gateway with correct model output
- The agent has provided working code snippets tailored to the user's language and framework
- Rate limiting, budget controls, or routing are configured if the user requested them
## Composability
- **Deploy model first**: Deploy a self-hosted model (requires TrueFoundry Enterprise), then add to gateway
- **Need API key**: Create PAT/VAT in TrueFoundry dashboard → Access
- **Rate limiting**: Configure in dashboard → AI Gateway → Rate Limiting
- **Routing config**: Apply routing YAML directly with `tfy apply`; for CI/CD pipelines, integrate `tfy apply` into your automation
- **Tool servers**: Deploy tool servers to your infrastructure, then register in gateway
- **Check deployed models**: Check the TrueFoundry dashboard to see running model services
- **Benchmark through gateway**: Use your preferred load-testing tool against gateway endpoints
## Error Handling
### 401 Unauthorized
```
Gateway authentication failed. Check:
- API key (PAT or VAT) is valid and not expired
- Using the correct header: Authorization: Bearer <token>
```
### 403 Forbidden
```
Model access denied. Your token may not have access to this model.
- PATs inherit user permissions
- VATs only have access to explicitly selected models
- Check with your admin to grant model access
```
### 429 Rate Limited
```
Rate limit exceeded. Options:
- Wait and retry (check Retry-After header)
- Request higher limits from admin
- Use load balancing to distribute across providers
```
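A minimal retry sketch that honors `Retry-After` when the gateway provides it (an assumption worth verifying), falling back to exponential backoff:
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # env-var setup as shown above

def chat_with_retry(messages, attempts=3):
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(
                model="openai/gpt-4o", messages=messages
            )
        except RateLimitError as err:
            # Prefer the server-provided Retry-After; otherwise back off.
            retry_after = err.response.headers.get("Retry-After")
            time.sleep(float(retry_after) if retry_after else 2 ** attempt)
    raise RuntimeError("Still rate limited after retries")
```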
### 502/503 Provider Error
```
Upstream provider error. The gateway will automatically:
- Retry on configured status codes
- Fallback to alternate models if routing is configured
If persistent, check provider status page or self-hosted model health.
```
### Model Not Found
```
Model name not found in gateway. Check:
- Exact model name in TrueFoundry dashboard → AI Gateway → Models
- Provider account is active and model is enabled
- Your token has access to this model
```