Complete reference for configuring your LLM Gateway
The AI Gateway is configured through an ai-gateway-config.yaml
file that defines how requests are routed, load balanced, and processed across different LLM providers.
Each Helicone AI Gateway deployment can configure multiple independent routing policies for different use cases. Each router operates with its own load balancing strategy, provider set, and configuration.
Define one or more routers. Each router name becomes part of the URL path when making requests.
Usage: Set your OpenAI SDK baseURL to http://localhost:8080/production or http://localhost:8080/experimental.
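As a rough sketch of how two routers might be declared (the nested structure here is illustrative and should be checked against the shipped defaults), the router names production and experimental become the URL path segments shown above:

```yaml
# Sketch only: each router gets its own section under a top-level routers key.
routers:
  production:
    # production router settings (load balancing, caching, retries, ...)
  experimental:
    # experimental router settings
```

Requests sent to /production use the first router's policies; requests sent to /experimental use the second.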
Distribute requests across multiple providers to optimize performance, costs, and reliability. The gateway supports latency-based and weighted strategies for different use cases.
Use latency for automatic load balancing that routes to the provider with the lowest latency.
Use weighted to distribute requests based on specific percentages.
Important: Weights must sum to exactly 1.0.
List of target providers for load balancing.
For the latency strategy, list provider names directly; for the weighted strategy, pair each provider with a weight, as in the sketch below.
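A hedged sketch of both strategies (the exact key names for the target list are assumptions; note the weighted entries sum to exactly 1.0):

```yaml
# Illustrative only; verify field names against the gateway's reference defaults.
routers:
  production:
    load-balance:
      chat:
        strategy: latency
        providers:              # latency strategy: a plain list of providers
          - openai
          - anthropic
          - gemini
  experimental:
    load-balance:
      chat:
        strategy: weighted
        providers:              # weighted strategy: weights must sum to 1.0
          - provider: openai
            weight: 0.7
          - provider: anthropic
            weight: 0.3
```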
Store and reuse LLM responses for identical requests to dramatically reduce costs and improve response times. Cache directives control response freshness and staleness tolerance.
Define the cache storage backend. Must be configured at the top level.
In-Memory Storage (Default): optionally configured with a custom max-size.
Options for max-size:
- 268435456: 256MB worth of entries
- 1000: good for development
- 10000: good for moderate traffic
- 536870912: 512MB worth of entries, for high traffic
Redis Storage: configured with a custom host-url.
Options:
- host-url: Redis connection string (required), in the form redis://[username:password@]host[:port][/database]
Both backends are sketched below.
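A sketch of the two backends under an assumed top-level cache-store key (only one backend would be active at a time):

```yaml
# In-memory backend with a custom max-size (number of entries).
cache-store:
  in-memory:
    max-size: 10000

# Alternatively, a Redis backend with a custom host-url:
# cache-store:
#   redis:
#     host-url: "redis://localhost:6379"
```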
Configure response caching for a router.
HTTP cache-control directive string.
How it works: Defines cache freshness (max-age) and staleness tolerance (max-stale) in seconds for all requests. Optionally override with cache-control request headers.
Number of responses stored per cache key before random selection begins.
How it works: Stores up to n different responses for identical requests, then randomly selects from the stored responses to add variability.
Unique seed for cache key generation.
How it works: Creates isolated cache namespaces - different seeds maintain separate cache spaces for the same requests.
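A hedged sketch of a per-router cache block; the field names directive, buckets, and seed mirror the descriptions above but may differ from the actual schema:

```yaml
routers:
  production:
    cache:
      directive: "max-age=3600, max-stale=1800"   # freshness / staleness tolerance, in seconds
      buckets: 10                                  # keep 10 responses per cache key, then pick randomly
      seed: "unique-cache-seed"                    # isolates this router's cache namespace
```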
Control request frequency using GCRA (Generic Cell Rate Algorithm) with burst capacity and smooth rate limiting. Global limits are checked first, then router-specific limits are applied.
Authentication Required: Rate limiting works per-API-key, so you must enable Helicone authentication for rate limiting to function properly. Set up authentication first.
Configure application-wide rate limits that apply to all requests.
How it works: These limits are checked first for every request across all routers.
Configure additional rate limiting specific to this router (applied after global limits).
How it works: If global limits are configured, they’re checked first. Then these router-specific limits are applied as an additional layer.
The following fields are available for both global and router-level rate limiting:
Storage backend for rate limit counters.
Options:
- in-memory: Local memory storage (default)
- redis: Redis storage (coming soon…)
Rate limits applied per API key.
Maximum number of requests in the bucket (burst capacity).
How it works: This is the maximum number of requests that can be made instantly before rate limiting kicks in.
Time to completely refill the capacity bucket.
How it works: With capacity=500 and refill-frequency=1s, you get 500 requests per second sustained rate.
How often to clean up expired rate limit entries.
Note: Only available for global rate limiting configuration.
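A sketch combining a global limit with an additional router-level limit (the global and rate-limit key names are assumptions; capacity, refill-frequency, and cleanup-interval follow the fields above):

```yaml
# Global limits, checked first for every request on every router.
global:
  rate-limit:
    store: in-memory
    per-api-key:
      capacity: 500           # burst capacity
      refill-frequency: 1s    # 500 requests/second sustained
    cleanup-interval: 5m      # only valid at the global level

# Router-specific limits, applied after the global check.
routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 100
        refill-frequency: 1m
```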
Automatically retry transient errors from AI providers using configurable strategies. Retries use smart failure detection with configurable maximum attempts and backoff timing.
Configure an application-wide retry policy that applies to all requests.
How it works: If a transient error is encountered on any endpoint, the request will be retried with the given policy.
Configure retry policy specific to this router (overrides global retry settings).
How it works: Router-specific retry settings take precedence over global retry configuration.
The following fields are available for both global and router-level retry configuration:
Retry backoff strategy.
Options:
- constant: Fixed delay between retry attempts, with jitter
- exponential: Exponentially increasing delays, with jitter and configurable bounds
Fixed delay between retry attempts (for the constant strategy).
How it works: Each retry waits exactly this duration (plus jitter) before attempting again.
Maximum number of retry attempts.
How it works: After this many failed attempts, the request fails permanently.
Minimum delay for exponential backoff strategy.
How it works: First retry waits this duration, then each subsequent retry increases exponentially.
Maximum delay cap for exponential backoff strategy.
How it works: Delays will never exceed this value, even with exponential growth.
Maximum number of retry attempts.
How it works: After this many failed attempts, the request fails permanently.
Exponential backoff multiplication factor.
How it works: Each retry delay is multiplied by this factor. With factor=2.0: 100ms → 200ms → 400ms → 800ms.
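A sketch showing a constant global policy and an exponential router-level override (key names mirror the fields described above; exact spelling may differ in the shipped schema):

```yaml
# Global retry policy: fixed delay with jitter.
global:
  retries:
    strategy: constant
    delay: 200ms
    max-retries: 3

# Router-level policy takes precedence for this router.
routers:
  production:
    retries:
      strategy: exponential
      min-delay: 100ms    # first retry delay
      max-delay: 30s      # hard cap on backoff
      max-retries: 5
      factor: 2.0         # 100ms -> 200ms -> 400ms -> 800ms ...
```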
Configure integration with the Helicone platform for authentication and observability.
Learn more about Authentication | Learn more about Observability
Enable Helicone features such as LLM observability and authentication for secure API access.
Enable authentication (required) and request logging to your Helicone dashboard.
When enabled: You must set the HELICONE_CONTROL_PLANE_API_KEY environment variable to your Helicone API key when deploying the AI Gateway with Helicone add-ons.
Helicone API endpoint URL.
Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.
WebSocket URL for control plane connection.
Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.
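A minimal sketch of the Helicone block; the enable flag name is hypothetical, and the URLs should stay at their defaults unless you self-host:

```yaml
# Requires the HELICONE_CONTROL_PLANE_API_KEY environment variable at deploy time.
helicone:
  enable: true                            # hypothetical key: enables authentication + observability
  # base-url: "https://api.helicone.ai"   # override only when self-hosting
  # websocket-url: "wss://<self-hosted control plane>"   # override only when self-hosting
```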
Configure LLM providers, their endpoints, and available models.
The gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide walks you through the cases where you might need to.
Configure provider settings to override defaults.
API endpoint URL for the provider.
List of supported models for this provider.
API version (required for some providers like Anthropic).
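A sketch overriding one provider's defaults; the field names base-url, models, and version follow the descriptions above, while the model list is purely illustrative:

```yaml
providers:
  anthropic:
    base-url: "https://api.anthropic.com"
    version: "2023-06-01"      # API version, required for some providers like Anthropic
    models:
      - claude-3-5-sonnet
      - claude-3-5-haiku
```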
Define equivalencies between models from different providers for seamless switching and load balancing.
The Gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide walks you through the cases where you might need to.
Router-specific model mappings for fallback when requested model isn’t available.
Global fallback mappings used when router-specific mappings aren’t defined.
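A sketch of both mapping levels (key names such as model-mappings and default-model-mapping are assumptions, as are the model names):

```yaml
routers:
  production:
    model-mappings:                 # router-specific fallbacks
      gpt-4o: claude-3-5-sonnet

default-model-mapping:              # global fallback when no router mapping exists
  gpt-4o-mini: claude-3-5-haiku
```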
Configure OpenTelemetry monitoring for the AI Gateway application’s health and performance.
Monitor the AI Gateway’s health and performance with OpenTelemetry. We provide Docker Compose for local testing and Grafana dashboard configs for production.
Logging level in env logger format.
Common patterns:
- "info": General information for all modules, recommended for production
- "info,ai_gateway=debug": Info for dependencies, debug for the gateway, recommended for development
Telemetry data export destination.
Options:
- stdout: Export telemetry data to standard output (default)
- otlp: Export telemetry data to an OTLP collector endpoint
- both: Export to both stdout and the OTLP collector
OTLP collector endpoint URL.
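A sketch of the telemetry block (key names assumed; the level string uses env-logger-style directives):

```yaml
telemetry:
  level: "info,ai_gateway=debug"           # info for dependencies, debug for the gateway
  exporter: both                           # stdout, otlp, or both
  otlp-endpoint: "http://localhost:4317"   # standard OTLP gRPC collector port
```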
Control which headers are returned to provide visibility into the gateway’s routing decisions and processing.
Add a helicone-provider header showing which provider handled the request.
When enabled: Responses include a header like helicone-provider: openai or helicone-provider: anthropic.
Add a helicone-provider-req-id header showing the provider's request ID.
When enabled: Responses include a header like helicone-provider-req-id: req-12345 for request tracing.
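A sketch enabling both headers (the response-headers key and field names are assumptions modeled on the header names above):

```yaml
response-headers:
  provider: true              # adds helicone-provider: <provider name>
  provider-request-id: true   # adds helicone-provider-req-id: <provider request id>
```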
Configure how the AI Gateway monitors provider health and automatically removes failing providers from load balancing rotation.
Health monitoring strategy.
Options:
- error-ratio: Monitor based on error rate thresholds (currently the only option)
Error ratio threshold (0.0-1.0) that triggers provider removal.
How it works: If errors/requests exceeds this ratio, the provider is marked unhealthy and removed from load balancing.
Time window for measuring error ratios.
How it works: Rolling window size for calculating error rates.
Minimum requests required before health monitoring takes effect.
How it works: Providers won’t be marked unhealthy until they’ve handled at least this many requests.
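A sketch of an error-ratio health monitor (the nesting and key names here are illustrative, based on the fields described above):

```yaml
health-monitor:
  type: error-ratio     # currently the only strategy
  ratio: 0.1            # mark a provider unhealthy once errors/requests exceeds 10%
  window: 60s           # rolling window for measuring the error ratio
  grace-period:
    min-requests: 20    # no provider is judged before serving at least 20 requests
```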