Configuration Reference
Complete reference for configuring your LLM Gateway
The AI Gateway is configured through an ai-gateway-config.yaml file that defines how requests are routed, load balanced, and processed across different LLM providers.
Routers
Each Helicone AI Gateway deployment can configure multiple independent routing policies for different use cases. Each router operates with its own load balancing strategy, provider set, and configuration.
Define one or more routers. Each router name becomes part of the URL path when making requests.
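A minimal sketch of a config with two routers follows; the nesting under routers and the load-balance block shown here are assumptions based on the sections below, and provider names are illustrative:

```yaml
# ai-gateway-config.yaml (sketch; exact key names are assumptions)
routers:
  production:                # served at http://localhost:8080/production
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
  experimental:              # served at http://localhost:8080/experimental
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
```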
Usage: Set your OpenAI SDK baseURL to http://localhost:8080/production or http://localhost:8080/experimental.
Load Balancing
Distribute requests across multiple providers to optimize performance, costs, and reliability. The gateway supports latency-based and weighted strategies for different use cases.
Latency Strategy
Use latency for automatic load balancing that routes to the provider with the lowest latency.
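A sketch of a latency-based router (the routers/load-balance/chat nesting is an assumption; provider names are illustrative):

```yaml
routers:
  production:
    load-balance:
      chat:
        strategy: latency    # routes each request to the lowest-latency provider
        providers:
          - openai
          - anthropic
          - gemini
```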
Weighted Strategy
Use weighted to distribute requests based on specific percentages.
Important: Weights must sum to exactly 1.0
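A sketch of a weighted router, assuming each provider entry carries an explicit weight (the field names are assumptions); note the weights sum to exactly 1.0:

```yaml
routers:
  production:
    load-balance:
      chat:
        strategy: weighted
        providers:
          - provider: openai
            weight: 0.75     # 75% of requests
          - provider: anthropic
            weight: 0.25     # 25% of requests; weights total exactly 1.0
```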
Load Balancing Providers
List of target providers for load balancing.
The shape of the list depends on the strategy: a plain list of provider names for the latency strategy, or a list of provider/weight pairs for the weighted strategy, as sketched below.
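Side by side, the two shapes might look like this (the structure is an assumption carried over from the strategy sketches above):

```yaml
# Latency strategy: a plain list of provider names
providers:
  - openai
  - anthropic
  - gemini

# Weighted strategy: provider/weight pairs that sum to exactly 1.0
providers:
  - provider: openai
    weight: 0.9
  - provider: anthropic
    weight: 0.1
```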
Caching
Store and reuse LLM responses for identical requests to dramatically reduce costs and improve response times. Cache directives control response freshness and staleness tolerance.
Cache Store
Define the cache storage backend. Must be configured at the top level.
With custom max-size:
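For example, a sketch assuming a top-level cache-store key with an in-memory backend (field names are assumptions; the size values are listed under Options below):

```yaml
cache-store:
  type: in-memory
  max-size: 10000    # number of cached entries; omit to use the default
```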
Options:
- Default: 268435456 (256MB worth of entries)
- Small: 1000 (good for development)
- Medium: 10000 (good for moderate traffic)
- Large: 536870912 (512MB worth of entries for high traffic)
Configure response caching for a router.
HTTP cache-control directive string.
How it works: Defines cache freshness (max-age) and staleness tolerance (max-stale) in seconds for all requests. Optionally override with cache-control request headers.
Number of responses stored per cache key before random selection begins.
How it works: Stores up to n different responses for identical requests, then randomly selects from them to add variability.
Unique seed for cache key generation.
How it works: Creates isolated cache namespaces - different seeds maintain separate cache spaces for the same requests.
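Putting the three fields together, a router-level cache block might look like this (the cache key and the buckets and seed field names are assumptions; the directive string uses the max-age and max-stale directives described above):

```yaml
routers:
  production:
    cache:
      directive: "max-age=3600, max-stale=1800"  # fresh for 1 hour, tolerate 30 minutes of staleness
      buckets: 10                                # keep 10 responses per cache key, pick one at random
      seed: "prod-cache-seed"                    # isolates this router's cache namespace
```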
Rate Limiting
Control request frequency using GCRA (Generic Cell Rate Algorithm) with burst capacity and smooth rate limiting. Global limits are checked first, then router-specific limits are applied.
Authentication Required: Rate limiting works per-API-key, so you must enable Helicone authentication for rate limiting to function properly. Set up authentication first.
Global Rate Limiting
Configure application-wide rate limits that apply to all requests.
How it works: These limits are checked first for every request across all routers.
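A sketch of an application-wide limit, assuming a top-level global block and the field names described under Rate Limit Configuration Fields below:

```yaml
global:
  rate-limit:
    store: in-memory          # default backend
    per-api-key:
      capacity: 500           # burst capacity per API key
      refill-frequency: 1s    # 500 requests per second sustained
    cleanup-interval: 5m      # only available at the global level
```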
Router-Level Rate Limiting
Configure additional rate limiting specific to this router (applied after global limits).
How it works: If global limits are configured, they’re checked first. Then these router-specific limits are applied as an additional layer.
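A sketch of an additional, stricter limit for a single router (placement of rate-limit under the router is an assumption):

```yaml
routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 100          # checked only after the global limit passes
        refill-frequency: 1m   # 100 requests per minute sustained for this router
```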
Rate Limit Configuration Fields
The following fields are available for both global and router-level rate limiting:
Storage backend for rate limit counters.
Options:
- in-memory - Local memory storage (default)
- redis - Redis storage (coming soon)
Rate limits applied per API key.
Maximum number of requests in the bucket (burst capacity).
How it works: This is the maximum number of requests that can be made instantly before rate limiting kicks in.
Time to completely refill the capacity bucket.
How it works: With capacity=500 and refill-frequency=1s, you get 500 requests per second sustained rate.
How often to clean up expired rate limit entries.
Note: Only available for global rate limiting configuration.
Helicone Add-ons
Configure integration with the Helicone platform for authentication and observability. Authentication and observability are now separate controls.
Learn more about Authentication | Learn more about Observability
Enable Helicone authentication for secure API access.
When enabled: Must set the HELICONE_CONTROL_PLANE_API_KEY environment variable with your Helicone API key.
Enable request logging to your Helicone dashboard.
Note: Observability requires authentication to be enabled.
Helicone API endpoint URL.
Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.
WebSocket URL for control plane connection.
Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.
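A sketch of the Helicone integration block (the helicone key and the authentication, observability, base-url, and websocket-url field names are assumptions; the environment variable is the one named above):

```yaml
helicone:
  authentication: true   # requires the HELICONE_CONTROL_PLANE_API_KEY environment variable
  observability: true    # requires authentication to be enabled
  # base-url and websocket-url only need to be set when self-hosting Helicone;
  # leave them unset to use the Helicone Cloud defaults.
```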
Provider Configuration
Configure LLM providers, their endpoints, and available models.
The gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide explains when you might need to.
Configure provider settings to override defaults.
API endpoint URL for the provider.
List of supported models for this provider.
API version (required for some providers like Anthropic).
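A sketch overriding one provider's defaults (the providers key and the base-url, models, and version field names are assumptions; model names are illustrative, and Anthropic is shown because it requires an API version):

```yaml
providers:
  anthropic:
    base-url: "https://api.anthropic.com"
    version: "2023-06-01"       # Anthropic requires an API version
    models:
      - claude-3-5-sonnet
      - claude-3-opus
```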
Model Mapping
Define equivalencies between models from different providers for seamless switching and load balancing.
The Gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide explains when you might need to.
Router-specific model mappings for fallback when the requested model isn't available.
Global fallback mappings used when router-specific mappings aren’t defined.
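A sketch showing both levels of mappings (the model-mappings and default-model-mapping key names are assumptions; model names are illustrative):

```yaml
routers:
  production:
    model-mappings:             # fallback used when the requested model isn't available
      gpt-4o: claude-3-5-sonnet
default-model-mapping:          # global fallback when no router-specific mapping exists
  gpt-4o-mini: claude-3-5-haiku
```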
Telemetry
Configure OpenTelemetry monitoring for the AI Gateway application's health and performance. We provide a Docker Compose setup for local testing and Grafana dashboard configs for production.
Logging level in env logger format.
Common patterns:
- "info" - General information for all modules, recommended for production
- "info,ai_gateway=debug" - Info for dependencies, debug for the gateway, recommended for development
Telemetry data export destination.
Options:
- stdout - Export telemetry data to standard output (default)
- otlp - Export telemetry data to an OTLP collector endpoint
- both - Export to both stdout and the OTLP collector
OTLP collector endpoint URL.
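A sketch of the telemetry block, assuming the field names level, exporter, and otlp-endpoint (the endpoint shown is the conventional local OTLP gRPC port):

```yaml
telemetry:
  level: "info,ai_gateway=debug"          # development setting; use "info" in production
  exporter: otlp                          # stdout (default), otlp, or both
  otlp-endpoint: "http://localhost:4317"  # adjust to point at your collector
```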
Response Headers
Control which headers are returned to provide visibility into the gateway’s routing decisions and processing.
Add a helicone-provider header showing which provider handled the request.
When enabled: Responses include a header like helicone-provider: openai or helicone-provider: anthropic.
Add a helicone-provider-req-id header showing the provider's request ID.
When enabled: Responses include a header like helicone-provider-req-id: req-12345 for request tracing.
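A sketch enabling both headers (the response-headers key and its field names are assumptions; the header names are the ones described above):

```yaml
response-headers:
  provider: true             # adds helicone-provider to responses
  provider-request-id: true  # adds helicone-provider-req-id to responses
```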
Health Monitoring
Configure how the AI Gateway monitors provider health and automatically removes failing providers from load balancing rotation.
Health monitoring strategy.
Options:
- error-ratio - Monitor based on error rate thresholds (only option currently)
Error ratio threshold (0.0-1.0) that triggers provider removal.
How it works: If errors/requests exceeds this ratio, the provider is marked unhealthy and removed from load balancing.
Time window for measuring error ratios.
How it works: Rolling window size for calculating error rates.
Minimum requests required before health monitoring takes effect.
How it works: Providers won’t be marked unhealthy until they’ve handled at least this many requests.
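A sketch tying the four fields together (the nesting and the type, ratio, window, and grace-period field names are assumptions):

```yaml
discover:
  monitor:
    health:
      type: error-ratio     # currently the only strategy
      ratio: 0.1            # mark a provider unhealthy above 10% errors
      window: 60s           # rolling window for the error-rate calculation
      grace-period:
        min-requests: 20    # no health decisions until 20 requests have been seen
```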