The AI Gateway is configured through an ai-gateway-config.yaml file that defines how requests are routed, load balanced, and processed across different LLM providers.
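For orientation, here is a minimal sketch of a complete ai-gateway-config.yaml combining options documented below (the router name, provider choices, and cache settings are illustrative):
cache-store:
  type: "in-memory"

routers:
  production:
    load-balance:
      chat:
        strategy: provider-latency
        providers:
          - openai
          - anthropic
    cache:
      directive: "max-age=3600"
Each top-level key is covered in its own section below.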

Routers

Each Helicone AI Gateway deployment can define multiple independent routers for different use cases. Each router operates with its own load balancing strategy, provider set, and configuration.
routers
object
Define one or more routers. Each router name becomes part of the URL path when making requests.
routers:
  latency:
    load-balance:
      chat:
        strategy: provider-latency
        providers:
          - openai
          - anthropic
  
  weighted:
    load-balance:
      chat:
        strategy: provider-weighted
        providers:
          - provider: anthropic
            weight: '0.9'
          - provider: openai
            weight: '0.1'
Usage: Set your OpenAI SDK baseURL to http://localhost:8080/router/latency or http://localhost:8080/router/weighted.

Load Balancing

Distribute requests across multiple providers to optimize performance, costs, and reliability. The gateway supports latency-based and weighted strategies for different use cases. Learn more

Latency Strategy

routers.[name].load-balance.chat.strategy
string
Use provider-latency for automatic load balancing that routes requests to the provider with the lowest latency.
routers:
  production:
    load-balance:
      chat:
        strategy: provider-latency
        providers:
          - openai
          - anthropic
          - gemini
          - ollama

Weighted Strategy

routers.[name].load-balance.chat.strategy
string
Use provider-weighted to distribute requests across providers according to configured weights.
routers:
  production:
    load-balance:
      chat:
        strategy: provider-weighted
        providers:
          - provider: anthropic
            weight: '0.95'
          - provider: openai
            weight: '0.05'
Important: Weights must sum to exactly 1.0

Load Balancing Providers

routers.[name].load-balance.chat.providers
array
List of target providers for load balancing.
For Latency Strategy:
providers:
  - openai
  - anthropic
  - gemini
For Weighted Strategy:
providers:
  - provider: anthropic
    weight: '0.7'
  - provider: openai
    weight: '0.3'

Caching

Store and reuse LLM responses for identical requests to dramatically reduce costs and improve response times. Cache directives control response freshness and staleness tolerance. Learn more

Cache Store

cache-store
object
Define the cache storage backend. Must be configured at the top level.
In-Memory Storage (Default):
cache-store:
  type: "in-memory"
With custom max-size:
cache-store:
  type: "in-memory"
  max-size: 10000  # Number of cached entries
Options:
  • Default: 268435456 (256MB worth of entries)
  • Small: 1000 (good for development)
  • Medium: 10000 (good for moderate traffic)
  • Large: 536870912 (512MB worth for high traffic)
routers.[name].cache
object
Configure response caching for a router.
cache-store:
  type: "in-memory"

routers:
  production:
    cache:
      directive: "max-age=3600, max-stale=1800"
      buckets: 10
      seed: "unique-cache-seed"
routers.[name].cache.directive
string
HTTP cache-control directive string.
cache:
  directive: "max-age=3600, max-stale=1800"
How it works: Defines cache freshness (max-age) and staleness tolerance (max-stale) in seconds for all requests. Clients can optionally override these values by sending cache-control request headers.
routers.[name].cache.buckets
number
Number of responses stored per cache key before random selection begins.
cache:
  buckets: 10
How it works: Stores up to n different responses for identical requests, then randomly selects from the stored responses to add variability.
routers.[name].cache.seed
string
Unique seed for cache key generation.
cache:
  seed: "unique-cache-seed"
How it works: Creates isolated cache namespaces - different seeds maintain separate cache spaces for the same requests.
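For example, a sketch of two routers kept in separate cache namespaces (router names and seed values are illustrative):
cache-store:
  type: "in-memory"

routers:
  team-a:
    cache:
      directive: "max-age=3600"
      seed: "team-a-cache"
  team-b:
    cache:
      directive: "max-age=3600"
      seed: "team-b-cache"
Identical requests sent through team-a and team-b are cached independently because their seeds differ.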

Rate Limiting

Control request frequency using GCRA (Generic Cell Rate Algorithm) with burst capacity and smooth rate limiting. Global limits are checked first, then router-specific limits are applied.
Authentication Required: Rate limiting works per-API-key, so you must enable Helicone authentication for rate limiting to function properly. Set up authentication first.
Learn more

Rate Limit Store

rate-limit-store
object
Define the rate limit storage backend. Must be configured at the top level.
In-Memory Storage (Default):
rate-limit-store:
  type: "in-memory"

Global Rate Limiting

global.rate-limit
object
Configure application-wide rate limits that apply to all requests.
global:
  rate-limit:
    per-api-key:
      capacity: 500
      refill-frequency: 1s
How it works: These limits are checked first for every request across all routers.

Router-Level Rate Limiting

routers.[name].rate-limit
object
Configure additional rate limiting specific to this router (applied after global limits).
routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 100
        refill-frequency: 1m
How it works: If global limits are configured, they’re checked first. Then these router-specific limits are applied as an additional layer.
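For example, a sketch combining both layers (capacities and frequencies are illustrative):
global:
  rate-limit:
    per-api-key:
      capacity: 500
      refill-frequency: 1s

routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 100
        refill-frequency: 1m
With this configuration, a request to the production router must pass the global 500-per-second check first, then the router's 100-per-minute check.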
routers.[name].rate-limit.store
string
default:"in-memory"
Optionally override, for this router, the storage backend defined by the top-level rate-limit-store configuration.
routers:
  production:
    rate-limit:
      store:
        type: in-memory
      per-api-key:
        ...
Options: Refer to the Rate Limit Store section for more information.

Rate Limit Configuration Fields

The following fields are available for both global and router-level rate limiting:
[context].rate-limit.per-api-key
object
Rate limits applied per API key.
per-api-key:
  capacity: 500
  refill-frequency: 1s
[context].rate-limit.per-api-key.capacity
integer
default:"500"
Maximum number of requests in the bucket (burst capacity).
per-api-key:
  capacity: 1000
How it works: This is the maximum number of requests that can be made instantly before rate limiting kicks in.
[context].rate-limit.per-api-key.refill-frequency
duration
default:"1s"
Time to completely refill the capacity bucket.
per-api-key:
  refill-frequency: 1s
How it works: With capacity=500 and refill-frequency=1s, you get 500 requests per second sustained rate.
[context].rate-limit.cleanup-interval
duration
default:"5m"
How often to clean up expired rate limit entries.
cleanup-interval: 5m
Note: Only available for global rate limiting configuration.

Retries

Automatically retry transient errors from AI providers using configurable strategies. Retries use smart failure detection with configurable maximum attempts and backoff timing. Learn more

Global Retry Configuration

global.retries
object
Configure an application-wide retry policy that applies to all requests.
global:
  retries:
    strategy: "constant"
    delay: "50ms"
    max-retries: 3
How it works: If a transient error is encountered on any endpoint, the request will be retried with the given policy.

Router-Level Retry Configuration

routers.[name].retries
object
Configure retry policy specific to this router (overrides global retry settings).
routers:
  production:
    retries:
      strategy: "exponential"
      min-delay: "100ms"
      max-delay: "30s"
      max-retries: 5
      factor: 2.0
How it works: Router-specific retry settings take precedence over global retry configuration.
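For example, a sketch where one router overrides an application-wide policy (all values are illustrative):
global:
  retries:
    strategy: "constant"
    delay: "50ms"
    max-retries: 3

routers:
  production:
    retries:
      strategy: "exponential"
      min-delay: "100ms"
      max-delay: "30s"
      max-retries: 5
      factor: 2.0
Requests through the production router use the exponential policy; requests through any other router fall back to the global constant policy.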

Retry Strategy Fields

The following fields are available for both global and router-level retry configuration:
[context].retries.strategy
string
required
Retry backoff strategy.
retries:
  strategy: "constant"
Options:
  • constant - Fixed delay between retry attempts with jitter
  • exponential - Exponentially increasing delays with jitter and configurable bounds

Constant Strategy Fields

[context].retries.delay
duration
default:"1s"
Fixed delay between retry attempts (for constant strategy).
retries:
  strategy: "constant"
  delay: "50ms"
How it works: Each retry waits exactly this duration (plus jitter) before attempting again.
[context].retries.max-retries
integer
default:"2"
Maximum number of retry attempts.
retries:
  strategy: "constant"
  max-retries: 3
How it works: After this many failed attempts, the request fails permanently.

Exponential Strategy Fields

[context].retries.min-delay
duration
default:"1s"
Minimum delay for exponential backoff strategy.
retries:
  strategy: "exponential"
  min-delay: "100ms"
How it works: The first retry waits this duration; each subsequent delay grows exponentially.
[context].retries.max-delay
duration
default:"30s"
Maximum delay cap for exponential backoff strategy.
retries:
  strategy: "exponential"
  max-delay: "60s"
How it works: Delays will never exceed this value, even with exponential growth.
[context].retries.max-retries
integer
default:"2"
Maximum number of retry attempts.
retries:
  strategy: "exponential"
  max-retries: 5
How it works: After this many failed attempts, the request fails permanently.
[context].retries.factor
number
default:"2.0"
Exponential backoff multiplication factor.
retries:
  strategy: "exponential"
  factor: 1.5
How it works: Each retry delay is multiplied by this factor. With factor=2.0: 100ms → 200ms → 400ms → 800ms.

Helicone Add-ons

Configure integration with the Helicone platform for authentication, observability, and prompt management. Learn more about Authentication | Learn more about Observability | Learn more about Prompts
helicone.features
string
default:"none"
Enable Helicone features for your AI Gateway.
Authentication only:
helicone:
  features: auth
Authentication and observability:
helicone:
  features: observability
Authentication and prompt management:
helicone:
  features: prompts
All features (authentication, observability, and prompts):
helicone:
  features: all
Available options:
  • auth - Enable authentication for secure API access
  • observability - Enable authentication and request logging to your Helicone dashboard
  • prompts - Enable authentication and prompt management with prompt_id support
  • all - Enable all Helicone features (authentication, observability, and prompts)
When enabled: You must set the HELICONE_CONTROL_PLANE_API_KEY environment variable to your Helicone API key when deploying the AI Gateway.
helicone.base-url
string
default:"https://api.helicone.ai"
Helicone API endpoint URL.
helicone:
  base-url: "https://api.helicone.ai"
Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.
helicone.websocket-url
string
WebSocket URL for control plane connection.
helicone:
  websocket-url: "wss://api.helicone.ai/ws/v1/router/control-plane"
Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.
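For a self-hosted Helicone deployment, the two URLs are typically overridden together; a sketch with an illustrative hostname:
helicone:
  features: all
  base-url: "https://helicone.internal.example.com"
  websocket-url: "wss://helicone.internal.example.com/ws/v1/router/control-plane"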

Provider Configuration

Configure LLM providers, their endpoints, and available models.
The gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide walks you through the cases where you might.
providers
object
Configure provider settings to override defaults.
providers:
  anthropic:
    base-url: "https://api.anthropic.com"
    version: "2023-06-01"
    models:
      - claude-3-5-haiku
  
  ollama:
    base-url: "http://192.168.1.100:11434"
    models:
      - llama3.2
      - deepseek-r1
      - custom-fine-tuned-model
  
  bedrock:
    base-url: "https://bedrock-runtime.us-west-2.amazonaws.com"
    models:
      - anthropic.claude-3-5-sonnet-20241022-v2:0
      - anthropic.claude-3-haiku-20240307-v1:0
providers.[name].base-url
string
required
API endpoint URL for the provider.
providers:
  openai:
    base-url: "https://api.openai.com"
providers.[name].models
array
required
List of supported models for this provider.
providers:
  openai:
    models:
      - gpt-4
      - gpt-4o
      - gpt-4o-mini
providers.[name].version
string
API version (required for some providers like Anthropic).
providers:
  anthropic:
    version: "2023-06-01"

Model Mapping

Define equivalencies between models from different providers for seamless switching and load balancing.
The Gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide walks you through the cases where you might.
routers.[name].model-mappings
object
Router-specific model mappings used as fallbacks when the requested model isn’t available.
routers:
  production:
    model-mappings:
      gpt-4o: claude-3-opus
      claude-3-5-sonnet: gemini-1.5-pro
      gpt-4o-mini: claude-3-5-sonnet
default-model-mapping
object
Global fallback mappings used when router-specific mappings aren’t defined.
default-model-mapping:
  gpt-4o: claude-3-opus
  gpt-4o-mini: claude-3-5-sonnet
  claude-3-5-sonnet: gemini-1.5-pro
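A sketch showing how the two levels interact (model pairings are illustrative):
routers:
  production:
    model-mappings:
      gpt-4o: claude-3-opus

default-model-mapping:
  gpt-4o-mini: claude-3-5-sonnet
On the production router, gpt-4o falls back to claude-3-opus via the router-specific mapping, while gpt-4o-mini falls back to claude-3-5-sonnet via the global default.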

Telemetry

Configure OpenTelemetry monitoring for the AI Gateway application’s health and performance.
We provide a Docker Compose setup for local testing and Grafana dashboard configs for production. Learn more
telemetry.level
string
default:"info"
Logging level in env_logger filter format.
telemetry:
  level: "info"
Common patterns:
  • "info" - General information for all modules, recommended for production
  • "info,ai_gateway=debug" - Debug for dependencies, info for gateway, recommended for development
telemetry.exporter
string
default:"stdout"
Telemetry data export destination.
telemetry:
  exporter: "otlp"
Options:
  • stdout - Export telemetry data to standard output (default)
  • otlp - Export telemetry data to OTLP collector endpoint
  • both - Export to both stdout and OTLP collector
telemetry.otlp-endpoint
string
default:"http://localhost:4317/v1/metrics"
OTLP collector endpoint URL.
telemetry:
  otlp-endpoint: "http://localhost:4317"
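Putting the telemetry fields together, a sketch of a production-style setup exporting to a local OTLP collector (the endpoint is illustrative):
telemetry:
  level: "info"
  exporter: "otlp"
  otlp-endpoint: "http://localhost:4317"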

Response Headers

Control which headers are returned to provide visibility into the gateway’s routing decisions and processing.
response-headers.provider
boolean
default:"true"
Add helicone-provider header showing which provider handled the request.
response-headers:
  provider: true
When enabled: Responses include a header like helicone-provider: openai or helicone-provider: anthropic.
response-headers.provider-request-id
boolean
default:"true"
Add helicone-provider-req-id header showing the provider’s request ID.
response-headers:
  provider-request-id: true
When enabled: Responses include a header like helicone-provider-req-id: req-12345 for request tracing.

Health Monitoring

Configure how the AI Gateway monitors provider health and automatically removes failing providers from load balancing rotation.
discover.monitor.health.type
string
Health monitoring strategy.
discover:
  monitor:
    health:
      type: error-ratio
      ratio: 0.1
      window: 60s
      grace-period:
        min-requests: 20
Options:
  • error-ratio - Monitor based on error-rate thresholds (currently the only option)
discover.monitor.health.ratio
number
default:"0.1"
Error ratio threshold (0.0-1.0) that triggers provider removal.
discover:
  monitor:
    health:
      ratio: 0.15
How it works: If errors/requests exceeds this ratio, the provider is marked unhealthy and removed from load balancing.
discover.monitor.health.window
duration
default:"60s"
Time window for measuring error ratios.
discover:
  monitor:
    health:
      window: 60s
How it works: Rolling window size for calculating error rates.
discover.monitor.health.grace-period.min-requests
integer
default:"20"
Minimum requests required before health monitoring takes effect.
discover:
  monitor:
    health:
      grace-period:
        min-requests: 20
How it works: Providers won’t be marked unhealthy until they’ve handled at least this many requests.