The AI Gateway is configured through an ai-gateway-config.yaml file that defines how requests are routed, load balanced, and processed across different LLM providers.

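For orientation, here is a minimal sketch of what a complete ai-gateway-config.yaml might look like, combining a few of the fields documented below (the router name, strategy, and cache settings are illustrative choices, not required values):

routers:
  production:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
    cache:
      directive: "max-age=3600"

cache-store:
  in-memory: {}

helicone:
  authentication: true
  observability: true
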
Routers

Each Helicone AI Gateway deployment can configure multiple independent routing policies for different use cases. Each router operates with its own load balancing strategy, provider set, and configuration.

routers
object

Define one or more routers. Each router name becomes part of the URL path when making requests.

routers:
  production:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
  
  development:
    load-balance:
      chat:
        strategy: weighted
        providers:
          - provider: anthropic
            weight: '0.9'
          - provider: openai
            weight: '0.1'

Usage: Set your OpenAI SDK baseURL to http://localhost:8080/production or http://localhost:8080/development, depending on which router defined above you want to use.

Load Balancing

Distribute requests across multiple providers to optimize performance, costs, and reliability. The gateway supports latency-based and weighted strategies for different use cases.

Learn more

Latency Strategy

routers.[name].load-balance.chat.strategy
string

Use latency for automatic load balancing that routes to the provider with the lowest latency.

routers:
  production:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
          - gemini
          - ollama

Weighted Strategy

routers.[name].load-balance.chat.strategy
string

Use weighted to distribute requests based on specific percentages.

routers:
  production:
    load-balance:
      chat:
        strategy: weighted
        providers:
          - provider: anthropic
            weight: '0.95'
          - provider: openai
            weight: '0.05'

Important: Weights must sum to exactly 1.0
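
For example, a three-way split still has to total 1.0 (the providers and weights here are illustrative):

providers:
  - provider: anthropic
    weight: '0.5'
  - provider: openai
    weight: '0.3'
  - provider: gemini
    weight: '0.2'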

Load Balancing Providers

routers.[name].load-balance.chat.providers
array

List of target providers for load balancing.

For Latency Strategy:

providers:
  - openai
  - anthropic
  - gemini

For Weighted Strategy:

providers:
  - provider: anthropic
    weight: '0.7'
  - provider: openai
    weight: '0.3'

Caching

Store and reuse LLM responses for identical requests to dramatically reduce costs and improve response times. Cache directives control response freshness and staleness tolerance.

Learn more

Cache Store

cache-store
object

Define the cache storage backend. Must be configured at the top level.

cache-store:
  in-memory: {}  # Uses default max-size

With custom max-size:

cache-store:
  in-memory:
    max-size: 10000  # Number of cached entries

Options:

  • Default: 268435456 (256MB worth of entries)
  • Small: 1000 (good for development)
  • Medium: 10000 (good for moderate traffic)
  • Large: 536870912 (512MB worth for high traffic)
routers.[name].cache
object

Configure response caching for a router.

cache-store:
  in-memory: {}

routers:
  production:
    cache:
      directive: "max-age=3600, max-stale=1800"
      buckets: 10
      seed: "unique-cache-seed"
routers.[name].cache.directive
string

HTTP cache-control directive string.

cache:
  directive: "max-age=3600, max-stale=1800"

How it works: Defines cache freshness (max-age) and staleness tolerance (max-stale) in seconds for all requests. Optionally override with cache-control request headers.

routers.[name].cache.buckets
number

Number of responses stored per cache key before random selection begins.

cache:
  buckets: 10

How it works: Stores n different responses for identical requests, then randomly selects from the stored responses to add variability.

routers.[name].cache.seed
string

Unique seed for cache key generation.

cache:
  seed: "unique-cache-seed"

How it works: Creates isolated cache namespaces; different seeds maintain separate caches for the same requests.
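
As a sketch, two routers can share the same in-memory cache store while keeping their entries isolated by using different seeds (router names and seed strings are illustrative):

cache-store:
  in-memory: {}

routers:
  production:
    cache:
      directive: "max-age=3600"
      seed: "prod-cache"
  development:
    cache:
      directive: "max-age=3600"
      seed: "dev-cache"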

Rate Limiting

Control request frequency using GCRA (Generic Cell Rate Algorithm) with burst capacity and smooth rate limiting. Global limits are checked first, then router-specific limits are applied.

Authentication Required: Rate limiting works per-API-key, so you must enable Helicone authentication for rate limiting to function properly. Set up authentication first.

Learn more

Global Rate Limiting

global.rate-limit
object

Configure application-wide rate limits that apply to all requests.

global:
  rate-limit:
    store: in-memory
    per-api-key:
      capacity: 500
      refill-frequency: 1s

How it works: These limits are checked first for every request across all routers.

Router-Level Rate Limiting

routers.[name].rate-limit
object

Configure additional rate limiting specific to this router (applied after global limits).

routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 100
        refill-frequency: 1m

How it works: If global limits are configured, they’re checked first. Then these router-specific limits are applied as an additional layer.
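
Putting the two layers together, a configuration like the following (values illustrative) checks the global 500-requests-per-second limit first, then applies the stricter 100-requests-per-minute limit for the production router on top:

global:
  rate-limit:
    store: in-memory
    per-api-key:
      capacity: 500
      refill-frequency: 1s

routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 100
        refill-frequency: 1m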

Rate Limit Configuration Fields

The following fields are available for both global and router-level rate limiting:

[context].rate-limit.store
string
default:"in-memory"

Storage backend for rate limit counters.

store: in-memory

Options:

  • in-memory - Local memory storage (default)
  • redis - Redis storage (coming soon…)
[context].rate-limit.per-api-key
object

Rate limits applied per API key.

per-api-key:
  capacity: 500
  refill-frequency: 1s
[context].rate-limit.per-api-key.capacity
integer
default:"500"

Maximum number of requests in the bucket (burst capacity).

per-api-key:
  capacity: 1000

How it works: This is the maximum number of requests that can be made instantly before rate limiting kicks in.

[context].rate-limit.per-api-key.refill-frequency
duration
default:"1s"

Time to completely refill the capacity bucket.

per-api-key:
  refill-frequency: 1s

How it works: With capacity=500 and refill-frequency=1s, you get 500 requests per second sustained rate.
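
Another way to read it: the sustained rate is capacity divided by refill-frequency, with bursts of up to capacity. For example (values illustrative):

per-api-key:
  capacity: 100        # up to 100 requests can be made instantly (burst)
  refill-frequency: 1m # bucket refills over 1 minute, i.e. about 1.67 requests/second sustained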

[context].rate-limit.cleanup-interval
duration
default:"5m"

How often to clean up expired rate limit entries.

cleanup-interval: 5m

Note: Only available for global rate limiting configuration.

Helicone Add-ons

Configure integration with the Helicone platform for authentication and observability. Authentication and observability are now separate controls.

Learn more about Authentication | Learn more about Observability

helicone.authentication
boolean
default:"false"

Enable Helicone authentication for secure API access.

helicone:
  authentication: true
  observability: false  # Optional: enable observability

When enabled: Must set the HELICONE_CONTROL_PLANE_API_KEY environment variable with your Helicone API key.

helicone.observability
boolean
default:"false"

Enable request logging to your Helicone dashboard.

helicone:
  authentication: true  # Required for observability
  observability: true

Note: Observability requires authentication to be enabled.

helicone.base-url
string
default:"https://api.helicone.ai"

Helicone API endpoint URL.

helicone:
  base-url: "https://api.helicone.ai"

Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.

helicone.websocket-url
string

WebSocket URL for control plane connection.

helicone:
  websocket-url: "wss://api.helicone.ai/ws/v1/router/control-plane"

Note: Only change this if you’re self-hosting Helicone. Use the default for Helicone Cloud.

Provider Configuration

Configure LLM providers, their endpoints, and available models.

The gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide explains when you might need to.

providers
object

Configure provider settings to override defaults.

providers:
  anthropic:
    base-url: "https://api.anthropic.com"
    version: "2023-06-01"
    models:
      - claude-3-5-haiku
  
  ollama:
    base-url: "http://192.168.1.100:11434"
    models:
      - llama3.2
      - deepseek-r1
      - custom-fine-tuned-model
  
  bedrock:
    base-url: "https://bedrock-runtime.us-west-2.amazonaws.com"
    models:
      - anthropic.claude-3-5-sonnet-20241022-v2:0
      - anthropic.claude-3-haiku-20240307-v1:0
providers.[name].base-url
string
required

API endpoint URL for the provider.

providers:
  openai:
    base-url: "https://api.openai.com"
providers.[name].models
array
required

List of supported models for this provider.

providers:
  openai:
    models:
      - gpt-4
      - gpt-4o
      - gpt-4o-mini
providers.[name].version
string

API version (required for some providers like Anthropic).

providers:
  anthropic:
    version: "2023-06-01"

Model Mapping

Define equivalencies between models from different providers for seamless switching and load balancing.

The Gateway ships with comprehensive defaults for all major providers. Most users will not need to configure this section; this guide explains when you might need to.

routers.[name].model-mappings
object

Router-specific model mappings for fallback when requested model isn’t available.

routers:
  production:
    model-mappings:
      gpt-4o: claude-3-opus
      claude-3-5-sonnet: gemini-1.5-pro
      gpt-4o-mini: claude-3-5-sonnet
default-model-mapping
object

Global fallback mappings used when router-specific mappings aren’t defined.

default-model-mapping:
  gpt-4o: claude-3-opus
  gpt-4o-mini: claude-3-5-sonnet
  claude-3-5-sonnet: gemini-1.5-pro
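
When both are set, a router's own mappings are used for that router, and default-model-mapping covers routers that don't define their own. A sketch reusing the model names from the examples above:

routers:
  production:
    model-mappings:
      gpt-4o: claude-3-opus

default-model-mapping:
  gpt-4o: claude-3-5-sonnet
  gpt-4o-mini: claude-3-5-sonnet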

Telemetry

Monitor the AI Gateway’s health and performance with OpenTelemetry. We provide Docker Compose for local testing and Grafana dashboard configs for production.

Learn more

telemetry.level
string
default:"info"

Logging level in env logger format.

telemetry:
  level: "info"

Common patterns:

  • "info" - General information for all modules, recommended for production
  • "info,ai_gateway=debug" - Debug for dependencies, info for gateway, recommended for development
telemetry.exporter
string
default:"stdout"

Telemetry data export destination.

telemetry:
  exporter: "otlp"

Options:

  • stdout - Export telemetry data to standard output (default)
  • otlp - Export telemetry data to OTLP collector endpoint
  • both - Export to both stdout and OTLP collector
telemetry.otlp-endpoint
string
default:"http://localhost:4317/v1/metrics"

OTLP collector endpoint URL.

telemetry:
  otlp-endpoint: "http://localhost:4317"
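
As a sketch, the telemetry options above can be combined; for example, info-level logs exported to an OTLP collector (endpoint value illustrative):

telemetry:
  level: "info"
  exporter: "otlp"
  otlp-endpoint: "http://localhost:4317"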

Response Headers

Control which headers are returned to provide visibility into the gateway’s routing decisions and processing.

response-headers.provider
boolean
default:"true"

Add helicone-provider header showing which provider handled the request.

response-headers:
  provider: true

When enabled: Responses include header like helicone-provider: openai or helicone-provider: anthropic.

response-headers.provider-request-id
boolean
default:"true"

Add helicone-provider-req-id header showing the provider’s request ID.

response-headers:
  provider-request-id: true

When enabled: Responses include header like helicone-provider-req-id: req-12345 for request tracing.

Health Monitoring

Configure how the AI Gateway monitors provider health and automatically removes failing providers from load balancing rotation.

discover.monitor.health.type
string

Health monitoring strategy.

discover:
  monitor:
    health:
      type: error-ratio
      ratio: 0.1
      window: 60s
      grace-period:
        min-requests: 20

Options:

  • error-ratio - Monitor based on error rate thresholds (only option currently)
discover.monitor.health.ratio
number
default:"0.1"

Error ratio threshold (0.0-1.0) that triggers provider removal.

discover:
  monitor:
    health:
      ratio: 0.15

How it works: If errors/requests exceeds this ratio, the provider is marked unhealthy and removed from load balancing.

discover.monitor.health.window
duration
default:"60s"

Time window for measuring error ratios.

discover:
  monitor:
    health:
      window: 60s

How it works: Rolling window size for calculating error rates.

discover.monitor.health.grace-period.min-requests
integer
default:"20"

Minimum requests required before health monitoring takes effect.

discover:
  monitor:
    health:
      grace-period:
        min-requests: 20

How it works: Providers won’t be marked unhealthy until they’ve handled at least this many requests.