The AI Gateway automatically distributes requests across multiple providers using sophisticated algorithms that consider latency, provider health, and your custom preferences.

All strategies are rate-limit aware and health-monitored—unhealthy providers are automatically removed and re-added when they recover.

Benefits:

  • Optimize latency by routing to the fastest available providers
  • Improve reliability with automatic failover when providers fail
  • Handle rate limits by temporarily removing rate-limited providers
  • Control traffic distribution with custom weights for cost optimization
  • Enable gradual rollouts and A/B testing across providers

Quick Start

1. Create your configuration

Create ai-gateway-config.yaml with latency-based routing (automatically picks the fastest provider):

routers:
  my-router:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
          - gemini

2. Ensure your provider API keys are set

export OPENAI_API_KEY=sk-your-openai-key
export ANTHROPIC_API_KEY=sk-ant-your-anthropic-key
export GEMINI_API_KEY=your-gemini-key

3. Start the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml

4. Test load balancing

curl -X POST http://localhost:8080/router/my-router/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

✅ The gateway automatically routes to the fastest available provider!
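To watch the balancer work, you can send a handful of identical requests in a row (plain shell, same request as above; nothing here is gateway-specific):

for i in $(seq 1 5); do
  curl -s -X POST http://localhost:8080/router/my-router/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'
  echo
done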

Available Strategies

Additional load balancing strategies (Cost-Optimized, Model-Level Weighted, Tag-based Routing) are coming soon for advanced routing scenarios.

Use Cases

Use case: a customer-facing API where response time is critical. The gateway automatically routes each request to whichever provider is currently responding fastest, ensuring optimal user experience.

routers:
  production:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
          - gemini

How It Works

Request Flow

1. Request Arrives: A request comes in for a specific model (e.g., gpt-4o-mini).

2. Provider Selection: The load balancer identifies which providers can handle the model and applies your chosen strategy:

  • Latency strategy: picks two healthy providers at random and routes to the one with the lower current load (see the sketch after these steps)
  • Weighted strategy: routes based on your configured percentages

3. Health Check: Before routing, the gateway confirms the selected provider is healthy (not rate-limited, not failing).

4. Request Forwarded: The request is sent to the selected provider with the original model name.

5. Response & Learning: The response is returned to you, and the gateway updates its latency and health metrics to inform future routing decisions.
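The latency strategy in step 2 is a form of "power of two choices" selection. The TypeScript sketch below illustrates the idea only; it is not the gateway's implementation, and the load field stands in for whatever latency/health metric the gateway actually tracks:

interface Provider {
  name: string;
  healthy: boolean;
  load: number; // stand-in for a recent-latency or in-flight-request metric
}

// Power-of-two-choices: sample two distinct healthy providers at random,
// then route to whichever one currently reports the lower load.
function pickProvider(providers: Provider[]): Provider {
  const healthy = providers.filter((p) => p.healthy);
  if (healthy.length === 0) throw new Error("no healthy providers");
  if (healthy.length === 1) return healthy[0];

  const i = Math.floor(Math.random() * healthy.length);
  let j = Math.floor(Math.random() * (healthy.length - 1));
  if (j >= i) j++; // guarantees two distinct candidates

  return healthy[i].load <= healthy[j].load ? healthy[i] : healthy[j];
}

Sampling two candidates instead of scanning all providers keeps selection cheap while still strongly favoring less-loaded providers.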

What Gets Load Balanced

The AI Gateway distributes requests across providers (OpenAI, Anthropic, Google, etc.) that support the requested model.

Example: When you request gpt-4o-mini:

  • OpenAI - Native support for gpt-4o-mini
  • Anthropic - Via model mapping to claude-3-5-haiku
  • Ollama - Via model mapping to llama3.2

Example: When you request a model that has no configured mappings:

  • OpenAI - Only if OpenAI natively supports it
  • Anthropic - Skipped; no mapping available
  • Ollama - Skipped; no mapping available

The load balancer only considers providers that can actually handle your request.
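In configuration terms, the gpt-4o-mini example above might be expressed roughly like this. The model-mappings key and its shape are hypothetical, shown only to make the concept concrete; consult the Configuration Reference for the actual schema:

# Hypothetical schema, for illustration only; see the Configuration Reference.
model-mappings:
  gpt-4o-mini:
    anthropic: claude-3-5-haiku
    ollama: llama3.2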

Automatic Health Monitoring

All load balancing strategies automatically handle provider failures through intelligent health monitoring:

  • Error rate monitoring: providers with high error rates (default: >10%) are automatically removed
  • Rate limit detection: rate-limited providers are temporarily removed and re-added when their limits reset
  • Grace period handling: providers need a minimum number of requests (default: 20) before being considered for removal
  • Automatic recovery: unhealthy providers are periodically retested and re-added when healthy

The AI Gateway monitors provider health every 5 seconds by default. The health check uses a rolling 60-second window with configurable error thresholds.
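In principle, the defaults above could be captured in a config block like the following. The key names are illustrative, not confirmed; only the values (5-second interval, 60-second window, 10% threshold, 20-request grace period) come from this page:

# Illustrative key names only; see the Configuration Reference for the real schema.
health-monitor:
  check-interval: 5s     # providers are re-evaluated every 5 seconds (default)
  error-window: 60s      # rolling window used for the error-rate calculation
  error-threshold: 0.1   # providers above a 10% error rate are removed
  min-requests: 20       # grace period before a provider is eligible for removal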

Strategy Selection Guide

| Use Case | Recommended Strategy |
|----------|----------------------|
| Production APIs | Latency-based - Automatically optimizes for speed |
| Provider migration | Weighted - Gradual traffic shifting with instant rollback |
| A/B testing | Weighted - Controlled traffic splits for comparison |
| Cost optimization | Weighted - Route more traffic to cheaper providers |
| Compliance routing | Multiple AI Gateways - Better isolation |
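For the weighted rows above, a config might look like this sketch, assuming the weighted strategy takes a per-provider weight field; the weights are illustrative and the exact syntax may differ, so see the Configuration Reference:

routers:
  my-router:
    load-balance:
      chat:
        strategy: weighted
        providers:
          - provider: openai
            weight: 0.8     # majority of traffic to the incumbent provider
          - provider: anthropic
            weight: 0.2     # smaller share for migration or A/B comparison

Shifting the weights over successive deploys gives you a gradual migration with instant rollback: set the old provider back to 1.0 and redeploy.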

Compliance-Based Routing

For compliance requirements, deploy multiple AI Gateway instances rather than complex routing logic. This provides better isolation, security, and auditability.

Common Scenarios

Use case: European data must stay in Europe.

  • router-eu.company.com: EU-only providers
  • router-us.company.com: Global providers

The EU deployment's config:
routers:
  eu-compliant:
    load-balance:
      chat:
        strategy: latency
        providers: [anthropic-eu, openai-eu]
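The matching US deployment would use an unrestricted provider list, for example (a sketch mirroring the EU config; provider names depend on your setup):

routers:
  us-global:
    load-balance:
      chat:
        strategy: latency
        providers: [openai, anthropic, gemini]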

Benefits: Separate networks, authentication, audit trails, and certification scope per deployment.

For complete configuration options and syntax, see the Configuration Reference.

Coming Soon

The following load balancing features are planned for future releases:

| Feature | Description | Version |
|---------|-------------|---------|
| Cost-Optimized Strategy | Route to the cheapest equivalent model - picks the provider that offers the same model or configured equivalent models for the lowest price | v2 |
| Model-Level Weighted Strategy | Provider + model specific weighting - configure weights for provider+model pairs (e.g., openai/o1 vs bedrock/claude-3-5-sonnet) | v2 |
| Tag-based Routing | Header-driven routing decisions - route requests to specific providers and models based on tags passed via request headers | v3 |