Load Balancing Strategies
Intelligent request routing across providers with latency-based P2C (Power of Two Choices) and weighted algorithms
The AI Gateway automatically distributes requests across multiple providers using sophisticated algorithms that consider latency, provider health, and your custom preferences.
All strategies are rate-limit aware and health-monitored—unhealthy providers are automatically removed and re-added when they recover.
Benefits:
- Optimize latency by routing to the fastest available providers
- Improve reliability with automatic failover when providers fail
- Handle rate limits by temporarily removing rate-limited providers
- Control traffic distribution with custom weights for cost optimization
- Enable gradual rollouts and A/B testing across providers
Quick Start
Create your configuration
Create `ai-gateway-config.yaml` with latency-based routing (automatically picks the fastest provider):
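A minimal sketch of what this file might contain. The key names (`routers`, `load-balance`, `strategy`, `providers`) are illustrative assumptions rather than guaranteed syntax; see the Configuration Reference for the exact schema.

```yaml
# ai-gateway-config.yaml - hypothetical sketch; key names are illustrative.
# The latency strategy uses P2C sampling to pick the faster provider.
routers:
  default:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
```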
Ensure your provider API keys are set
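Assuming the gateway picks up the conventional per-provider environment variables (confirm the exact names in the Configuration Reference):

```bash
# Conventional provider key variables - assumed names, not confirmed.
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```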
Start the gateway
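The command below is a placeholder sketch; the actual binary name and flags depend on how you installed the gateway:

```bash
# Hypothetical launch command - substitute your install's real entry point.
ai-gateway --config ai-gateway-config.yaml
```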
Test load balancing
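Assuming the gateway exposes an OpenAI-compatible endpoint (the port and path here are assumptions), send a few requests and watch how they are routed:

```bash
# Port and path are assumptions - adjust to your deployment.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```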
✅ The gateway automatically routes to the fastest available provider!
Available Strategies
Two strategies are available today:
- Latency-based (P2C) - Samples two healthy providers at random and routes to the one with the lower load; best when you simply want the fastest response
- Weighted - Distributes traffic according to percentages you configure; best for migrations, A/B tests, and cost control
Additional load balancing strategies (Cost-Optimized, Model-Level Weighted, Tag-based Routing) are coming soon for advanced routing scenarios.
Use Cases
Latency-based routing
Use case: Customer-facing API where response time is critical. The gateway automatically routes to whichever provider is responding fastest, ensuring optimal user experience.
Weighted routing for A/B testing
Use case: Testing a new provider's quality and performance with 10% of traffic before committing to a larger rollout. Monitor metrics to compare providers safely.
Weighted routing for provider migration
Use case: Gradual migration from OpenAI to Anthropic. Start at 30/70, monitor for issues, then adjust weights weekly until fully migrated. Allows instant rollback if problems occur.
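As a sketch of the migration scenario, using the same hypothetical schema as the Quick Start example (the `provider`/`weight` keys are assumptions):

```yaml
# Hypothetical weighted config - key names are illustrative.
# Shift the weights over time (0.3 -> 0.6 -> 1.0) to complete the
# migration, or revert them for an instant rollback.
routers:
  default:
    load-balance:
      chat:
        strategy: weighted
        providers:
          - provider: anthropic
            weight: 0.3
          - provider: openai
            weight: 0.7
```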
Environment-based configuration
Use case: Development uses free local Ollama models to reduce costs during testing, while production uses cloud providers with latency optimization for real users.
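One way to express this, again under the hypothetical schema used above, is a router per environment:

```yaml
# Hypothetical per-environment routers - key names are illustrative.
routers:
  development:
    load-balance:
      chat:
        strategy: latency
        providers:
          - ollama          # free local models while testing
  production:
    load-balance:
      chat:
        strategy: latency   # optimize response time for real users
        providers:
          - openai
          - anthropic
```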
How It Works
Request Flow
Request Arrives
A request comes in for a specific model (e.g., `gpt-4o-mini`)
Provider Selection
The load balancer identifies which providers can handle this model and applies your chosen strategy:
- Latency strategy: Picks 2 healthy providers at random and routes to the one with the lower load (the P2C approach)
- Weighted strategy: Routes based on your configured percentages
Health Check
Before routing, the gateway ensures the selected provider is healthy (not rate-limited, not failing)
Request Forwarded
The request is sent to the selected provider with the original model name
Response & Learning
The response is returned to you, and the gateway updates its latency/health metrics for future routing decisions
What Gets Load Balanced
The AI Gateway distributes requests across providers (OpenAI, Anthropic, Google, etc.) that support the requested model.
Example: When you request `gpt-4o-mini`:
- ✅ OpenAI - Native support for `gpt-4o-mini`
- ✅ Anthropic - Via model mapping to `claude-3-5-haiku`
- ✅ Ollama - Via model mapping to `llama3.2`
Example: When you request a model not in any mappings:
- ✅ OpenAI - If OpenAI natively supports it
- ❌ Anthropic - No mapping available
- ❌ Ollama - No mapping available
The load balancer only considers providers that can actually handle your request.
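A sketch of how such a mapping might be declared (the `model-mappings` key is an assumption; the Configuration Reference documents the real mapping syntax):

```yaml
# Hypothetical model-mapping sketch - key name is illustrative.
# With this in place, a request for gpt-4o-mini has three candidates:
# OpenAI (native), Anthropic, and Ollama (via mapping).
model-mappings:
  gpt-4o-mini:
    anthropic: claude-3-5-haiku
    ollama: llama3.2
```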
Automatic Health Monitoring
All load balancing strategies automatically handle provider failures through intelligent health monitoring:
Error rate monitoring
Providers with high error rates (default: >10%) are automatically removed
Rate limit detection
Rate-limited providers are temporarily removed and re-added when limits reset
Grace period handling
Providers need minimum requests (default: 20) before being considered for removal
Automatic recovery
Unhealthy providers are periodically retested and re-added when healthy
The AI Gateway monitors provider health every 5 seconds by default. The health check uses a rolling 60-second window with configurable error thresholds.
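Those defaults might be tuned with something like the following sketch (the key names are assumptions; the real options are in the Configuration Reference):

```yaml
# Hypothetical health-monitoring knobs - key names are illustrative;
# the values shown are the defaults described above.
health:
  check-interval: 5s     # how often provider health is evaluated
  window: 60s            # rolling window for error-rate calculation
  error-threshold: 0.10  # remove providers whose error rate exceeds 10%
  min-requests: 20       # grace period before removal is considered
```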
Strategy Selection Guide
| Use Case | Recommended Strategy |
|---|---|
| Production APIs | Latency-based - Automatically optimizes for speed |
| Provider migration | Weighted - Gradual traffic shifting with instant rollback |
| A/B testing | Weighted - Controlled traffic splits for comparison |
| Cost optimization | Weighted - Route more traffic to cheaper providers |
| Compliance routing | Multiple AI Gateways - Better isolation |
Compliance-Based Routing
For compliance requirements, deploy multiple AI Gateway instances rather than complex routing logic. This provides better isolation, security, and auditability.
Common Scenarios
Use case: European data must stay in Europe.
Use case: Patient data requires HIPAA-compliant providers.
Use case: Different security clearance levels.
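For the data-residency scenario, for example, a dedicated instance would list only EU-hosted providers, so EU traffic has nowhere else to go (provider names below are illustrative, using the same hypothetical schema as earlier examples):

```yaml
# eu-gateway-config.yaml - hypothetical EU-only deployment.
routers:
  default:
    load-balance:
      chat:
        strategy: latency
        providers:
          - azure-openai-eu   # illustrative provider names
          - anthropic-eu
```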
Benefits: Separate networks, authentication, audit trails, and certification scope per deployment.
For complete configuration options and syntax, see the Configuration Reference.
Coming Soon
The following load balancing features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Cost-Optimized Strategy | Route to the cheapest equivalent model - picks the provider that offers the same model or configured equivalent models for the lowest price | v2 |
| Model-Level Weighted Strategy | Provider + model specific weighting - configure weights for provider+model pairs (e.g., `openai/o1` vs `bedrock/claude-3-5-sonnet`) | v2 |
| Tag-based Routing | Header-driven routing decisions - route requests to specific providers and models based on tags passed via request headers | v3 |