Load Balancing Strategies
Intelligent request routing across providers with latency-based P2C (Power of Two Choices) and weighted algorithms
The AI Gateway automatically distributes requests across multiple providers using sophisticated algorithms that consider latency, provider health, and your custom preferences.
All strategies are rate-limit aware and health-monitored—unhealthy providers are automatically removed and re-added when they recover.
Benefits:
- Optimize latency by routing to the fastest available providers
- Improve reliability with automatic failover when providers fail
- Handle rate limits by temporarily removing rate-limited providers
- Control traffic distribution with custom weights for cost optimization
- Enable gradual rollouts and A/B testing across providers
Quick Start
Create your configuration
Create `ai-gateway-config.yaml` with latency-based routing (automatically picks the fastest provider):
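A minimal sketch of what this file might contain. The key names (`routers`, `load-balance`, `strategy`, `providers`) are illustrative assumptions rather than guaranteed syntax; see the Configuration Reference for the exact schema.

```yaml
# ai-gateway-config.yaml - hypothetical sketch; key names are illustrative.
# The latency strategy uses P2C sampling to pick the faster provider.
routers:
  default:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
```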
Ensure your provider API keys are set
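Assuming the gateway picks up the conventional per-provider environment variables (confirm the exact names in the Configuration Reference):

```bash
# Conventional provider key variables - assumed names, not confirmed.
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```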
Start the gateway
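The command below is a placeholder sketch; the actual binary name and flags depend on how you installed the gateway:

```bash
# Hypothetical launch command - substitute your install's real entry point.
ai-gateway --config ai-gateway-config.yaml
```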
Test load balancing
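Assuming the gateway exposes an OpenAI-compatible endpoint (the port and path here are assumptions), send a few requests and watch how they are routed:

```bash
# Port and path are assumptions - adjust to your deployment.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```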
✅ The gateway automatically routes to the fastest available provider!
Available Strategies
Two strategies are available today:
- Latency-based (P2C) - Samples two healthy providers at random and routes to the one with the lower load; best when you simply want the fastest response
- Weighted - Distributes traffic according to percentages you configure; best for migrations, A/B tests, and cost control
Additional load balancing strategies (Cost-Optimized, Model-Level Weighted, Tag-based Routing) are coming soon for advanced routing scenarios.
Use Cases
Latency-based routing
Use case: Customer-facing API where response time is critical. The gateway automatically routes to whichever provider is responding fastest, ensuring optimal user experience.
Weighted routing for A/B testing
Use case: Testing a new provider's quality and performance with 10% of traffic before committing to a larger rollout. Monitor metrics to compare providers safely.
Weighted routing for provider migration
Use case: Gradual migration from OpenAI to Anthropic. Start at 30/70, monitor for issues, then adjust weights weekly until fully migrated. Allows instant rollback if problems occur.
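As a sketch of the migration scenario, using the same hypothetical schema as the Quick Start example (the `provider`/`weight` keys are assumptions):

```yaml
# Hypothetical weighted config - key names are illustrative.
# Shift the weights over time (0.3 -> 0.6 -> 1.0) to complete the
# migration, or revert them for an instant rollback.
routers:
  default:
    load-balance:
      chat:
        strategy: weighted
        providers:
          - provider: anthropic
            weight: 0.3
          - provider: openai
            weight: 0.7
```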
Environment-based configuration
Use case: Development uses free local Ollama models to reduce costs during testing, while production uses cloud providers with latency optimization for real users.
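One way to express this, again under the hypothetical schema used above, is a router per environment:

```yaml
# Hypothetical per-environment routers - key names are illustrative.
routers:
  development:
    load-balance:
      chat:
        strategy: latency
        providers:
          - ollama          # free local models while testing
  production:
    load-balance:
      chat:
        strategy: latency   # optimize response time for real users
        providers:
          - openai
          - anthropic
```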
How It Works
Request Flow
Request Arrives
A request comes in for a specific model (e.g., `gpt-4o-mini`)
Provider Selection
The load balancer identifies which providers can handle this model and applies your chosen strategy:
- Latency strategy: Picks 2 healthy providers at random and routes to the one with the lower load (the P2C approach)
- Weighted strategy: Routes based on your configured percentages
Health Check
Before routing, the gateway ensures the selected provider is healthy (not rate-limited, not failing)
Request Forwarded
The request is sent to the selected provider with the original model name
Response & Learning
The response is returned to you, and the gateway updates its latency/health metrics for future routing decisions
What Gets Load Balanced
The AI Gateway distributes requests across providers (OpenAI, Anthropic, Google, etc.) that support the requested model.
Example: When you request `gpt-4o-mini`:
- ✅ OpenAI - Native support for `gpt-4o-mini`
- ✅ Anthropic - Via model mapping to `claude-3-5-haiku`
- ✅ Ollama - Via model mapping to `llama3.2`
Example: When you request a model not in any mappings:
- ✅ OpenAI - If OpenAI natively supports it
- ❌ Anthropic - No mapping available
- ❌ Ollama - No mapping available
The load balancer only considers providers that can actually handle your request.
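A sketch of how such a mapping might be declared (the `model-mappings` key is an assumption; the Configuration Reference documents the real mapping syntax):

```yaml
# Hypothetical model-mapping sketch - key name is illustrative.
# With this in place, a request for gpt-4o-mini has three candidates:
# OpenAI (native), Anthropic, and Ollama (via mapping).
model-mappings:
  gpt-4o-mini:
    anthropic: claude-3-5-haiku
    ollama: llama3.2
```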
Automatic Health Monitoring
All load balancing strategies automatically handle provider failures through intelligent health monitoring:
Error rate monitoring
Providers with high error rates (default: >10%) are automatically removed
Rate limit detection
Rate-limited providers are temporarily removed and re-added when limits reset
Grace period handling
Providers need minimum requests (default: 20) before being considered for removal
Automatic recovery
Unhealthy providers are periodically retested and re-added when healthy
The AI Gateway monitors provider health every 5 seconds by default. The health check uses a rolling 60-second window with configurable error thresholds.
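Those defaults might be tuned with something like the following sketch (the key names are assumptions; the real options are in the Configuration Reference):

```yaml
# Hypothetical health-monitoring knobs - key names are illustrative;
# the values shown are the defaults described above.
health:
  check-interval: 5s     # how often provider health is evaluated
  window: 60s            # rolling window for error-rate calculation
  error-threshold: 0.10  # remove providers whose error rate exceeds 10%
  min-requests: 20       # grace period before removal is considered
```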
Strategy Selection Guide
| Use Case | Recommended Strategy |
|---|---|
| Production APIs | Latency-based - Automatically optimizes for speed |
| Provider migration | Weighted - Gradual traffic shifting with instant rollback |
| A/B testing | Weighted - Controlled traffic splits for comparison |
| Cost optimization | Weighted - Route more traffic to cheaper providers |
| Compliance routing | Multiple AI Gateways - Better isolation |
Compliance-Based Routing
For compliance requirements, deploy multiple AI Gateway instances rather than complex routing logic. This provides better isolation, security, and auditability.
Common Scenarios
Use case: European data must stay in Europe.
Use case: Patient data requires HIPAA-compliant providers.
Use case: Different security clearance levels.
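For the data-residency scenario, for example, a dedicated instance would list only EU-hosted providers, so EU traffic has nowhere else to go (provider names below are illustrative, using the same hypothetical schema as earlier examples):

```yaml
# eu-gateway-config.yaml - hypothetical EU-only deployment.
routers:
  default:
    load-balance:
      chat:
        strategy: latency
        providers:
          - azure-openai-eu   # illustrative provider names
          - anthropic-eu
```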
Benefits: Separate networks, authentication, audit trails, and certification scope per deployment.
For complete configuration options and syntax, see the Configuration Reference.
Coming Soon
The following load balancing features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Cost-Optimized Strategy | Route to the cheapest equivalent model - picks the provider that offers the same model or configured equivalent models for the lowest price | v2 |
| Model-Level Weighted Strategy | Provider + model specific weighting - configure weights for provider+model pairs (e.g., `openai/o1` vs `bedrock/claude-3-5-sonnet`) | v2 |
| Tag-based Routing | Header-driven routing decisions - route requests to specific providers and models based on tags passed via request headers | v3 |