Intelligent request routing across providers with latency-based P2C and weighted algorithms
The AI Gateway automatically distributes requests across multiple providers using sophisticated algorithms that consider latency, provider health, and your custom preferences.
All strategies are rate-limit aware and health-monitored—unhealthy providers are automatically removed and re-added when they recover.
Benefits:
- Lower latency: requests are routed to the fastest-responding provider by default
- Automatic failover: unhealthy or rate-limited providers are removed and re-added on recovery
- Flexible traffic control: weighted splits support migrations, A/B tests, and cost optimization
**1. Create your configuration**

Create `ai-gateway-config.yaml` with latency-based routing (automatically picks the fastest provider):
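A minimal sketch; the key names here are illustrative, so check the Configuration Reference for the exact schema:

```yaml
# Illustrative schema -- see the Configuration Reference for exact keys
routers:
  default:
    load-balance:
      chat:
        strategy: latency        # P2C + PeakEWMA (the default)
        providers:
          - openai
          - anthropic
```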
**2. Ensure your provider API keys are set**

Set up your `.env` file with your `PROVIDER_API_KEY` values:
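For example, assuming OpenAI and Anthropic are the configured providers:

```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```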
**3. Start the gateway**
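The exact command depends on how you installed the gateway; a hypothetical invocation:

```bash
# Hypothetical command -- substitute your actual install method
ai-gateway --config ai-gateway-config.yaml
```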
**4. Test load balancing**
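Assuming the gateway exposes an OpenAI-compatible endpoint on localhost (the port and path may differ in your deployment):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```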
✅ The gateway automatically routes to the fastest available provider!
Latency-based (P2C + PeakEWMA) - Default
Power-of-Two-Choices with Peak Exponentially Weighted Moving Average
Maintains a moving average of each provider’s RTT latency, weighted by the number of outstanding requests, to distribute traffic to providers with the least load and optimize for latency.
Best for: Production workloads where latency matters most
How it works:
1. For each request, the balancer randomly samples two candidate providers (the "power of two choices").
2. It compares their load estimates: the peak EWMA of observed round-trip latency, weighted by the number of requests currently in flight.
3. The request goes to the candidate with the lower estimate, and that provider's moving average is updated when the response completes.
Example:
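A configuration sketch, with illustrative key names (see the Configuration Reference for the exact schema):

```yaml
# Illustrative keys -- see the Configuration Reference
routers:
  production:
    load-balance:
      chat:
        strategy: latency        # default: P2C + PeakEWMA
        providers:
          - openai
          - anthropic
          - gemini
```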
Weighted Strategy
Custom traffic percentages across providers
Routes traffic based on arbitrary weights you specify. For example, if you have providers [A, B, C] with weights [0.80, 0.15, 0.05], then A gets 80% of traffic, B gets 15%, and C gets 5%.
Best for: Cost optimization, gradual provider migrations, or compliance requirements
Example:
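A sketch of the [0.80, 0.15, 0.05] split described above; key names are illustrative:

```yaml
# Illustrative keys -- see the Configuration Reference
routers:
  default:
    load-balance:
      chat:
        strategy: weighted
        providers:
          - provider: openai     # A: 80% of traffic
            weight: 0.80
          - provider: anthropic  # B: 15%
            weight: 0.15
          - provider: gemini     # C: 5%
            weight: 0.05
```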
Additional load balancing strategies (Cost-Optimized, Model-Level Weighted, Tag-based Routing) are coming soon for advanced routing scenarios.
Use case: Customer-facing API where response time is critical. The gateway automatically routes to whichever provider is responding fastest, ensuring optimal user experience.
Use case: Testing a new provider’s quality and performance with 10% of traffic before committing to larger rollout. Monitor metrics to compare providers safely.
Use case: Gradual migration from OpenAI to Anthropic. Start at 30/70, monitor for issues, then adjust weights weekly until fully migrated. Allows instant rollback if problems occur.
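For instance, the initial 30/70 split might look like this (illustrative keys, inside a router's load-balance block):

```yaml
# 30% to the new provider, 70% to the incumbent -- illustrative keys
strategy: weighted
providers:
  - provider: anthropic
    weight: 0.30
  - provider: openai
    weight: 0.70
```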
Use case: Development uses free local Ollama models to reduce costs during testing, while production uses cloud providers with latency optimization for real users.
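One way to express this is a router per environment (hypothetical router names, illustrative keys):

```yaml
# Illustrative keys -- see the Configuration Reference
routers:
  development:
    load-balance:
      chat:
        strategy: latency
        providers:
          - ollama             # free local models for testing
  production:
    load-balance:
      chat:
        strategy: latency      # latency optimization for real users
        providers:
          - openai
          - anthropic
```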
1. **Request Arrives**: A request comes in for a specific model (e.g., `gpt-4o-mini`).
2. **Provider Selection**: The load balancer identifies which providers can handle this model and applies your chosen strategy (latency-based picks the less-loaded of two random candidates; weighted picks according to your configured percentages).
3. **Health Check**: Before routing, the gateway ensures the selected provider is healthy (not rate-limited, not failing).
4. **Request Forwarded**: The request is sent to the selected provider with the original model name.
5. **Response & Learning**: The response is returned to you, and the gateway updates its latency and health metrics for future routing decisions.
The AI Gateway distributes requests across providers (OpenAI, Anthropic, Google, etc.) that support the requested model.
Example: When you request `gpt-4o-mini`, the load balancer considers every provider that serves it or a mapped equivalent:
- OpenAI: `gpt-4o-mini` (served directly)
- Anthropic: `claude-3-5-haiku` (mapped equivalent)
- Ollama: `llama3.2` (mapped equivalent)
Example: When you request a model not in any mappings, only providers that list that exact model are considered.
The load balancer only considers providers that can actually handle your request.
All load balancing strategies automatically handle provider failures through intelligent health monitoring:
- Providers with high error rates (default: >10%) are automatically removed
- Rate-limited providers are temporarily removed and re-added when limits reset
- Providers need a minimum number of requests (default: 20) before being considered for removal
- Unhealthy providers are periodically retested and re-added when healthy
The AI Gateway monitors provider health every 5 seconds by default. The health check uses a rolling 60-second window with configurable error thresholds.
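If you need different thresholds, these map to configuration values; a sketch with hypothetical key names (check the Configuration Reference for the real ones):

```yaml
# Hypothetical key names for illustration only
health:
  check-interval: 5s      # how often provider health is evaluated
  window: 60s             # rolling window for error-rate calculation
  error-threshold: 0.10   # remove providers above 10% errors
  min-requests: 20        # minimum sample size before a provider can be removed
```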
| Use Case | Recommended Strategy |
|---|---|
| Production APIs | Latency-based - Automatically optimizes for speed |
| Provider migration | Weighted - Gradual traffic shifting with instant rollback |
| A/B testing | Weighted - Controlled traffic splits for comparison |
| Cost optimization | Weighted - Route more traffic to cheaper providers |
| Compliance routing | Multiple AI Gateways - Better isolation |
For compliance requirements, deploy multiple AI Gateway instances rather than complex routing logic. This provides better isolation, security, and auditability.
Use case: European data must stay in Europe.
Use case: Patient data requires HIPAA-compliant providers.
Use case: Different security clearance levels.
Benefits: Separate networks, authentication, audit trails, and certification scope per deployment.
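For example, an EU-only deployment might run its own gateway with only EU-hosted providers configured (provider entries are hypothetical, keys illustrative):

```yaml
# eu-gateway-config.yaml -- a separate deployment for EU traffic
# Provider names here are hypothetical; see the Configuration Reference
routers:
  default:
    load-balance:
      chat:
        strategy: latency
        providers:
          - azure-openai-eu    # EU-hosted endpoint
          - mistral            # EU-based provider
```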
For complete configuration options and syntax, see the Configuration Reference.
The following load balancing features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Cost-Optimized Strategy | Route to the cheapest equivalent model - picks the provider that offers the same model or configured equivalent models for the lowest price | v2 |
| Model-Level Weighted Strategy | Provider + model specific weighting - configure weights for provider+model pairs (e.g., `openai/o1` vs `bedrock/claude-3-5-sonnet`) | v2 |
| Tag-based Routing | Header-driven routing decisions - route requests to specific providers and models based on tags passed via request headers | v3 |