GCRA-based rate limiting with burst capacity and smooth request throttling
The AI Gateway provides flexible rate limiting using GCRA (Generic Cell Rate Algorithm) to help you manage request frequency and prevent abuse. Rate limiting works per-API-key, which requires authentication to identify users.
Benefits:
- Smooth throttling: GCRA enforces a steady sustained rate instead of hard fixed windows that reset all at once
- Burst capacity: short traffic spikes are absorbed up to the configured bucket size
- Per-API-key isolation: one key exhausting its limit does not affect another key's quota
Provider rate limits are handled automatically by the load balancing system. This rate limiting feature is for controlling your own API traffic based on your business requirements.
Get your Helicone API key
Rate limiting requires authentication. Get your Helicone API key (it begins with sk-helicone-) and make it available to the gateway:
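A quick sketch for exporting the key; the environment variable name below is an assumption, so confirm the exact variable against the Configuration Reference:

```bash
# Hypothetical variable name; confirm against the Configuration Reference.
export HELICONE_CONTROL_PLANE_API_KEY="sk-helicone-..."   # your key, truncated here
```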
Create your configuration
Create ai-gateway-config.yaml with authentication and rate limiting:
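A configuration sketch; the key names below (authentication, rate-limit, capacity, refill-frequency, and the router block) are assumptions based on this page's field descriptions, so confirm them against the Configuration Reference:

```yaml
# Illustrative sketch; confirm exact key names against the Configuration Reference.
helicone:
  authentication: true        # require Helicone API keys on incoming requests

global:
  rate-limit:
    per-api-key:
      capacity: 500           # burst size: up to 500 requests back-to-back
      refill-frequency: 1s    # sustained rate: 500 requests per second per key

routers:
  my-router:                  # router block sketched from the quickstart pattern
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
```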
Start the gateway
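One way to run it, assuming the npx distribution of the gateway and a --config flag (adjust if you installed it another way):

```bash
# Assumes the npx distribution and a --config flag; adjust to your install method.
npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml
```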
Test rate limiting
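A smoke-test sketch, assuming the gateway's default port of 8080, a router named "my-router", and the OpenAI-compatible request format (the model identifier is illustrative). Fire enough requests to drain the bucket and watch for 429s:

```bash
# Assumes default port 8080 and a router named "my-router"; adjust as needed.
for i in $(seq 1 600); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    http://localhost:8080/router/my-router/chat/completions \
    -H "Authorization: Bearer sk-helicone-..." \
    -H "Content-Type: application/json" \
    -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "ping"}]}'
done
# Expect 200s until the bucket empties, then 429 (Too Many Requests).
```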
✅ The gateway tracks requests per API key and enforces your limits!
For complete configuration options and syntax, see the Configuration Reference.
Per-API-Key Rate Limiting - Default
GCRA-based token bucket with burst capacity
Each API key gets a virtual token bucket with the configured capacity. Requests consume tokens, and tokens refill continuously at the configured rate; GCRA (Generic Cell Rate Algorithm) keeps the sustained rate smooth while still allowing bursts.
Best for: Preventing API key abuse while allowing reasonable burst traffic
How it works: with a bucket capacity of C and a refill period of T, up to C requests can arrive back-to-back; once the bucket empties, requests are admitted at the steady sustained rate of C per T rather than in fixed windows that reset all at once.
Example:
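A minimal sketch, assuming capacity and refill-frequency field names (see the Configuration Reference for the exact schema). These values allow bursts of up to 500 requests and a sustained rate of 500 requests per second, per API key:

```yaml
global:
  rate-limit:
    per-api-key:
      capacity: 500           # bucket size: max burst per API key
      refill-frequency: 1s    # bucket refills fully once per second
```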
Additional rate limiting strategies (Per-End-User, Per-Team, Spend Limits, Usage Limits) are coming soon for more granular control.
Use case: Production API that needs to prevent abuse while allowing reasonable burst traffic for legitimate users.
Use case: Different service tiers with varying rate limits. Premium router gets higher limits than basic router.
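One way to sketch tiered limits, reusing the same assumed fields at the router level (values illustrative):

```yaml
# Illustrative per-router limits; field names assumed, see the Configuration Reference.
routers:
  premium:
    rate-limit:
      per-api-key:
        capacity: 5000         # generous burst for premium traffic
        refill-frequency: 1s
  basic:
    rate-limit:
      per-api-key:
        capacity: 500          # tighter burst for the basic tier
        refill-frequency: 1s
```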
Use case: Different environments with different rate limiting strategies - production allowing higher throughput, development with conservative limits for cost safety.
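Sketched the same way for environments, with production tuned for throughput and development capped for cost safety (field names and values illustrative):

```yaml
routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 2000         # high throughput for live traffic
        refill-frequency: 1s
  development:
    rate-limit:
      per-api-key:
        capacity: 100          # conservative: 100 requests per minute
        refill-frequency: 1m
```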
Request Arrives
A request comes in with an API key via Authorization header
Rate Limit Check
The gateway checks rate limits in precedence order: global limits first, then router-specific limits (see the precedence table below)
Token Consumption
If limits allow, the request consumes a token from the API key’s bucket
Request Processing
The request proceeds to load balancing and provider routing
Token Refill
Tokens continuously refill at the configured rate for future requests
Rate limits are applied at different levels with clear precedence:
| Level | Description | When Applied |
|---|---|---|
| Global Rate Limits | Application-wide limits across all routers | Checked first, as a safety net |
| Router-Specific Rate Limits | Individual router limits or opt-out | Checked after global limits pass |
Rate limiting counters can be stored in different backends depending on your deployment needs:
In-Memory Storage
Local memory storage
Rate limiting state is stored locally in each router instance. Fast and simple, but limits are not shared across multiple instances.
Best for: single-instance deployments, local development, and testing, where rate limit state does not need to be shared.
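A sketch of selecting the in-memory backend; the rate-limit-store key and its type field are assumptions, so verify them against the Configuration Reference:

```yaml
# Assumed field names; counters live inside each router instance.
rate-limit-store:
  type: in-memory
```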
Redis Storage
Distributed rate limiting state stored in Redis for coordination across multiple router instances
With custom host-url:
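A sketch assuming a redis store type alongside the host-url and connection-timeout options documented below:

```yaml
# host-url follows redis://[username:password@]host[:port][/database]
rate-limit-store:
  type: redis                                          # assumed field name
  host-url: "redis://user:secret@redis.internal:6379/0"
  connection-timeout: 5                                # seconds (default: 5)
```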
Options:
- host-url: Redis connection string in the form redis://[username:password@]host[:port][/database]
- connection-timeout: Connection timeout in seconds (default: 5)

Mixed Storage
Mixed storage for different environments
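One way to express this, sketched as separate per-environment configs in a single multi-document YAML (field names assumed, as above):

```yaml
# Production config: Redis coordinates counters across gateway replicas.
rate-limit-store:
  type: redis
  host-url: "redis://redis.internal:6379/0"
---
# Development config: fast local counters, no cross-instance coordination.
rate-limit-store:
  type: in-memory
```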
For complete configuration options and syntax, see the Configuration Reference.
The following rate limiting features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Per-End-User Limits | Rate limits applied to end users via the Helicone-User-Id header for SaaS user quotas | v1 |
| Database Storage | Persistent rate limiting state with advanced querying capabilities for analytics and compliance | v2 |
| Per-Team Limits | Rate limits applied to teams for budget and governance controls | v2 |
| Per-Team-Member Limits | Rate limits applied to individual team members for governance | v2 |
| Spend Limits | Cost-based limits that restrict usage based on dollar amounts spent per time period | v2 |
| Usage Limits | Token-based limits that restrict usage based on input/output tokens consumed | v2 |