Rate Limiting & Spend Controls
GCRA-based rate limiting with burst capacity and smooth request throttling
The AI Gateway provides flexible rate limiting using GCRA (Generic Cell Rate Algorithm) to help you manage request frequency and prevent abuse. Rate limiting works per-API-key, which requires authentication to identify users.
Benefits:
- Prevent abuse by limiting request rates per API key
- Manage costs by controlling request frequency
- Ensure stability by preventing traffic spikes from overwhelming your system
- Fair usage by distributing capacity across different API keys
- Control your own traffic based on your business requirements
Provider rate limits are handled automatically by the load balancing system. This rate limiting feature is for controlling your own API traffic based on your business requirements.
Quick Start
Get your Helicone API key
Rate limiting requires authentication. Get your Helicone API key:
- Go to Helicone Settings
- Click "Generate New Key"
- Copy the key (it starts with `sk-helicone-`)
- Set it as an environment variable
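For example, the key can be exported before starting the gateway. The variable name `HELICONE_CONTROL_PLANE_API_KEY` below is an assumption for illustration; substitute the variable your gateway version actually reads:

```shell
# Assumed variable name; replace the placeholder with your real key
export HELICONE_CONTROL_PLANE_API_KEY="sk-helicone-<your-key>"
```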
Create your configuration
Create ai-gateway-config.yaml
with authentication and rate limiting:
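A minimal configuration might look like the sketch below. The field names (`helicone.authentication`, `rate-limit`, `per-api-key`, `capacity`, `refill-frequency`) and their semantics are assumptions based on the burst-plus-refill model described in this page; the Configuration Reference has the authoritative schema:

```yaml
# Illustrative sketch only; key names are assumptions, see the Configuration Reference
helicone:
  authentication: true        # require Helicone API keys on incoming requests
routers:
  my-router:
    rate-limit:
      per-api-key:
        capacity: 500         # burst capacity: max requests allowed at once
        refill-frequency: 1s  # assumed semantics: capacity refills over this window
```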
Start the gateway
Test rate limiting
✅ The gateway tracks requests per API key and enforces your limits!
For complete configuration options and syntax, see the Configuration Reference.
Available Strategies
The gateway currently supports per-API-key rate limiting, as configured above. Additional strategies (Per-End-User, Per-Team, Spend Limits, Usage Limits) are coming soon for more granular control.
Use Cases
- Production API: prevent abuse while allowing reasonable burst traffic for legitimate users.
- Tiered service levels: different rate limits per router; a premium router gets higher limits than a basic router.
- Environment-specific limits: production allows higher throughput, while development uses conservative limits for cost safety.
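The tiered setup could be sketched as below. As before, the field names are assumptions, not the authoritative schema:

```yaml
# Illustrative sketch only; key names are assumptions, see the Configuration Reference
routers:
  premium:
    rate-limit:
      per-api-key:
        capacity: 1000        # larger burst for premium traffic
        refill-frequency: 1s
  basic:
    rate-limit:
      per-api-key:
        capacity: 100         # tighter limits for the basic tier
        refill-frequency: 1s
```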
How It Works
Request Flow
1. Request arrives: a request comes in with an API key in the `Authorization` header.
2. Rate limit check: the gateway checks rate limits in precedence order: global rate limits (application-wide) first, then router-specific rate limits.
3. Token consumption: if the limits allow it, the request consumes a token from the API key's bucket.
4. Request processing: the request proceeds to load balancing and provider routing.
5. Token refill: tokens refill continuously at the configured rate, restoring capacity for future requests.
Configuration Scope
Rate limits are applied at different levels with clear precedence:
| Level | Description | When Applied |
|---|---|---|
| Global Rate Limits | Application-wide limits across all routers | Checked first, as a safety net |
| Router-Specific Rate Limits | Individual router limits, or an opt-out | Checked after global limits pass |
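A sketch of how the two levels might be expressed in one file (key names assumed; see the Configuration Reference for the real schema):

```yaml
# Illustrative sketch only; key names are assumptions
global:
  rate-limit:
    per-api-key:
      capacity: 2000        # safety net across all routers, checked first
      refill-frequency: 1s
routers:
  my-router:
    rate-limit:
      per-api-key:
        capacity: 500       # checked only after the global limit passes
        refill-frequency: 1s
```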
Storage Backend Options
Rate limiting counters are kept in memory by default. Distributed (Redis) and persistent (database) backends are planned; see Coming Soon below.
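Selecting a backend might look like the following sketch (the key name and accepted values are assumptions for illustration):

```yaml
# Illustrative sketch only; key name and values are assumptions
rate-limit-store: in-memory   # the default until Redis (v1) and database (v2) backends ship
```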
For complete configuration options and syntax, see the Configuration Reference.
Coming Soon
The following rate limiting features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Redis Storage | Distributed rate limiting state stored in Redis for coordination across multiple router instances | v1 |
| Database Storage | Persistent rate limiting state with advanced querying capabilities for analytics and compliance | v2 |
| Per-End-User Limits | Rate limits applied to end users via the Helicone-User-Id header for SaaS user quotas | v1 |
| Per-Team Limits | Rate limits applied to teams for budget and governance controls | v2 |
| Per-Team-Member Limits | Rate limits applied to individual team members for governance | v2 |
| Spend Limits | Cost-based limits that restrict usage based on dollar amounts spent per time period | v2 |
| Usage Limits | Token-based limits that restrict usage based on input/output tokens consumed | v2 |