The AI Gateway provides flexible rate limiting using GCRA (the Generic Cell Rate Algorithm) to help you manage request frequency and prevent abuse. Limits are applied per API key, so authentication is required to identify users.

Benefits:

  • Prevent abuse by limiting request rates per API key
  • Manage costs by controlling request frequency
  • Ensure stability by preventing traffic spikes from overwhelming your system
  • Ensure fair usage by distributing capacity across different API keys
  • Control your own traffic based on your business requirements

Provider rate limits are handled automatically by the load balancing system. This rate limiting feature is for controlling your own API traffic based on your business requirements.

Quick Start

Step 1: Get your Helicone API key

Rate limiting requires authentication. Get your Helicone API key:

  1. Go to Helicone Settings
  2. Click “Generate New Key”
  3. Copy the key (starts with sk-helicone-)
  4. Set it as an environment variable:
export HELICONE_CONTROL_PLANE_API_KEY="sk-helicone-your-api-key"
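
As an optional sanity check (not part of the official setup), you can print the first characters of the variable to confirm it is set and starts with sk-helicone-:

# Should print "sk-helicone-" if the key is exported correctly
echo "${HELICONE_CONTROL_PLANE_API_KEY:0:12}"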

Step 2: Create your configuration

Create ai-gateway-config.yaml with authentication and rate limiting:

helicone:
  authentication: true  # required so requests can be attributed to an API key
  observability: false  # Set to true to enable observability

global:
  rate-limit:
    store: in-memory        # counters are kept in process memory
    per-api-key:
      capacity: 1000        # burst size per API key
      refill-frequency: 1m  # capacity refills over one minute (1000 requests per minute)

Step 3: Start the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml

Step 4: Test rate limiting

curl -X POST http://localhost:8080/router/default/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-helicone-your-api-key" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

✅ The gateway tracks requests per API key and enforces your limits!
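
To watch the limiter kick in, you can send a burst of requests and print the status code of each. Once a key exhausts its capacity, further requests are rejected until tokens refill; rate limiters conventionally signal this with HTTP 429, though you should confirm the exact behavior against your gateway version. With the example capacity of 1000 you would need a much larger burst (or a temporarily lowered capacity) before rejections appear. This loop is an illustrative sketch, not part of the official quick start:

# Fire 20 requests back-to-back and print the HTTP status of each
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "request $i -> HTTP %{http_code}\n" \
    -X POST http://localhost:8080/router/default/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-helicone-your-api-key" \
    -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'
done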

For complete configuration options and syntax, see the Configuration Reference.

Available Strategies

Additional rate limiting strategies (Per-End-User, Per-Team, Spend Limits, Usage Limits) are coming soon for more granular control.

Use Cases

Use case: Production API that needs to prevent abuse while allowing reasonable burst traffic for legitimate users.

helicone:
  authentication: true
  observability: false  # Set to true to enable observability

global:
  rate-limit:
    store: in-memory
    per-api-key:
      capacity: 1000
      refill-frequency: 1m  # 1000 requests per minute
    cleanup-interval: 5m
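
For tighter cost control, for example an internal tool with only a handful of users, the same keys can simply be tuned down. This is an illustrative variant of the configuration above, not a separate feature:

helicone:
  authentication: true
  observability: false  # Set to true to enable observability

global:
  rate-limit:
    store: in-memory
    per-api-key:
      capacity: 100
      refill-frequency: 1m  # 100 requests per minute per API key
    cleanup-interval: 5m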

How It Works

Request Flow

  1. Request Arrives: A request comes in with an API key in the Authorization header.
  2. Rate Limit Check: The gateway checks rate limits in precedence order:
       • Global rate limits: application-wide limits are checked first
       • Router-specific rate limits: individual router limits are checked second
  3. Token Consumption: If the limits allow it, the request consumes a token from the API key's bucket.
  4. Request Processing: The request proceeds to load balancing and provider routing.
  5. Token Refill: Tokens continuously refill at the configured rate for future requests.

Configuration Scope

Rate limits are applied at different levels with clear precedence:

Level                       | Description                                 | When Applied
Global Rate Limits          | Application-wide limits across all routers  | Checked first as safety net
Router-Specific Rate Limits | Individual router limits or opt-out         | Checked after global limits pass
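
The exact syntax for router-level limits is documented in the Configuration Reference. As a rough sketch, assuming router-level limits reuse the same per-api-key keys nested under a named router (a hypothetical structure; verify it against the reference), an override might look like:

routers:
  default:
    rate-limit:              # assumed nesting; check the Configuration Reference
      per-api-key:
        capacity: 500        # tighter limit for this router than the global 1000
        refill-frequency: 1m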

Storage Backend Options

Rate limiting counters can be stored in different backends depending on your deployment needs. The examples in this guide use the in-memory store (store: in-memory); distributed Redis and database backends are planned (see Coming Soon below).
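
The backend is selected with the store key under rate-limit, as already shown in the examples above:

global:
  rate-limit:
    store: in-memory  # the only backend used in this guide; Redis and database stores are planned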

For complete configuration options and syntax, see the Configuration Reference.

Coming Soon

The following rate limiting features are planned for future releases:

Feature                | Description                                                                                        | Version
Redis Storage          | Distributed rate limiting state stored in Redis for coordination across multiple router instances | v1
Database Storage       | Persistent rate limiting state with advanced querying capabilities for analytics and compliance   | v2
Per-End-User Limits    | Rate limits applied to end users via Helicone-User-Id header for SaaS user quotas                 | v1
Per-Team Limits        | Rate limits applied to teams for budget and governance controls                                   | v2
Per-Team-Member Limits | Rate limits applied to individual team members for governance                                     | v2
Spend Limits           | Cost-based limits that restrict usage based on dollar amounts spent per time period               | v2
Usage Limits           | Token-based limits that restrict usage based on input/output tokens consumed                      | v2