Rate Limiting & Spend Controls - Helicone OSS LLM Observability

The AI Gateway provides flexible rate limiting using GCRA (Generic Cell Rate Algorithm) to help you manage request frequency and prevent abuse. Rate limiting works per-API-key, which requires authentication to identify users. Benefits:

Prevent abuse by limiting request rates per API key
Manage costs by controlling request frequency
Ensure stability by preventing traffic spikes from overwhelming your system
Fair usage by distributing capacity across different API keys
Control your own traffic based on your business requirements

Provider rate limits are handled automatically by the load balancing system. This rate limiting feature is for controlling your own API traffic based on your business requirements.

Quick Start

Get your Helicone API key

Rate limiting requires authentication. Get your Helicone API key:

Go to Helicone Settings
Click “Generate New Key”
Copy the key (starts with sk-helicone-)
Set it as an environment variable:

export HELICONE_CONTROL_PLANE_API_KEY="sk-helicone-your-api-key"

Create your configuration

Create ai-gateway-config.yaml with authentication and rate limiting:

helicone:
  # Set to `features: observability` to enable observability and auth
  features: auth

global:
  rate-limit:
    store: in-memory
    per-api-key:
      capacity: 1000
      refill-frequency: 1m

Start the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml

Test rate limiting

curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-helicone-your-api-key" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

✅ The gateway tracks requests per API key and enforces your limits!

For complete configuration options and syntax, see the Configuration Reference.

Available Strategies

Per-API-Key Rate Limiting - Default

GCRA-based token bucket with burst capacityEach API key gets a virtual token bucket with specified capacity. Requests consume tokens, which refill at the specified rate. Uses GCRA (Generic Cell Rate Algorithm) for smooth sustained rates with burst allowance.Best for: Preventing API key abuse while allowing reasonable burst trafficHow it works:

Each API key gets a token bucket with specified capacity
Requests consume tokens from the bucket
Tokens refill at the specified rate (capacity / refill-frequency)
Requests are allowed if tokens are available, rejected otherwise

Example:

global:
  rate-limit:
    per-api-key:
      capacity: 500
      refill-frequency: 1m  # 500 requests per minute sustained

Additional rate limiting strategies (Per-End-User, Per-Team, Spend Limits, Usage Limits) are coming soon for more granular control.

Use Cases

Use case: Production API that needs to prevent abuse while allowing reasonable burst traffic for legitimate users.

helicone:
  # Set to `features: observability` to enable observability and auth
  features: auth

rate-limit-store:
  type: in-memory

global:
  rate-limit:
    per-api-key:
      capacity: 1000
      refill-frequency: 1m  # 1000 requests per minute
    cleanup-interval: 5m

How It Works

Request Flow

Request Arrives

A request comes in with an API key via Authorization header

Rate Limit Check

The gateway checks rate limits in precedence order:

Global rate limits: Application-wide limits checked first
Router-specific rate limits: Individual router limits checked second

Token Consumption

If limits allow, the request consumes a token from the API key’s bucket

Request Processing

The request proceeds to load balancing and provider routing

Token Refill

Tokens continuously refill at the configured rate for future requests

Configuration Scope

Rate limits are applied at different levels with clear precedence:

Level	Description	When Applied
Global Rate Limits	Application-wide limits across all routers	Checked first as safety net
Router-Specific Rate Limits	Individual router limits or opt-out	Checked after global limits pass

Storage Backend Options

Rate limiting counters can be stored in different backends depending on your deployment needs:

In-Memory Storage

Local memory storageRate limiting state is stored locally in each router instance. Fast and simple, but limits are not shared across multiple instances.

rate-limit-store:
  type: in-memory

Best for:

Single instance deployments
Development environments
High-performance scenarios where cross-instance coordination isn’t needed

Redis Storage

Distributed rate limiting state stored in Redis for coordination across multiple router instances

rate-limit-store:
  type: redis

With custom host-url:

rate-limit-store:
  type: redis
  host-url: "redis://localhost:6379"

Options:

host-url: Redis connection string
- Supports standard Redis URL format: redis://[username:password@]host[:port][/database]
connection-timeout: Connection timeout in seconds (default: 5)

Mixed Storage

Mixed storage for different environments

rate-limit-store:
  type: redis

global: # this will use redis for global rate limits
  rate-limit:
    per-api-key:
      capacity: 1000
      refill-frequency: 1m

routers:
  production:
    rate-limit: # this will also use redis for router-specific rate limits
      per-api-key:
        capacity: 1000
        refill-frequency: 1m

  development:
    rate-limit:
      store: in-memory # this will use in-memory for router-specific rate limits
      per-api-key:
        capacity: 100
        refill-frequency: 1m

For complete configuration options and syntax, see the Configuration Reference.

Coming Soon

The following rate limiting features are planned for future releases:

Feature	Description	Version
Database Storage	Persistent rate limiting state with advanced querying capabilities for analytics and compliance	v2
Per-End-User Limits	Rate limits applied to end users via `Helicone-User-Id` header for SaaS user quotas	v1
Per-Team Limits	Rate limits applied to teams for budget and governance controls	v2
Per-Team-Member Limits	Rate limits applied to individual team members for governance	v2
Spend Limits	Cost-based limits that restrict usage based on dollar amounts spent per time period	v2
Usage Limits	Token-based limits that restrict usage based on input/output tokens consumed	v2

Guides

​Quick Start

​Available Strategies

​Use Cases

​How It Works

​Request Flow

​Configuration Scope

​Storage Backend Options

​Coming Soon

Quick Start

Available Strategies

Use Cases

How It Works

Request Flow

Configuration Scope

Storage Backend Options

Coming Soon