Rate limits are an important feature that allows you to control the number of requests made with your API key within a specific time window. For example, you can limit users to 1000 requests per day or 60 requests per minute. By implementing rate limits, you can prevent abuse while protecting your resources from being overwhelmed by excessive traffic.

Keep in mind that your custom rate limiting policy is secondary to OpenAI’s rate limits for your key: a request can still be rejected by OpenAI even when it is within your Helicone policy. For more information, refer to OpenAI’s rate limit documentation.

Getting Started

To set up rate limiting, you only need to provide the Helicone-RateLimit-Policy header in your request. This will rate limit all requests made with the specified API key.

The header value should follow this format: [quota];w=[time_window];s=[segment]

  • quota (required): The maximum number of requests allowed within the specified time window.
  • time_window (required): The length of the time window in seconds. The minimum value is 60.
  • segment (optional): The rate limiting segment. Can be “user” or the name of a custom property. If omitted, the policy applies to all requests made with the API key.

Example request:

curl https://oai.hconeai.com/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Helicone-Property-IP: USER_IP_ADDRESS' \
  -H 'Helicone-RateLimit-Policy: 1000;w=60;s=ip' \
  -d '{
    "model": "text-davinci-003",
    "prompt": "How do I enable custom rate limit policies?"
  }'

Fun fact: this policy format follows an IETF draft for communicating rate limits! The segment field is the exception; that’s a Helicone special twist 🍬

Filtering by segments

You can rate limit for all of your requests made with the API key, by user, or by a custom property. Here’s how to set the segment field s=[segment]:

  • For global rate limiting, leave the segment field empty. Your policy can look like 1000;w=60
  • For rate limiting by user, set the segment field to user. The user ID must be included as a parameter in the request or in the helicone-user-id header. See User Metrics for more details.
  • For rate limiting by a custom property, set the segment field to the desired property name in the policy 1000;w=60;s=[property_name], and include a corresponding header in the request, formatted as helicone-property-{property_name}.
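
The three segment options above can be sketched as a small helper that assembles the policy string and the headers each option relies on. This is an illustrative sketch, not part of Helicone's SDK: the function names build_policy and rate_limit_headers are hypothetical, and rejecting non-positive quotas is an assumption made for the example.

```python
from typing import Dict, Optional


def build_policy(quota: int, time_window: int, segment: Optional[str] = None) -> str:
    """Return a policy value such as '1000;w=60' or '1000;w=60;s=user'."""
    # Assumption for this sketch: a quota of zero or less is treated as a mistake.
    if quota <= 0:
        raise ValueError("quota must be a positive number of requests")
    # The time window is expressed in seconds and has a minimum of 60.
    if time_window < 60:
        raise ValueError("time_window is in seconds and must be at least 60")
    policy = f"{quota};w={time_window}"
    # Leaving the segment out gives a global limit for the API key.
    if segment:
        policy += f";s={segment}"
    return policy


def rate_limit_headers(
    policy: str,
    user_id: Optional[str] = None,
    properties: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Collect the policy header plus the headers that segments depend on."""
    headers = {"Helicone-RateLimit-Policy": policy}
    if user_id is not None:
        # Needed when the policy uses s=user.
        headers["helicone-user-id"] = user_id
    for name, value in (properties or {}).items():
        # Needed when the policy uses s=[property_name].
        headers[f"helicone-property-{name}"] = value
    return headers
```

For example, rate_limit_headers(build_policy(1000, 60, "ip"), properties={"ip": "USER_IP_ADDRESS"}) yields the same two Helicone headers as the curl request in Getting Started, up to header-name casing.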

The minimum value for the time window is 60. The only unit for the time window field is seconds, so for example, use 60 * 60 * 24 = 86400 for a single day.
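
To avoid doing the seconds arithmetic by hand, the conversion can be sketched with Python's standard datetime.timedelta. The helper name window_seconds is hypothetical and only illustrates the calculation.

```python
from datetime import timedelta


def window_seconds(**duration: float) -> int:
    """Convert a duration (e.g. days=1, hours=2) to whole seconds for the w= field."""
    seconds = int(timedelta(**duration).total_seconds())
    # The time window has a documented minimum of 60 seconds.
    if seconds < 60:
        raise ValueError("the time window must be at least 60 seconds")
    return seconds


# One day expressed in seconds, usable as "1000;w=86400".
ONE_DAY = window_seconds(days=1)
```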


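The following is a list of example policies to use for Helicone-RateLimit-Policy:

  • 60;w=60 allows 60 requests per minute across all requests made with the API key.
  • 1000;w=86400 allows 1000 requests per day across all requests made with the API key.
  • 1000;w=60;s=user allows 1000 requests per minute for each user.
  • 1000;w=60;s=ip allows 1000 requests per minute for each value of the custom property ip.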

Latency Considerations

Using rate limits adds a small amount of latency to your requests. This feature is deployed with Cloudflare’s key-value store, a low-latency service that stores data in a small number of centralized data centers and caches that data in Cloudflare’s data centers after access. The added latency is minimal compared to multi-second OpenAI requests.

Rate Limit Error

If a request exceeds the rate limit, it is rejected with an HTTP 429 (Too Many Requests) error.
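
One common way for a client to react to a 429 is to retry with exponential backoff. The sketch below assumes a hypothetical send callable that performs the request and returns a (status code, headers, body) tuple; the retry count and delays are illustrative choices, not values recommended by Helicone.

```python
import time
from typing import Callable, Dict, Tuple

# (status code, response headers, response body)
Response = Tuple[int, Dict[str, str], str]


def send_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response[0] != 429:
            return response
        # Wait 1x, 2x, 4x ... base_delay between attempts, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still rate-limited after the final attempt: hand the 429 back to the caller.
    return response
```

Passing sleep as a parameter keeps the function easy to test without real delays. Because the minimum time window is 60 seconds, short backoff delays may not be enough to clear a limit, so callers may prefer longer delays or failing fast.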

Upcoming Features

Very soon, we will support rate limiting by tokens and by cost. Additionally, you will be able to see how close your requests, users, and properties are to hitting their rate limits in the web UI.

Returned Headers

If rate limiting is active, the following headers will be returned:

  • Helicone-RateLimit-Limit: The quota for the number of requests allowed in the time window.
  • Helicone-RateLimit-Remaining: The remaining quota in the current window.
  • Helicone-RateLimit-Policy: The active rate limit policy.

These headers are only returned if a rate limit policy is active.
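
A client can read these headers to see how much quota is left. The sketch below assumes the Limit and Remaining values are plain integers, which the text above does not state explicitly; the names RateLimitStatus and parse_rate_limit_headers are hypothetical.

```python
from typing import Mapping, NamedTuple, Optional


class RateLimitStatus(NamedTuple):
    limit: int       # quota for the current time window
    remaining: int   # requests left in the current window
    policy: str      # the active policy string, e.g. "1000;w=60;s=user"


def parse_rate_limit_headers(headers: Mapping[str, str]) -> Optional[RateLimitStatus]:
    """Return the rate limit status, or None when no policy is active."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return RateLimitStatus(
            # Assumption: both values are sent as plain integers.
            limit=int(lowered["helicone-ratelimit-limit"]),
            remaining=int(lowered["helicone-ratelimit-remaining"]),
            policy=lowered["helicone-ratelimit-policy"],
        )
    except KeyError:
        # The headers are only returned when a rate limit policy is active.
        return None
```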

Test out your rate limit policy in a local environment before deploying to production. Contact help@helicone.ai if you have any questions.