Dashboard view of cache hits, cost and time saved.

Introduction

Helicone uses Cloudflare Workers to temporarily store data closer to the user, which lowers latency, speeds up responses, and reduces costs.

Why Cache

  • Faster responses to commonly asked questions, resulting in a better experience for your users.
  • Lower latency and reduced load on backend resources, because pre-computed results and frequently accessed data are served from the cache, so you can develop your app more efficiently.
  • Save money while testing your app by making fewer calls to model providers such as OpenAI.
  • Determine the most common requests made to your application and visualize them on a dashboard.

Quick Start

To get started, set Helicone-Cache-Enabled to true in the headers, or use the Python or NPM packages to turn it on via parameters.

Cache Parameters

Parameter | Description
Helicone-Cache-Enabled (required) | Set to true to enable storing and loading from your cache.
Cache-Control (optional) | Configure the cache limit as a string based on the Cloudflare Cache Directive. Currently only max-age is supported, but more configuration options will be added soon. For example, 1 hour is max-age=3600.
Helicone-Cache-Bucket-Max-Size (optional) | Configure your cache bucket size as a number.
Helicone-Cache-Seed (optional) | Define a separate cache state as a string to generate predictable results, for example user-123.

Header values have to be strings. For example, "Helicone-Cache-Bucket-Max-Size": "10".
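
Because every header value must be a string, it can help to build the cache headers in one place. The sketch below is a minimal illustration: `build_cache_headers` is a hypothetical helper, not part of any Helicone package, and only the header names come from the parameter list above.

```python
from typing import Dict, Optional


def build_cache_headers(
    max_age_seconds: Optional[int] = None,
    bucket_max_size: Optional[int] = None,
    seed: Optional[str] = None,
) -> Dict[str, str]:
    """Return Helicone cache headers with every value as a string.

    Hypothetical helper for illustration only.
    """
    # Helicone-Cache-Enabled is the only required header.
    headers = {"Helicone-Cache-Enabled": "true"}
    if max_age_seconds is not None:
        headers["Cache-Control"] = f"max-age={max_age_seconds}"
    if bucket_max_size is not None:
        # Numbers are converted explicitly, since header values must be strings.
        headers["Helicone-Cache-Bucket-Max-Size"] = str(bucket_max_size)
    if seed is not None:
        headers["Helicone-Cache-Seed"] = str(seed)
    return headers


headers = build_cache_headers(max_age_seconds=3600, bucket_max_size=10, seed="user-123")
print(headers)
```

The resulting dictionary can be passed wherever your HTTP client or SDK accepts extra request headers.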

Changing Cache Limit

The default cache limit is 7 days. To change the limit, add the Cache-Control header to your request.

Example: Setting the cache limit to 30 days (2592000 seconds)

"Cache-Control": "max-age=2592000"

The max cache limit is 365 days, or max-age=31536000.
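
To avoid converting durations to seconds by hand, the value can be computed. This is a minimal sketch: `cache_control_for` is a hypothetical helper name, and the 365-day ceiling is taken from the limit stated above.

```python
from datetime import timedelta

# Documented maximum cache limit: 365 days.
MAX_CACHE_AGE = timedelta(days=365)


def cache_control_for(duration: timedelta) -> str:
    """Build a Cache-Control value such as "max-age=3600" from a duration."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("cache duration must be positive")
    if duration > MAX_CACHE_AGE:
        raise ValueError("cache duration exceeds the 365-day maximum")
    return f"max-age={seconds}"


print(cache_control_for(timedelta(days=30)))   # max-age=2592000
print(cache_control_for(timedelta(days=365)))  # max-age=31536000
```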

Configuring Bucket Size

Add the Helicone-Cache-Bucket-Max-Size header with a number to set how many responses a bucket can hold.

Example: A bucket size of 3

openai.completion("give me a random number") -> "42"
# Cache Miss
openai.completion("give me a random number") -> "47"
# Cache Miss
openai.completion("give me a random number") -> "17"
# Cache Miss

openai.completion("give me a random number") -> This will randomly choose 42 | 47 | 17
# Cache Hit

A bucket can store at most 20 cached responses. To store more, you will need to upgrade to an enterprise plan.
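
The bucket behavior in the example above can be modeled locally. The sketch below is a toy in-memory model, not Helicone's implementation: `ToyBucketCache` and `fake_model` are hypothetical names, and the model is assumed to fill the bucket on misses and then pick a stored response at random on hits.

```python
import random
from typing import Callable, Dict, List, Tuple


class ToyBucketCache:
    """Toy model of a cache bucket, for illustration only."""

    def __init__(self, bucket_max_size: int = 1) -> None:
        self.bucket_max_size = bucket_max_size
        self._buckets: Dict[str, List[str]] = {}

    def complete(self, prompt: str, call_model: Callable[[str], str]) -> Tuple[str, str]:
        bucket = self._buckets.setdefault(prompt, [])
        if len(bucket) < self.bucket_max_size:
            # Bucket not full yet: call the model and store the response.
            response = call_model(prompt)
            bucket.append(response)
            return response, "MISS"
        # Bucket full: return one of the stored responses at random.
        return random.choice(bucket), "HIT"


fake_responses = iter(["42", "47", "17"])


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return next(fake_responses)


cache = ToyBucketCache(bucket_max_size=3)
results = [cache.complete("give me a random number", fake_model) for _ in range(4)]
for response, status in results:
    print(status, response)
```

The first three calls are misses that fill the bucket; the fourth is a hit that returns one of the three stored responses.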

Adding Cache Seed

When you make a request to Helicone with the same seed, you will receive the same cached response for the same query. This feature allows for predictable results, which can be beneficial in scenarios where you want to have a consistent cache across multiple requests.

To set a cache seed, add a header called Helicone-Cache-Seed with a string value for the seed.

  "Helicone-Cache-Seed": "user-123"

Example: Making the same request with 2 different seeds

If you make a request with the Cache Seed user-123 and the query “give me a random number”, you will always receive the same response (e.g., “42”), as long as the cache conditions remain unchanged. Changing the Cache Seed to user-456 while making the same query yields a different result (e.g., “17”), demonstrating how different seeds maintain separate cache states.

# Bucket size 1

# Cache Seed "user-123"
openai.completion("give me a random number") -> "42"
openai.completion("give me a random number") -> "42"

# Cache Seed "user-456"
openai.completion("give me a random number") -> "17"

# Cache Seed "user-123"
openai.completion("give me a random number") -> "42"

# Cache Seed "user-456"
openai.completion("give me a random number") -> "17"

If you don’t like one of the generated responses stored in the cache, you can change your seed to a different value as a way to clear your cache.
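
One way to picture seeds is as part of the cache key. The following sketch is a toy model under that assumption, with a bucket size of 1; `cached_completion` and `fake_model` are hypothetical names and this is not how Helicone is implemented internally.

```python
from typing import Callable, Dict, Tuple

# Toy cache keyed by (seed, prompt); each seed has its own cache state.
_cache: Dict[Tuple[str, str], str] = {}


def cached_completion(seed: str, prompt: str, call_model: Callable[[str], str]) -> str:
    key = (seed, prompt)
    if key not in _cache:
        # First request for this seed and prompt: call the model and store it.
        _cache[key] = call_model(prompt)
    return _cache[key]


fake_responses = iter(["42", "17", "8"])


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return next(fake_responses)


query = "give me a random number"
first = cached_completion("user-123", query, fake_model)
repeat = cached_completion("user-123", query, fake_model)
other = cached_completion("user-456", query, fake_model)
# Switching to an unused seed behaves like clearing the cache for this query.
fresh = cached_completion("user-123-v2", query, fake_model)
print(first, repeat, other, fresh)
```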

Extracting Cache Response Headers

When cache is enabled, you can capture the cache status from the headers of the response, such as a cache hit / miss and the cache bucket index of the response returned.

helicone-cache: "HIT" | "MISS" // indicates whether the response was cached.
helicone-cache-bucket-idx: number // indicates the cache bucket index used.
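
Both headers can be read into a small structure once the response headers are available as a mapping. This is a minimal sketch: `CacheStatus` and `parse_cache_status` are hypothetical names, and a missing or non-numeric bucket index is assumed to map to `None`.

```python
from typing import Mapping, NamedTuple, Optional


class CacheStatus(NamedTuple):
    hit: bool
    bucket_idx: Optional[int]


def parse_cache_status(headers: Mapping[str, str]) -> CacheStatus:
    """Extract the cache status and bucket index from response headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    hit = lowered.get("helicone-cache", "").upper() == "HIT"
    raw_idx = lowered.get("helicone-cache-bucket-idx")
    if raw_idx is not None and raw_idx.strip().isdigit():
        bucket_idx: Optional[int] = int(raw_idx)
    else:
        bucket_idx = None
    return CacheStatus(hit, bucket_idx)


status = parse_cache_status({"Helicone-Cache": "HIT", "Helicone-Cache-Bucket-Idx": "2"})
print(status)
```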

Example: Extracting headers in Python with OpenAI

from openai import OpenAI

client = OpenAI(
    api_key="<OPENAI_API_KEY>",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer <API_KEY>",
    }
)

# 1. add `.with_raw_response` here
chat_completion_raw = client.chat.completions.with_raw_response.create(
    model="gpt-4-vision-preview",
    messages=[
        {"role": "user", "content": "Hello world!"}
    ],
    extra_headers={
        "Helicone-Cache-Enabled": "true", # make sure cache is enabled
        "Cache-Control": "max-age=2592000", # change cache limit (optional)
        "Helicone-Cache-Bucket-Max-Size": "3", # configure cache bucket size (optional)
        "Helicone-Cache-Seed": "1", # add cache seed (optional)
    },
)

# This is the original parsed response as expected...
chat_completion = chat_completion_raw.parse()

# 2. get header response
cache_hit = chat_completion_raw.http_response.headers.get(
    'Helicone-Cache')

print(cache_hit) # will print "HIT" or "MISS"