LLM Caching
Reduce latency and save costs on LLM calls by caching responses on the edge. Configure cache duration, bucket sizes, and use cache seeds for consistent results across requests.
Dashboard view of cache hits, cost and time saved.
Who can use this feature: Anyone on any
plan. However, the maximum number of caches
you can store within a bucket is 20
. If you need to store more, you will
need to upgrade to an enterprise plan.
Introduction
Helicone uses Cloudflare Workers to temporarily store data closer to the user to ensure low latency, resulting in faster responses and reduced costs.
Why Cache
- Faster response for commonly asked questions, resulting in better experience for your users.
- Lower latency and reduce the load on backend resources by pre-computing results or frequently accessed data, so you can develop your app more efficiently.
- Save money while testing your app by making fewer calls to model providers such as OpenAI.
- Determine the most common requests with your application and visualize on a dashboard.
Quick Start
To get started, set Helicone-Cache-Enabled
to true in the headers, or use the Python or NPM packages to turn it on via parameters.
Cache Parameters
Parameter | Description |
---|---|
Helicone-Cache-Enabled (required) | Set to true to enable storing and loading from your cache. |
Cache-Control (optional) | Configure cache limit as a string based on the Cloudflare Cache Directive. Currently we only support max-age , but we will be adding more configuration options soon. I.e. 1 hour is max-age=3600 . |
Helicone-Cache-Bucket-Max-Size (optional) | Configure your Cache Bucket size as a number . |
Helicone-Cache-Seed (optional) | Define a separate cache state as a string to generate predictable results, i.e. user-123 . |
Header values have to be strings. For example,
"Helicone-Cache-Bucket-Max-Size": "10"
.
Changing Cache Limit
The default cache limit is 7 days. To change the limit, add the Cache-Control
header to your request.
Example: Setting the cache limit to 30 days, aka 2592000 seconds
"Cache-Control": "max-age=2592000"
max-age=31536000
. Configuring Bucket Size
Simply add Helicone-Cache-Bucket-Max-Size
with some number to choose how large you want your bucket size to be.
Example: A bucket size of 3
openai.completion("give me a random number") -> "42"
# Cache Miss
openai.completion("give me a random number") -> "47"
# Cache Miss
openai.completion("give me a random number") -> "17"
# Cache Miss
openai.completion("give me a random number") -> This will randomly choose 42 | 47 | 17
# Cache Hit
The max number of caches you can store is 20
within a bucket, if you want
more you will need to upgrade to an enterprise
plan.
Adding Cache Seed
When you make a request to Helicone with the same seed, you will receive the same cached response for the same query. This feature allows for predictable results, which can be beneficial in scenarios where you want to have a consistent cache across multiple requests.
To set a cache seed, add a header called Helicone-Cache-Seed
with a string value for the seed.
"Helicone-Cache-Seed": "user-123"
Example: Making the same request with 2 different seeds
By making a request with a Cache Seed user-123
and query “give me a random number”, you will always receive the same response (e.g., “42”), as long as the cache conditions remain unchanged. Now change the Cache Seed to user-456
while making the same query will yield a different result (e.g., “17”), demonstrating how different seeds can maintain separate cache states.
# Bucket size 1
# Cache Seed "user-123"
openai.completion("give me a random number") -> "42"
openai.completion("give me a random number") -> "42"
# Cache Seed "user-456"
openai.completion("give me a random number") -> "17"
# Cache Seed "user-123"
openai.completion("give me a random number") -> "42"
# Cache Seed "user-456"
openai.completion("give me a random number") -> "17"
If you don’t like one of generated response stored in cache, you can update your seed to a different value as a way to clear your cache.
Extracting Cache Response Headers
When cache is enabled, you can capture the cache status from the headers of the response, such as a cache hit / miss
and the cache bucket index
of the response returned.
helicone-cache: "HIT" | "MISS" // indicates whether the response was cached.
helicone-cache-bucket-idx: number // indicates the cache bucket index used.
Example: Extracting headers from python with OpenAI
client = OpenAI(
api_key="<OPENAI_API_KEY>",
base_url="https://oai.helicone.ai/v1",
default_headers={
"Helicone-Auth": f"Bearer <API_KEY>",
}
)
# 1. add `.with_raw_response` here
chat_completion_raw = client.chat.completions.with_raw_response.create(
model="gpt-4-vision-preview",
messages=[
{"role": "user", "content": "Hello world!"}
],
extra_headers={
"Helicone-Cache-Enabled": "true" # make sure cache is enabled
"Cache-Control": "max-age = 2592000", # change cache limit (optional)
"Helicone-Cache-Bucket-Max-Size": "3", # configure cache bucket size (optional)
"Helicone-Cache-Seed": "1", # add cache seed (optional)
},
)
# This is the original parsed response as expected...
chat_completion = chat_completion_raw.parse()
# 2. get header response
cache_hit = chat_completion_raw.http_response.headers.get(
'Helicone-Cache')
print(cache_hit) # will print "HIT" or "MISS"