LLM Caching
Reduce latency and save costs on LLM calls by caching responses on the edge. Configure cache duration, bucket sizes, and use cache seeds for consistent results across requests.
Dashboard view of cache hits, cost and time saved.
Introduction
Helicone uses Cloudflare Workers to temporarily store data closer to the user to ensure low latency, resulting in faster responses and reduced costs.
Why Cache
- Faster responses to commonly asked questions, resulting in a better experience for your users.
- Lower latency and reduced load on backend resources by serving pre-computed results for frequently accessed data, so you can develop your app more efficiently.
- Save money while testing your app by making fewer calls to model providers such as OpenAI.
- Identify your application's most common requests and visualize them on the dashboard.
Quick Start
To get started, set `Helicone-Cache-Enabled` to `true` in the headers, or use the Python or NPM packages to turn it on via parameters.
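As a sketch, here is how enabling the cache might look with the OpenAI Python SDK routed through the Helicone gateway. The base URL and the `Helicone-Auth` header follow Helicone's standard proxy setup; adapt them to your own configuration:

```python
import os

# Headers that enable Helicone's edge cache. All values must be strings.
# HELICONE_API_KEY is assumed to be set in your environment.
helicone_headers = {
    "Helicone-Auth": f"Bearer {os.environ.get('HELICONE_API_KEY', '')}",
    "Helicone-Cache-Enabled": "true",
}

# With the OpenAI Python SDK, attach the headers and route requests
# through the Helicone gateway:
#
# from openai import OpenAI
# client = OpenAI(
#     base_url="https://oai.helicone.ai/v1",
#     default_headers=helicone_headers,
# )
# response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": "Hello"}],
# )
```

Identical requests made with these headers will be served from the edge cache instead of reaching the model provider.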
Cache Parameters
| Parameter | Description |
| --- | --- |
| `Helicone-Cache-Enabled` (required) | Set to `true` to enable storing and loading from your cache. |
| `Cache-Control` (optional) | Configure the cache limit as a string based on the Cloudflare Cache Directive. Currently we only support `max-age`, but we will be adding more configuration options soon. For example, 1 hour is `max-age=3600`. |
| `Helicone-Cache-Bucket-Max-Size` (optional) | Configure your cache bucket size as a number. |
| `Helicone-Cache-Seed` (optional) | Define a separate cache state as a string to generate predictable results, e.g. `user-123`. |
Header values have to be strings. For example, `"Helicone-Cache-Bucket-Max-Size": "10"`.
Changing Cache Limit
The default cache limit is 7 days. To change the limit, add the `Cache-Control` header to your request.

Example: setting the cache limit to 30 days, i.e. 2592000 seconds:

`max-age=2592000`
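Rather than hard-coding the seconds, you can compute the directive from the duration you want. A minimal sketch of the headers for a 30-day cache limit:

```python
# 30 days expressed in seconds for the Cache-Control directive.
thirty_days = 30 * 24 * 60 * 60  # 2592000

headers = {
    "Helicone-Cache-Enabled": "true",
    # Only the max-age directive is currently supported.
    "Cache-Control": f"max-age={thirty_days}",
}
```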
Configuring Bucket Size
Simply add `Helicone-Cache-Bucket-Max-Size` with a number to choose how large you want your bucket size to be.
Example: A bucket size of 3
The maximum number of responses you can store within a bucket is 20. If you want more, you will need to upgrade to an enterprise plan.
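With a bucket size of 3, the same request can store up to three distinct responses, and a cache hit returns one of them. The header itself is just a stringified number:

```python
headers = {
    "Helicone-Cache-Enabled": "true",
    # Must be a string even though it represents a number.
    # A bucket of 3 lets the same request cache up to 3 responses.
    "Helicone-Cache-Bucket-Max-Size": "3",
}
```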
Adding Cache Seed
When you make a request to Helicone with the same seed, you will receive the same cached response for the same query. This feature allows for predictable results, which can be beneficial in scenarios where you want to have a consistent cache across multiple requests.
To set a cache seed, add a header called `Helicone-Cache-Seed` with a string value for the seed.
Example: Making the same request with 2 different seeds
By making a request with the cache seed `user-123` and the query "give me a random number", you will always receive the same response (e.g., "42") as long as the cache conditions remain unchanged. Changing the cache seed to `user-456` while making the same query will yield a different result (e.g., "17"), demonstrating how different seeds maintain separate cache states.
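A small helper (hypothetical, for illustration) makes it easy to scope requests to per-user cache states:

```python
def seeded_headers(seed: str) -> dict:
    """Build Helicone headers scoped to a per-seed cache state."""
    return {
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Seed": seed,
    }

# The same query under different seeds hits different cache states,
# so each user can get their own consistent cached responses.
headers_a = seeded_headers("user-123")
headers_b = seeded_headers("user-456")
```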
If you don’t like a generated response stored in the cache, you can update your seed to a different value as a way to clear your cache.
Extracting Cache Response Headers
When cache is enabled, you can capture the cache status from the response headers, such as the cache hit/miss status and the cache bucket index of the returned response.
Example: Extracting headers in Python with OpenAI
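A sketch of how this might look: the helper below reads the cache metadata from a response's headers, assuming the header names `helicone-cache` ("HIT" or "MISS") and `helicone-cache-bucket-idx` (verify these against your response headers):

```python
def read_cache_status(headers):
    """Extract Helicone cache metadata from response headers.

    Returns a (hit_or_miss, bucket_index) tuple; either value is None
    if the corresponding header is absent. Header lookup is
    case-insensitive, since HTTP header casing varies by client.
    """
    lowered = {k.lower(): v for k, v in dict(headers).items()}
    return lowered.get("helicone-cache"), lowered.get("helicone-cache-bucket-idx")

# With the OpenAI Python SDK (v1+), raw response headers are available
# via `.with_raw_response`:
#
# raw = client.chat.completions.with_raw_response.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": "Hello"}],
# )
# status, bucket_idx = read_cache_status(raw.headers)
```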