LLM Caching

When developing and testing LLM applications, you often make the same requests repeatedly during debugging and iteration. Caching stores responses on the edge using Cloudflare Workers, eliminating redundant API calls and reducing both latency and costs.

Why use Caching

Save money during development: Avoid repeated charges for identical requests while testing and debugging
Reduce response latency: Serve cached responses instantly instead of waiting for LLM providers
Handle traffic spikes: Protect against rate limits and maintain performance during high usage

Helicone Dashboard showing the number of cache hits, cost, and time saved.


Quick Start

Enable caching

Add the Helicone-Cache-Enabled header to your requests:

{
  "Helicone-Cache-Enabled": "true"
}

Make your request

Send your LLM request. The first call goes to the provider, and its response is cached:

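// Assumes `openai` is an OpenAI client routed through Helicone (see the client setup under Use Cases)
const response = await openai.chat.completions.create(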
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);

Verify caching works

Make the same request again - it should return instantly from cache:

// This exact same request will return a cached response
const cachedResponse = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini", 
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);

Configuration Options

Basic Settings

Control caching behavior with these headers:

Header	Type	Description	Default	Example
`Helicone-Cache-Enabled`	`string`	Enable or disable caching	N/A	`"true"`
`Cache-Control`	`string`	Set cache duration using max-age	`"max-age=604800"` (7 days)	`"max-age=3600"` (1 hour)

Advanced Settings

Header	Type	Description	Default	Example
`Helicone-Cache-Bucket-Max-Size`	`string`	Number of responses to store per cache bucket	`"1"`	`"3"`
`Helicone-Cache-Seed`	`string`	Create separate cache namespaces	N/A	`"user-123"`

All header values must be strings. For example, "Helicone-Cache-Bucket-Max-Size": "10".
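Because every value has to be a string, it can help to build the headers in one place. The following is a minimal sketch, assuming a hypothetical `buildCacheHeaders` helper (it is not part of Helicone or the OpenAI SDK); only the header names come from the tables above.

```javascript
// Hypothetical helper: builds the cache headers listed above and converts
// every value to a string, as the note on header values requires.
function buildCacheHeaders({ maxAgeSeconds, bucketMaxSize, seed } = {}) {
  const headers = { "Helicone-Cache-Enabled": "true" };
  if (maxAgeSeconds !== undefined) {
    headers["Cache-Control"] = `max-age=${maxAgeSeconds}`;
  }
  if (bucketMaxSize !== undefined) {
    headers["Helicone-Cache-Bucket-Max-Size"] = String(bucketMaxSize);
  }
  if (seed !== undefined) {
    headers["Helicone-Cache-Seed"] = String(seed);
  }
  return headers;
}

// Numbers go in, strings come out.
console.log(buildCacheHeaders({ maxAgeSeconds: 3600, bucketMaxSize: 3, seed: "user-123" }));
```

The returned object can be passed as the `headers` option of a single request or merged into `defaultHeaders` on the client.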

Cache Duration

Set how long responses stay cached using the Cache-Control header:

{
  "Cache-Control": "max-age=3600"  // 1 hour
}

Common durations:

1 hour: max-age=3600
1 day: max-age=86400
7 days: max-age=604800 (default)
30 days: max-age=2592000

Maximum cache duration is 365 days (max-age=31536000)
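The durations above are plain second counts, so they can be computed instead of hard-coded. This sketch assumes a hypothetical `cacheControlFor` helper. It rejects out-of-range values locally, because this page does not say how the server treats a `max-age` above the limit.

```javascript
const MAX_AGE_LIMIT_SECONDS = 365 * 24 * 60 * 60; // 31536000, the documented maximum

// Hypothetical helper: turns a duration into a Cache-Control header value.
function cacheControlFor({ days = 0, hours = 0 } = {}) {
  const seconds = Math.round(days * 86400 + hours * 3600);
  if (!Number.isFinite(seconds) || seconds <= 0) {
    throw new RangeError("Cache duration must be a positive number");
  }
  if (seconds > MAX_AGE_LIMIT_SECONDS) {
    throw new RangeError("Cache duration exceeds the 365-day maximum");
  }
  return `max-age=${seconds}`;
}

console.log(cacheControlFor({ hours: 1 })); // max-age=3600
console.log(cacheControlFor({ days: 7 }));  // max-age=604800
```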

Bucket Size

Control how many different responses are stored for the same request:

{
  "Helicone-Cache-Bucket-Max-Size": "3"
}

With bucket size 3, the first 3 identical requests are cache misses that fill the bucket. After that, the same request returns one of the 3 cached responses, chosen at random:

openai.completion("give me a random number") -> "42"  # Cache Miss
openai.completion("give me a random number") -> "47"  # Cache Miss  
openai.completion("give me a random number") -> "17"  # Cache Miss

openai.completion("give me a random number") -> "42" | "47" | "17"  # Cache Hit

Maximum bucket size is 20. Enterprise plans support larger buckets.
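The fill-then-sample behavior can be modeled locally. The class below is an in-memory illustration of the behavior described in this section, not Helicone's implementation; `BucketCacheModel` and `fetchFresh` are hypothetical names, and `fetchFresh` stands in for a real provider call.

```javascript
// Local model of bucket behavior; NOT Helicone's implementation.
class BucketCacheModel {
  constructor(maxSize = 1) {
    this.maxSize = maxSize;
    this.buckets = new Map(); // request key -> array of stored responses
  }

  // Returns a fresh response until the bucket is full, then a random stored one.
  get(key, fetchFresh, random = Math.random) {
    const bucket = this.buckets.get(key) ?? [];
    if (bucket.length < this.maxSize) {
      const fresh = fetchFresh();
      bucket.push(fresh);
      this.buckets.set(key, bucket);
      return { status: "MISS", value: fresh };
    }
    const bucketIndex = Math.floor(random() * bucket.length);
    return { status: "HIT", value: bucket[bucketIndex], bucketIndex };
  }
}

const model = new BucketCacheModel(3);
const answers = ["42", "47", "17", "99"];
let next = 0;
const calls = Array.from({ length: 4 }, () =>
  model.get("give me a random number", () => answers[next++])
);
console.log(calls.map((call) => call.status)); // [ 'MISS', 'MISS', 'MISS', 'HIT' ]
```

The fourth call never reaches `fetchFresh`, so `"99"` is never returned. The hit value is one of the three stored answers.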

Cache Seeds

Create separate cache namespaces using seeds:

{
  "Helicone-Cache-Seed": "user-123"
}

Different seeds maintain separate cache states:

# Seed: "user-123"
openai.completion("random number") -> "42"
openai.completion("random number") -> "42"  # Same response

# Seed: "user-456"  
openai.completion("random number") -> "17"  # Different response
openai.completion("random number") -> "17"  # Consistent per seed

Change the seed value to effectively clear your cache for testing.
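One way to make seed changes deliberate is to derive the seed from a user ID plus a version counter. Bumping the version then moves that user to a fresh namespace. The helper name and the `-v<version>` format are assumptions for illustration; any string works as a seed.

```javascript
// Hypothetical helper: per-user cache headers with a version suffix in the seed.
function seededCacheHeaders(userId, version = 1) {
  return {
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Seed": `${userId}-v${version}`,
  };
}

console.log(seededCacheHeaders("user-123")["Helicone-Cache-Seed"]);    // user-123-v1
console.log(seededCacheHeaders("user-123", 2)["Helicone-Cache-Seed"]); // user-123-v2
```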

Use Cases

Avoid repeated charges while debugging and iterating on prompts:

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=86400" // Cache for 1 day during development
  },
});

// This request will be cached
const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Explain quantum computing" }]
});

// Subsequent identical requests return cached response instantly

Understanding Caching

How Cache Keys Work

Helicone generates cache keys by hashing:

Cache seed: If specified (for namespacing)
Request URL: The full endpoint URL
Request body: Complete request payload including all parameters
Relevant headers: Authorization and Helicone-specific cache headers
Bucket index: For multi-response caching

What triggers cache hits:

// ✅ Cache hit - identical requests
const request1 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };
const request2 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };

// ❌ Cache miss - different content  
const request3 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hi" }] };

// ❌ Cache miss - different parameters
const request4 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }], temperature: 0.5 };
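To show why any change to the inputs produces a different key, the sketch below hashes the same kinds of inputs with SHA-256. It is illustrative only: the exact fields, their serialization, and the hash function Helicone uses are not specified on this page. The `sketchCacheKey` name and the example URL are assumptions.

```javascript
const { createHash } = require("node:crypto");

// Illustrative only: combines seed, URL, body, headers, and bucket index into one hash.
// SHA-256 over a JSON array is an assumption, not Helicone's documented scheme.
function sketchCacheKey({ seed = "", url, body, headers = {}, bucketIndex = 0 }) {
  // Sort header names so that insertion order does not change the key.
  const sortedHeaders = Object.keys(headers).sort().map((name) => [name, headers[name]]);
  const material = JSON.stringify([seed, url, body, sortedHeaders, bucketIndex]);
  return createHash("sha256").update(material).digest("hex");
}

const url = "https://oai.helicone.ai/v1/chat/completions"; // example endpoint
const hello = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };

const key1 = sketchCacheKey({ url, body: hello });
const key2 = sketchCacheKey({ url, body: { ...hello } });
const key3 = sketchCacheKey({ url, body: { ...hello, temperature: 0.5 } });
console.log(key1 === key2, key1 === key3); // true false
```

In this sketch the body is serialized as given, so reordering its properties would also change the key.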

Cache Response Headers

Check cache status in response headers:

// Use .withResponse() to get both the parsed completion and the raw HTTP response
const { data: chatCompletion, response } = await openai.chat.completions
  .create(
    { /* your request */ },
    {
      headers: { "Helicone-Cache-Enabled": "true" }
    }
  )
  .withResponse();

const cacheStatus = response.headers.get("Helicone-Cache");
console.log(cacheStatus); // "HIT" or "MISS"

const bucketIndex = response.headers.get("Helicone-Cache-Bucket-Idx");
console.log(bucketIndex); // Index of cached response used
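For logging or tests, the two headers can be folded into one small object. This is a sketch assuming a hypothetical `readCacheStatus` helper that takes a fetch-style `Headers` object. Whether `Helicone-Cache-Bucket-Idx` is present on a miss is not stated here, so the helper tolerates its absence.

```javascript
// Hypothetical helper: summarizes the cache headers from a fetch-style Headers object.
function readCacheStatus(headers) {
  const status = headers.get("Helicone-Cache"); // "HIT" or "MISS"; null if absent
  const index = headers.get("Helicone-Cache-Bucket-Idx");
  return {
    hit: status === "HIT",
    bucketIndex: index === null ? null : Number(index),
  };
}

// Offline demonstration with a hand-built Headers object (global in Node 18+).
const demoHeaders = new Headers({
  "Helicone-Cache": "HIT",
  "Helicone-Cache-Bucket-Idx": "2",
});
console.log(readCacheStatus(demoHeaders)); // { hit: true, bucketIndex: 2 }
```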

Cache Bucket Behavior

Cache buckets store multiple responses for the same request:

Bucket Size 1 (default):

Same request always returns same cached response
Deterministic behavior

Bucket Size > 1:

Same request can return different cached responses
Useful for creative prompts where variety is desired
Response chosen randomly from bucket

Cache Limitations

Maximum duration: 365 days
Maximum bucket size: 20 (enterprise plans support more)
Cache key sensitivity: Any parameter change creates a new cache entry
Storage location: Cached in Cloudflare Workers KV (edge-distributed), not your infrastructure

Related Features

Custom Properties

Add metadata to cached requests for better filtering and analysis

Rate Limiting

Control request frequency and combine with caching for cost optimization

User Metrics

Track cache hit rates and savings per user or application

Webhooks

Get notified about cache performance and optimization opportunities
