When developing and testing LLM applications, you often send the same requests repeatedly while debugging and iterating. Helicone's caching stores responses at the edge using Cloudflare Workers, eliminating redundant API calls and reducing both latency and costs.

Why use Caching

  • Save money during development: Avoid repeated charges for identical requests while testing and debugging
  • Reduce response latency: Serve cached responses instantly instead of waiting for LLM providers
  • Handle traffic spikes: Protect against rate limits and maintain performance during high usage
Helicone Dashboard showing the number of cache hits, cost, and time saved.

Quick Start

1. Enable caching

Add the Helicone-Cache-Enabled header to your requests:
{
  "Helicone-Cache-Enabled": "true"
}
2. Make your request

Execute your LLM request - the first call will be cached:
// Assumes your OpenAI client is already routed through Helicone
// (baseURL "https://oai.helicone.ai/v1" with a Helicone-Auth header, as in Use Cases below)
const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);
3. Verify caching works

Make the same request again - it should return instantly from cache:
// This exact same request will return a cached response
const cachedResponse = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini", 
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);

Configuration Options

Basic Settings

Control caching behavior with these headers:
| Header | Type | Description | Default | Example |
|--------|------|-------------|---------|---------|
| Helicone-Cache-Enabled | string | Enable or disable caching | N/A | "true" |
| Cache-Control | string | Set cache duration using max-age | "max-age=604800" (7 days) | "max-age=3600" (1 hour) |
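For example, to cache a request for one hour instead of the default seven days, pass Cache-Control alongside the enable flag. A minimal sketch, assuming an OpenAI client already routed through Helicone as shown in the Use Cases section below:
// Cache this request for 1 hour (3600 seconds) instead of the 7-day default
const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Summarize our caching options" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Cache-Control": "max-age=3600"
    }
  }
);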

Advanced Settings

| Header | Type | Description | Default | Example |
|--------|------|-------------|---------|---------|
| Helicone-Cache-Bucket-Max-Size | string | Number of responses to store per cache bucket | "1" | "3" |
| Helicone-Cache-Seed | string | Create separate cache namespaces | N/A | "user-123" |
All header values must be strings. For example, "Helicone-Cache-Bucket-Max-Size": "10".
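For instance, Helicone-Cache-Seed can keep each user's cached responses in a separate namespace. A minimal sketch; the userId value here is a hypothetical application variable, not part of the API:
// Give each user a separate cache namespace with Helicone-Cache-Seed
const userId = "user-123"; // hypothetical value from your application
const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What are my usage limits?" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Seed": userId
    }
  }
);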

Use Cases

Avoid repeated charges while debugging and iterating on prompts:
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=86400" // Cache for 1 day during development
  },
});

// This request will be cached
const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Explain quantum computing" }]
});

// Subsequent identical requests return cached response instantly

Understanding Caching

How Cache Keys Work

Helicone generates cache keys by hashing:
  • Cache seed: If specified (for namespacing)
  • Request URL: The full endpoint URL
  • Request body: Complete request payload including all parameters
  • Relevant headers: Authorization and Helicone-specific cache headers
  • Bucket index: For multi-response caching
What triggers cache hits:
// ✅ Cache hit - identical requests
const request1 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };
const request2 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };

// ❌ Cache miss - different content  
const request3 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hi" }] };

// ❌ Cache miss - different parameters
const request4 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }], temperature: 0.5 };

Cache Response Headers

Check cache status in response headers:
// The Node SDK's .withResponse() helper exposes the raw HTTP response,
// so you can read Helicone's cache headers alongside the completion
const { data: chatCompletion, response: rawResponse } = await openai.chat.completions
  .create(
    { /* your request */ },
    {
      headers: { "Helicone-Cache-Enabled": "true" }
    }
  )
  .withResponse();

const cacheStatus = rawResponse.headers.get("Helicone-Cache");
console.log(cacheStatus); // "HIT" or "MISS"

const bucketIndex = rawResponse.headers.get("Helicone-Cache-Bucket-Idx");
console.log(bucketIndex); // Index of the cached response used

Cache Bucket Behavior

Cache buckets store multiple responses for the same request.

Bucket Size 1 (default):
  • Same request always returns the same cached response
  • Deterministic behavior

Bucket Size > 1:
  • Same request can return different cached responses
  • Useful for creative prompts where variety is desired
  • Response chosen randomly from the bucket (see the sketch after this list)
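
A minimal sketch of a larger bucket, assuming you want up to three distinct completions cached for the same creative prompt:
// Store up to 3 responses per bucket so repeated requests can vary
const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Write a two-line poem about caching" }],
    temperature: 1.0
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Bucket-Max-Size": "3" // header values must be strings
    }
  }
);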

Cache Limitations

  • Maximum duration: 365 days
  • Maximum bucket size: 20 (enterprise plans support more)
  • Cache key sensitivity: Any parameter change creates new cache entry
  • Storage location: Cached in Cloudflare Workers KV (edge-distributed), not your infrastructure