When developing and testing LLM applications, you often make the same requests repeatedly during debugging and iteration. Caching stores responses on the edge using Cloudflare Workers, eliminating redundant API calls and reducing both latency and costs. With the AI Gateway, caching works seamlessly across all providers - cache a Claude response and save on Anthropic API calls, or cache GPT-4 responses while testing different models.

Why Caching

Save Money

Avoid repeated charges for identical requests while testing and debugging

Instant Responses

Serve cached responses immediately instead of waiting for LLM providers

Handle Traffic Spikes

Protect against rate limits and maintain performance during high usage

How It Works

Helicone’s caching system stores LLM responses on Cloudflare’s edge network, providing globally distributed, low-latency access to cached data.

Cache Key Generation

Helicone generates unique cache keys by hashing:
  • Cache seed - Optional namespace identifier (if specified)
  • Request URL - The full endpoint URL
  • Request body - Complete request payload including all parameters
  • Relevant headers - Authorization and cache-specific headers
  • Bucket index - For multi-response caching
Any change in these components creates a new cache entry:
// ✅ Cache hit - identical requests
const request1 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };
const request2 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };

// ❌ Cache miss - different content  
const request3 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hi" }] };

// ❌ Cache miss - different parameters
const request4 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }], temperature: 0.5 };

Cache Storage

  • Responses are stored in Cloudflare Workers KV (key-value store)
  • Distributed across 300+ global edge locations
  • Automatic replication and failover
  • No impact on your infrastructure

Quick Start

1. Enable caching

Add the Helicone-Cache-Enabled header to your requests:
{
  "Helicone-Cache-Enabled": "true"
}
2. Make your request

Execute your LLM request - the first call will be cached:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);
3. Verify caching works

Make the same request again - it should return instantly from cache:
// This exact same request will return a cached response
const cachedResponse = await client.chat.completions.create(
  {
    model: "gpt-4o-mini", 
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);
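
To confirm the repeat request was served from cache, compare the wall-clock time of the two calls, or check the Helicone-Cache response header described under Cache Response Headers below. A rough timing sketch (actual numbers vary by model and region):
// Time the repeated request - a cache hit typically returns in milliseconds
const start = Date.now();
await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: { "Helicone-Cache-Enabled": "true" }
  }
);
console.log(`Repeat request took ${Date.now() - start}ms`);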

Configuration

Helicone-Cache-Enabled (string, required)
Enable or disable caching for the request.
Example: "true" to enable caching

Cache-Control (string)
Set cache duration using standard HTTP cache control directives.
Default: "max-age=604800" (7 days)
Example: "max-age=3600" for 1 hour cache

Helicone-Cache-Bucket-Max-Size (string)
Number of different responses to store for the same request. Useful for non-deterministic prompts.
Default: "1" (single response cached)
Example: "3" to cache up to 3 different responses

Helicone-Cache-Seed (string)
Create separate cache namespaces for different users or contexts.
Example: "user-123" to maintain user-specific cache

Helicone-Cache-Ignore-Keys (string)
Comma-separated JSON keys to exclude from cache key generation.
Example: "request_id,timestamp" to ignore these fields when generating cache keys
All header values must be strings. For example, "Helicone-Cache-Bucket-Max-Size": "10".
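
These headers can be combined on a single request. The following is a minimal sketch using the client from the Quick Start; the prompt and seed value are placeholders:
const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Summarize this document..." }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",       // turn caching on
      "Cache-Control": "max-age=3600",        // cache responses for 1 hour
      "Helicone-Cache-Bucket-Max-Size": "3",  // store up to 3 distinct responses
      "Helicone-Cache-Seed": "user-123"       // per-user cache namespace
    }
  }
);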

Examples

Combine Helicone caching with OpenAI’s native prompt caching by ignoring the prompt_cache_key parameter:
const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ 
      role: "user", 
      content: "Analyze this large document with cached context..." 
    }],
    prompt_cache_key: `doc-analysis-${documentId}` // Different per document
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Ignore-Keys": "prompt_cache_key", // Ignore this for Helicone cache
      "Cache-Control": "max-age=3600"                   // Cache for 1 hour
    }
  }
);

// Requests with the same message but different prompt_cache_key values 
// will hit Helicone's cache, while still leveraging OpenAI's prompt caching
// for improved performance and cost savings on both sides
This approach:
  • Uses OpenAI’s prompt caching for faster processing of repeated context
  • Uses Helicone’s caching for instant responses to identical requests
  • Ignores prompt_cache_key so Helicone cache works across different OpenAI cache entries
  • Maximizes cost savings by combining both caching strategies
The Helicone dashboard shows the number of cache hits, cost saved, and time saved.

Understanding Caching

Cache Response Headers

Check cache status by examining response headers:
// Use .withResponse() to get the raw HTTP response alongside the parsed completion
const { data: chatCompletion, response: rawResponse } = await client.chat.completions
  .create(
    { /* your request */ },
    {
      headers: { "Helicone-Cache-Enabled": "true" }
    }
  )
  .withResponse();

const cacheStatus = rawResponse.headers.get("Helicone-Cache");
console.log(cacheStatus); // "HIT" or "MISS"

const bucketIndex = rawResponse.headers.get("Helicone-Cache-Bucket-Idx");
console.log(bucketIndex); // Index of the cached response that was served

Cache Duration

Set how long responses stay cached using the Cache-Control header:
{
  "Cache-Control": "max-age=3600"  // 1 hour
}
Common durations:
  • 1 hour: max-age=3600
  • 1 day: max-age=86400
  • 7 days: max-age=604800 (default)
  • 30 days: max-age=2592000
Maximum cache duration is 365 days (max-age=31536000).
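
Because Cache-Control is a per-request header, different requests can use different durations. A minimal sketch using the client from the Quick Start (the 1-day value is arbitrary):
const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Cache-Control": "max-age=86400" // cache this response for 1 day
    }
  }
);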

Cache Buckets

Control how many different responses are stored for the same request:
{
  "Helicone-Cache-Bucket-Max-Size": "3"
}
With bucket size 3, the same request can return one of 3 different cached responses randomly:
openai.completion("give me a random number") -> "42"  # Cache Miss
openai.completion("give me a random number") -> "47"  # Cache Miss  
openai.completion("give me a random number") -> "17"  # Cache Miss

openai.completion("give me a random number") -> "42" | "47" | "17"  # Cache Hit
Behavior by bucket size:
  • Size 1 (default): Same request always returns same cached response (deterministic)
  • Size > 1: Same request can return different cached responses (useful for creative prompts)
  • Response chosen randomly from bucket
Maximum bucket size is 20. Enterprise plans support larger buckets.
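
In the SDK, bucket size is just another request header. A minimal sketch for a prompt where some variety is desirable:
const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Give me a random number" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Bucket-Max-Size": "3" // cache up to 3 distinct responses for this request
    }
  }
);

// On cache hits, the Helicone-Cache-Bucket-Idx response header reports which bucket entry was served.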

Cache Seeds

Create separate cache namespaces using seeds:
{
  "Helicone-Cache-Seed": "user-123"
}
Different seeds maintain separate cache states:
# Seed: "user-123"
openai.completion("random number") -> "42"
openai.completion("random number") -> "42"  # Same response

# Seed: "user-456"  
openai.completion("random number") -> "17"  # Different response
openai.completion("random number") -> "17"  # Consistent per seed
Change the seed value to effectively clear your cache for testing.
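
In practice, the seed is usually derived from something like a user or session ID. A minimal sketch where userId is a placeholder from your own application:
const userId = "user-123"; // placeholder - e.g. from your session or auth context

const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Give me a random number" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Seed": userId // each user gets an isolated cache namespace
    }
  }
);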

Ignore Keys

Exclude specific JSON fields from cache key generation:
{
  "Helicone-Cache-Ignore-Keys": "request_id,timestamp,session_id"
}
When these fields are ignored, requests with different values for these fields will still hit the same cache entry:
// First request
const response1 = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello" }],
    request_id: "req-123",
    timestamp: "2024-01-01T00:00:00Z"
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Ignore-Keys": "request_id,timestamp"
    }
  }
);

// Second request with different request_id and timestamp
// This will hit the cache despite different values
const response2 = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello" }],
    request_id: "req-456",  // Different ID
    timestamp: "2024-02-02T00:00:00Z"  // Different timestamp
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Ignore-Keys": "request_id,timestamp"
    }
  }
);
// response2 returns cached response from response1
This feature only works with JSON request bodies. Non-JSON bodies will use the original text for cache key generation.
Common use cases:
  • Ignore tracking IDs that don’t affect the response
  • Exclude timestamps for time-independent queries
  • Remove session or user metadata when caching shared content
  • Ignore prompt_cache_key when using OpenAI prompt caching alongside Helicone caching

Cache Limitations

  • Maximum duration: 365 days
  • Maximum bucket size: 20 (enterprise plans support more)
  • Cache key sensitivity: Any parameter change creates new cache entry
  • Storage location: Cached in Cloudflare Workers KV (edge-distributed), not your infrastructure