# Response Caching

Intelligent LLM response caching to reduce costs and improve latency.
Once enabled, the AI Gateway automatically caches LLM responses and reuses them for identical requests, reducing provider costs by up to 95% and improving response times.

Caching uses exact parameter matching with a configurable TTL, staleness policies, and bucketed responses for variety.
Benefits:
- Eliminate CI/test costs by reusing responses across test runs and development
- Reduce costs by eliminating duplicate API calls to providers
- Improve latency by serving cached responses instantly
- Handle high traffic by reducing load on upstream providers
- Reuse responses across different providers for cross-provider efficiency
## Quick Start
### 1. Create your configuration
Create `ai-gateway-config.yaml` with basic caching (a 1-hour TTL with a 30-minute stale allowance):
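A minimal sketch of what that file might look like. The field names here (`routers`, `cache`, `directive`) are assumptions based on the options described on this page, so verify them against the Configuration Reference:

```yaml
routers:
  default:
    cache:
      # Cache-Control-style directive: 1-hour TTL, 30-minute stale allowance
      directive: "max-age=3600, max-stale=1800"
```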
### 2. Start the gateway
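How you start it depends on how you installed the gateway. As one example, assuming the npm distribution and that the gateway picks up `ai-gateway-config.yaml` from the working directory (both assumptions to check against your install docs):

```bash
npx @helicone/ai-gateway@latest
```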
### 3. Test caching
Send the same request twice and watch the `helicone-cache` response header.
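For example, with a fetch-based sketch (the port, path, and model name are placeholders for your local setup):

```typescript
// Send the identical request twice; the second response should be a cache hit.
const body = JSON.stringify({
  model: "openai/gpt-4o-mini", // placeholder model name
  messages: [{ role: "user", content: "Hello, world" }],
});

for (let i = 1; i <= 2; i++) {
  const res = await fetch("http://localhost:8080/ai/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });
  // Expect "MISS" on the first call and "HIT" on the second.
  console.log(`request ${i}: helicone-cache = ${res.headers.get("helicone-cache")}`);
}
```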
✅ The second request returns instantly from the cache with a `helicone-cache: HIT` header!
For complete configuration options and syntax, see the Configuration Reference.
## Cache Options
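The options described on this page boil down to a handful of knobs. As a sketch under the same assumed schema as the Quick Start example, a fully specified cache block could look like:

```yaml
cache:
  # TTL and staleness, expressed in Cache-Control syntax
  directive: "max-age=3600, max-stale=1800"
  # number of distinct responses stored per cache key (for variety)
  buckets: 3
  # optional prefix that namespaces cache keys (e.g., per user or environment)
  seed: "my-app"
```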
## Use Cases
Use case: CI pipeline or test suite that makes repeated identical requests. Cache for the duration of the test run to eliminate all provider costs.
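A sketch under the same assumed schema, with a TTL long enough to cover a whole test run:

```yaml
routers:
  default:
    cache:
      # 24-hour TTL: every repeated request in CI is served from the cache
      directive: "max-age=86400"
```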
Use case: Production API that needs to minimize provider costs while maintaining response freshness for users.
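A sketch under the same assumed schema, trading a shorter TTL for freshness:

```yaml
routers:
  production:
    cache:
      # 1-hour TTL keeps responses fresh; 30-minute stale allowance smooths expiry
      directive: "max-age=3600, max-stale=1800"
      buckets: 3  # a little response variety for repeated queries
```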
Use case: Different environments with different caching strategies: production optimized for freshness, development for cost savings.
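A sketch under the same assumed schema, with one router per environment:

```yaml
routers:
  production:
    cache:
      directive: "max-age=3600"   # freshness first
  development:
    cache:
      directive: "max-age=86400"  # cost savings first
```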
If you use router-level cache configuration, we suggest not also setting the global cache configuration, since the merging behavior can be confusing.
## How It Works
### Exact Parameter Matching
All caching uses exact parameter matching: identical requests (same model, messages, temperature, and all other parameters) return cached responses instantly, while changing any single parameter (for example, temperature 0.7 to 0.8) produces a different cache key and therefore a miss. Request parameters are hashed to create a unique cache key.
### Request Flow
1. **Request arrives:** A request comes in with specific parameters (model, messages, temperature, etc.).
2. **Configuration merge:** Cache settings are merged in precedence order:
   - Request headers: highest priority (can override everything)
   - Router configuration: middle priority
   - Global configuration: lowest priority (fallback defaults)
3. **Cache key generation:** Request parameters are hashed to create a unique cache key, optionally prefixed with a seed for namespacing (see the sketch after these steps).
4. **Cache lookup:** The system checks the cache store for an existing response that matches the key and isn't expired.
5. **Cache hit or miss:**
   - Hit: returns the cached response instantly with a `helicone-cache: HIT` header
   - Miss: forwards the request to the provider, caches the response, and returns it with a `helicone-cache: MISS` header
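The gateway's hashing is internal, but as an illustrative sketch of step 3 (not the gateway's actual implementation), canonicalizing the request parameters and hashing them with an optional seed prefix might look like this:

```typescript
import { createHash } from "node:crypto";

// Illustrative sketch only -- not the gateway's real implementation.
// Serialize with sorted keys so logically identical requests hash identically.
function canonicalize(v: unknown): string {
  if (Array.isArray(v)) return `[${v.map(canonicalize).join(",")}]`;
  if (v !== null && typeof v === "object") {
    const entries = Object.entries(v as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, val]) => `${JSON.stringify(k)}:${canonicalize(val)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(v);
}

// Hash all request parameters, optionally namespaced by a seed.
function cacheKey(params: Record<string, unknown>, seed?: string): string {
  const canonical = canonicalize(params);
  return createHash("sha256")
    .update(seed ? `${seed}:${canonical}` : canonical)
    .digest("hex");
}

// Identical parameters -> identical key; changing any parameter -> new key.
const a = cacheKey({ model: "gpt-4o-mini", temperature: 0.7, messages: [{ role: "user", content: "hi" }] });
const b = cacheKey({ model: "gpt-4o-mini", temperature: 0.8, messages: [{ role: "user", content: "hi" }] });
console.log(a !== b); // true
```

Sorting keys during serialization is one way to ensure two logically identical requests produce the same key regardless of property order.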
## Configuration Scope
Cache settings are applied in precedence order (highest to lowest priority):
| Level | Description | When Applied |
|---|---|---|
| Request headers | Per-request cache control via headers | Overrides all other settings |
| Router configuration | Per-router cache policies | Overrides global defaults |
| Global configuration | Application-wide cache defaults | Used as fallback |
## Available Headers
Control caching behavior per-request with these headers:
- `Helicone-Cache-Enabled: true/false` - Enable or disable caching
- `Cache-Control: "max-age=3600"` - Override the cache directive
- `Helicone-Cache-Seed: "custom-seed"` - Set the cache namespace
- `Helicone-Cache-Bucket-Max-Size: 5` - Override the bucket size
## Cache Response Headers
When caching is enabled, the gateway adds response headers to indicate cache status:
- `helicone-cache: HIT/MISS` - Whether the response was served from the cache
- `helicone-cache-bucket-idx: 2` - Index of the cache bucket used (0-based)
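A sketch combining both lists: per-request header overrides on the way in, cache status headers on the way out (the URL and model name are placeholders):

```typescript
// Per-request cache control via headers.
const res = await fetch("http://localhost:8080/ai/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=3600",        // 1-hour TTL for this request only
    "Helicone-Cache-Seed": "user-123",      // namespace the cache per user
    "Helicone-Cache-Bucket-Max-Size": "3",  // store up to 3 response variants
  },
  body: JSON.stringify({
    model: "openai/gpt-4o-mini", // placeholder
    messages: [{ role: "user", content: "Tell me a joke" }],
  }),
});

console.log(res.headers.get("helicone-cache"));            // "HIT" or "MISS"
console.log(res.headers.get("helicone-cache-bucket-idx")); // e.g. "2" (0-based)
```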
## Storage Backend Options
Cached responses can be stored in different backends depending on your deployment needs. In-memory storage is currently the only available option; additional backends (Redis and database) are coming soon for distributed caching and advanced analytics.
## Strategy Selection Guide
| Use Case | Recommended Approach |
|---|---|
| Production APIs | 1-hour TTL, 1-3 buckets |
| Development/testing | 24-hour TTL, 5-10 buckets |
| Creative applications | 30-minute TTL, 10+ buckets |
| High-traffic systems | Short TTL (≤2 hours), 3-5 buckets |
| User-specific caching | Seeds for namespace isolation |
| Single instance | In-memory storage |
For complete configuration options and syntax, see the Configuration Reference.
## Coming Soon
The following caching features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Redis Storage | Distributed cache sharing across multiple router instances, with persistence across restarts | v1 |
| Database Storage | Persistent cache storage with advanced analytics and compliance features | v1 |