Intelligent LLM response caching to reduce costs and improve latency
The AI Gateway automatically caches LLM responses and reuses them for identical requests, reducing costs by up to 95% and improving response times.
Caching uses exact parameter matching with configurable TTL, staleness policies, and bucketed responses for variety.
Benefits:
- Lower costs: identical requests are served from cache instead of incurring provider charges
- Lower latency: cached responses return instantly, skipping the provider round trip
Create your configuration
Create `ai-gateway-config.yaml` with basic caching (a 1-hour TTL with a 30-minute stale allowance):
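A minimal sketch of what that file might contain. The `routers`/`cache` layout and the Cache-Control-style `directive` field reflect one plausible schema; check the Configuration Reference for the exact field names in your version:

```yaml
routers:
  my-router:
    cache:
      # Fresh for 1 hour; may serve entries up to 30 minutes stale after that
      directive: "max-age=3600, max-stale=1800"
```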
Start the gateway
Test caching
Send the same request twice and watch the second one return instantly from the cache.

✅ The second request returns instantly from cache with the `helicone-cache: HIT` header!
For complete configuration options and syntax, see the Configuration Reference.
Multiple Responses (Buckets)
Store multiple responses for the same cache key
Instead of storing one response per cache key, store multiple responses to provide variety for non-deterministic use cases while still benefiting from caching.
Best for: Creative applications where response variety is desired
Example:
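A sketch under the same assumed schema as above, adding a `buckets` field to store several responses per key:

```yaml
routers:
  my-router:
    cache:
      directive: "max-age=1800"  # 30-minute TTL
      buckets: 10                # keep up to 10 responses per cache key
```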
Cache Namespacing (Seeds)
Partition cache by seed for multi-tenant isolation
Each cache entry lives in a namespace derived from a seed. You can set the seed once in the router config or override it per-request with the `Helicone-Cache-Seed` header.
Best for: SaaS apps and multi-tenant systems that need user-level isolation
How it works: cache keys are prefixed with the `seed` value, so entries created under different seeds never collide.

Example (router config):
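A sketch with an assumed `seed` field on the router's cache block; `tenant-acme` is an illustrative placeholder:

```yaml
routers:
  my-router:
    cache:
      directive: "max-age=3600"
      seed: "tenant-acme"   # all entries for this router share this namespace
```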
Example (per-request header):
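Shown as header name/value pairs; `user-123` is an illustrative per-user seed:

```yaml
# Per-request headers (key: value), overriding any router-level seed
Helicone-Cache-Enabled: "true"
Helicone-Cache-Seed: "user-123"
```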
Use case: CI pipeline or test suite that makes repeated identical requests. Cache for the duration of the test run to eliminate all provider costs.
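One way this might look, reusing the assumed schema from above, with a TTL long enough to span a test run:

```yaml
routers:
  ci:
    cache:
      directive: "max-age=86400"  # 24 hours comfortably covers the run
      buckets: 1                  # deterministic: always replay the same response
```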
Use case: Production API that needs to minimize provider costs while maintaining response freshness for users.
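A sketch tuned for freshness, in line with the best-practices guidance later in this section (1-hour TTL, small bucket count):

```yaml
routers:
  production:
    cache:
      directive: "max-age=3600, max-stale=1800"  # 1h fresh, 30min stale grace
      buckets: 3                                  # a little variety, mostly hits
```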
Use case: Different environments with different caching strategies - production optimized for freshness, development for cost savings.
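A sketch with two routers, each carrying its own policy:

```yaml
routers:
  production:
    cache:
      directive: "max-age=3600"   # shorter TTL keeps responses fresh
  development:
    cache:
      directive: "max-age=86400"  # longer TTL maximizes cost savings
```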
If you use router-level cache configuration, we suggest not also setting the global cache configuration; the way the two merge can be confusing.
Exact Parameter Matching
All caching uses exact parameter matching—identical requests (model, messages, temperature, all parameters) return cached responses instantly. Request parameters are hashed to create a unique cache key.
Request Arrives
A request comes in with specific parameters (model, messages, temperature, etc.)
Configuration Merge
Cache settings are merged in precedence order: request headers first, then router configuration, then global defaults.
Cache Key Generation
Request parameters are hashed to create a unique cache key, optionally prefixed with the seed for namespacing
Cache Lookup
System checks the cache store for an existing response that matches the key and isn’t expired
Cache Hit or Miss
On a hit, the cached response is returned immediately with the `helicone-cache: HIT` header. On a miss, the request is forwarded to the provider, the response is stored in the cache, and it is returned with the `helicone-cache: MISS` header.

Cache settings are applied in precedence order (highest to lowest priority):
| Level | Description | When Applied |
|---|---|---|
| Request Headers | Per-request cache control via headers | Overrides all other settings |
| Router Configuration | Per-router cache policies | Overrides global defaults |
| Global Configuration | Application-wide cache defaults | Used as fallback |
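To make the precedence concrete, here is a hedged sketch; the top-level `global` block is an assumption about the schema, not confirmed syntax:

```yaml
global:
  cache:
    directive: "max-age=600"      # fallback for routers with no policy of their own
routers:
  production:
    cache:
      directive: "max-age=3600"   # overrides the global default for this router
# A request sent with the header Cache-Control: "max-age=60"
# would override both of the above for that single request.
```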
Control caching behavior per-request with these headers:
- `Helicone-Cache-Enabled: true/false` - Enable or disable caching
- `Cache-Control: "max-age=3600"` - Override the cache directive
- `Helicone-Cache-Seed: "custom-seed"` - Set the cache namespace
- `Helicone-Cache-Bucket-Max-Size: 5` - Override the bucket size

When caching is enabled, the gateway adds response headers to indicate cache status:

- `helicone-cache: HIT/MISS` - Whether the response was served from cache
- `helicone-cache-bucket-idx: 2` - Index of the cache bucket used (0-based)

Cache responses can be stored in different backends depending on your deployment needs:
In-Memory Storage
Local cache storage (Default)
Cache responses are stored locally in each router instance—no external dependencies, ultra-fast lookup.
Best for:
- Single-instance deployments
- Development and testing
- Setups that want zero external dependencies
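In-memory is the default, so it typically needs no configuration; if you want to select it explicitly, the store block might look like this (field names assumed, verify against the Configuration Reference):

```yaml
cache-store:
  type: in-memory
```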
Redis
Redis cache storage
Cache responses are stored in Redis, enabling cache sharing across multiple router instances and persistence across restarts.
Best for:
- Multi-instance deployments that need a shared cache
- Persistence across router restarts
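A hedged sketch of pointing the cache store at Redis; the `cache-store` block and its field names are assumptions, so verify them against the Configuration Reference:

```yaml
cache-store:
  type: redis
  host-url: "redis://localhost:6379"   # your Redis endpoint
```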
Database storage for caching is coming soon.
| Use Case | Recommended Approach |
|---|---|
| Production APIs | 1-hour TTL, buckets 1-3 |
| Development/Testing | 24-hour TTL, buckets 5-10 |
| Creative applications | 30-min TTL, buckets 10+ |
| High-traffic systems | Short TTL (≤2h), buckets 3-5 |
| User-specific caching | Seeds for namespace isolation |
| Single instance | In-memory storage |
| Multiple instances | Redis storage |
For complete configuration options and syntax, see the Configuration Reference.
The following caching features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Database Storage | Persistent cache storage with advanced analytics and compliance features | v1 |