# Response Caching

Intelligent LLM response caching to reduce costs and improve latency.
Once enabled, the AI Gateway automatically caches LLM responses and reuses them for identical requests, reducing provider costs by up to 95% and improving response times.

Caching uses exact parameter matching with a configurable TTL, staleness policies, and bucketed responses for variety.
Benefits:
- Eliminate CI/test costs by reusing responses across test runs and development
- Reduce costs by eliminating duplicate API calls to providers
- Improve latency by serving cached responses instantly
- Handle high traffic by reducing load on upstream providers
- Reuse responses across different providers for cross-provider efficiency
## Quick Start
### 1. Create your configuration
Create `ai-gateway-config.yaml` with basic caching (a 1-hour TTL with a 30-minute stale allowance):
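A minimal sketch of what that file might look like. The field names here (`routers`, `cache`, `directive`) are assumptions based on the options described on this page, so verify them against the Configuration Reference:

```yaml
routers:
  default:
    cache:
      # Cache-Control-style directive: 1-hour TTL, 30-minute stale allowance
      directive: "max-age=3600, max-stale=1800"
```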
### 2. Start the gateway
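How you start it depends on how you installed the gateway. As one example, assuming the npm distribution and that the gateway picks up `ai-gateway-config.yaml` from the working directory (both assumptions to check against your install docs):

```bash
npx @helicone/ai-gateway@latest
```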
### 3. Test caching
Send the same request twice and watch the `helicone-cache` response header.
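For example, with a fetch-based sketch (the port, path, and model name are placeholders for your local setup):

```typescript
// Send the identical request twice; the second response should be a cache hit.
const body = JSON.stringify({
  model: "openai/gpt-4o-mini", // placeholder model name
  messages: [{ role: "user", content: "Hello, world" }],
});

for (let i = 1; i <= 2; i++) {
  const res = await fetch("http://localhost:8080/ai/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });
  // Expect "MISS" on the first call and "HIT" on the second.
  console.log(`request ${i}: helicone-cache = ${res.headers.get("helicone-cache")}`);
}
```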
✅ The second request returns instantly from the cache with a `helicone-cache: HIT` header!
For complete configuration options and syntax, see the Configuration Reference.
## Cache Options
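The options described on this page boil down to a handful of knobs. As a sketch under the same assumed schema as the Quick Start example, a fully specified cache block could look like:

```yaml
cache:
  # TTL and staleness, expressed in Cache-Control syntax
  directive: "max-age=3600, max-stale=1800"
  # number of distinct responses stored per cache key (for variety)
  buckets: 3
  # optional prefix that namespaces cache keys (e.g., per user or environment)
  seed: "my-app"
```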
## Use Cases
Use case: CI pipeline or test suite that makes repeated identical requests. Cache for the duration of the test run to eliminate all provider costs.
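A sketch under the same assumed schema, with a TTL long enough to cover a whole test run:

```yaml
routers:
  default:
    cache:
      # 24-hour TTL: every repeated request in CI is served from the cache
      directive: "max-age=86400"
```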
Use case: Production API that needs to minimize provider costs while maintaining response freshness for users.
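A sketch under the same assumed schema, trading a shorter TTL for freshness:

```yaml
routers:
  production:
    cache:
      # 1-hour TTL keeps responses fresh; 30-minute stale allowance smooths expiry
      directive: "max-age=3600, max-stale=1800"
      buckets: 3  # a little response variety for repeated queries
```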
Use case: Different environments with different caching strategies: production optimized for freshness, development for cost savings.
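A sketch under the same assumed schema, with one router per environment:

```yaml
routers:
  production:
    cache:
      directive: "max-age=3600"   # freshness first
  development:
    cache:
      directive: "max-age=86400"  # cost savings first
```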
If you use router-level cache configuration, we suggest not also setting the global cache configuration, since the merging behavior can be confusing.
## How It Works
### Exact Parameter Matching
All caching uses exact parameter matching: identical requests (same model, messages, temperature, and all other parameters) return cached responses instantly, while changing any single parameter (for example, temperature 0.7 to 0.8) produces a different cache key and therefore a miss. Request parameters are hashed to create a unique cache key.
### Request Flow
1. **Request arrives:** A request comes in with specific parameters (model, messages, temperature, etc.).
2. **Configuration merge:** Cache settings are merged in precedence order:
   - Request headers: highest priority (can override everything)
   - Router configuration: middle priority
   - Global configuration: lowest priority (fallback defaults)
3. **Cache key generation:** Request parameters are hashed to create a unique cache key, optionally prefixed with a seed for namespacing (see the sketch after these steps).
4. **Cache lookup:** The system checks the cache store for an existing response that matches the key and isn't expired.
5. **Cache hit or miss:**
   - Hit: returns the cached response instantly with a `helicone-cache: HIT` header
   - Miss: forwards the request to the provider, caches the response, and returns it with a `helicone-cache: MISS` header
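The gateway's hashing is internal, but as an illustrative sketch of step 3 (not the gateway's actual implementation), canonicalizing the request parameters and hashing them with an optional seed prefix might look like this:

```typescript
import { createHash } from "node:crypto";

// Illustrative sketch only -- not the gateway's real implementation.
// Serialize with sorted keys so logically identical requests hash identically.
function canonicalize(v: unknown): string {
  if (Array.isArray(v)) return `[${v.map(canonicalize).join(",")}]`;
  if (v !== null && typeof v === "object") {
    const entries = Object.entries(v as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, val]) => `${JSON.stringify(k)}:${canonicalize(val)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(v);
}

// Hash all request parameters, optionally namespaced by a seed.
function cacheKey(params: Record<string, unknown>, seed?: string): string {
  const canonical = canonicalize(params);
  return createHash("sha256")
    .update(seed ? `${seed}:${canonical}` : canonical)
    .digest("hex");
}

// Identical parameters -> identical key; changing any parameter -> new key.
const a = cacheKey({ model: "gpt-4o-mini", temperature: 0.7, messages: [{ role: "user", content: "hi" }] });
const b = cacheKey({ model: "gpt-4o-mini", temperature: 0.8, messages: [{ role: "user", content: "hi" }] });
console.log(a !== b); // true
```

Sorting keys during serialization is one way to ensure two logically identical requests produce the same key regardless of property order.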
## Configuration Scope
Cache settings are applied in precedence order (highest to lowest priority):
| Level | Description | When Applied |
|---|---|---|
| Request headers | Per-request cache control via headers | Overrides all other settings |
| Router configuration | Per-router cache policies | Overrides global defaults |
| Global configuration | Application-wide cache defaults | Used as fallback |
## Available Headers
Control caching behavior per-request with these headers:
- `Helicone-Cache-Enabled: true/false` - Enable or disable caching
- `Cache-Control: "max-age=3600"` - Override the cache directive
- `Helicone-Cache-Seed: "custom-seed"` - Set the cache namespace
- `Helicone-Cache-Bucket-Max-Size: 5` - Override the bucket size
## Cache Response Headers
When caching is enabled, the gateway adds response headers to indicate cache status:
- `helicone-cache: HIT/MISS` - Whether the response was served from the cache
- `helicone-cache-bucket-idx: 2` - Index of the cache bucket used (0-based)
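A sketch combining both lists: per-request header overrides on the way in, cache status headers on the way out (the URL and model name are placeholders):

```typescript
// Per-request cache control via headers.
const res = await fetch("http://localhost:8080/ai/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=3600",        // 1-hour TTL for this request only
    "Helicone-Cache-Seed": "user-123",      // namespace the cache per user
    "Helicone-Cache-Bucket-Max-Size": "3",  // store up to 3 response variants
  },
  body: JSON.stringify({
    model: "openai/gpt-4o-mini", // placeholder
    messages: [{ role: "user", content: "Tell me a joke" }],
  }),
});

console.log(res.headers.get("helicone-cache"));            // "HIT" or "MISS"
console.log(res.headers.get("helicone-cache-bucket-idx")); // e.g. "2" (0-based)
```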
## Storage Backend Options
Cached responses can be stored in different backends depending on your deployment needs. In-memory storage is currently the only available option; additional backends (Redis and database) are coming soon for distributed caching and advanced analytics.
## Strategy Selection Guide
| Use Case | Recommended Approach |
|---|---|
| Production APIs | 1-hour TTL, 1-3 buckets |
| Development/testing | 24-hour TTL, 5-10 buckets |
| Creative applications | 30-minute TTL, 10+ buckets |
| High-traffic systems | Short TTL (≤2 hours), 3-5 buckets |
| User-specific caching | Seeds for namespace isolation |
| Single instance | In-memory storage |
For complete configuration options and syntax, see the Configuration Reference.
## Coming Soon
The following caching features are planned for future releases:
| Feature | Description | Version |
|---|---|---|
| Redis Storage | Distributed cache sharing across multiple router instances, with persistence across restarts | v1 |
| Database Storage | Persistent cache storage with advanced analytics and compliance features | v1 |