The AI Gateway automatically caches LLM responses and reuses them for identical requests, reducing costs by up to 95% and improving response times.

Caching uses exact parameter matching with configurable TTL, staleness policies, and bucketed responses for variety.

Benefits:

  • Eliminate CI/test costs by reusing responses across test runs and development
  • Reduce costs by eliminating duplicate API calls to providers
  • Improve latency by serving cached responses instantly
  • Handle high traffic by reducing load on upstream providers
  • Cross-provider efficiency by reusing responses across different providers

Quick Start

1. Create your configuration

Create ai-gateway-config.yaml with basic caching (a 1-hour TTL with a 30-minute stale allowance):

cache-store:
  in-memory: {}

global:
  cache:
    directive: "max-age=3600, max-stale=1800"
    buckets: 1
2. Start the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml
3. Test caching

Send the following request twice:

curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

✅ The second request returns instantly from cache with the helicone-cache: HIT header!
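
To confirm the cache status yourself, add curl's -i flag so the response headers are printed with the body; on the second run you should see helicone-cache: HIT:

curl -i -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'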

For complete configuration options and syntax, see the Configuration Reference.

Cache Options

Caching behavior is controlled by three settings: directive (TTL and staleness, using standard Cache-Control syntax), buckets (how many responses are stored per cache key, for response variety), and seed (an optional namespace prefix for cache keys).

Use Cases

Use case: CI pipeline or test suite that makes repeated identical requests. Cache for the duration of the test run to eliminate all provider costs.

cache-store:
  in-memory: {}

global:
  cache:
    directive: "max-age=7200, max-stale=3600"  # 2-hour TTL for test runs
    buckets: 1   # Consistent responses for tests

How It Works

Exact Parameter Matching

All caching uses exact parameter matching—identical requests (model, messages, temperature, all parameters) return cached responses instantly. Request parameters are hashed to create a unique cache key.
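
For example, two requests that differ only in temperature hash to different keys, so the second one misses even though the messages are identical (an illustrative pair against the Quick Start gateway):

# First request: cached under a key derived from model, messages, temperature=0.0, etc.
curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}], "temperature": 0.0}'

# Identical except for temperature, so it hashes to a new key: helicone-cache: MISS
curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}], "temperature": 0.7}'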

Request Flow

1. Request Arrives

A request comes in with specific parameters (model, messages, temperature, etc.).

2. Configuration Merge

Cache settings are merged in precedence order:

  • Request headers: Highest priority (can override everything)
  • Router configuration: Middle priority
  • Global configuration: Lowest priority (fallback defaults)
3. Cache Key Generation

Request parameters are hashed to create a unique cache key, optionally prefixed with a seed for namespacing (see the seed example after this list).

4. Cache Lookup

The system checks the cache store for an existing response that matches the key and isn't expired.

5. Cache Hit or Miss

  • Hit: Returns cached response instantly with helicone-cache: HIT header
  • Miss: Forwards request to provider, caches response, returns with helicone-cache: MISS header
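
To sketch step 3, the optional seed isolates otherwise identical requests into separate cache namespaces; the seed values below are illustrative:

# Same body, different seeds: two independent cache entries
curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Helicone-Cache-Seed: user-alice" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'

curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Helicone-Cache-Seed: user-bob" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'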

Configuration Scope

Cache settings are applied in precedence order (highest to lowest priority):

  • Request Headers: Per-request cache control via headers; overrides all other settings
  • Router Configuration: Per-router cache policies; overrides global defaults
  • Global Configuration: Application-wide cache defaults; used as a fallback
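
As a sketch of a router-level override, a routers block mirroring the global cache block might look like this; the router name my-router is illustrative, and the exact schema is in the Configuration Reference:

cache-store:
  in-memory: {}

global:
  cache:
    directive: "max-age=3600, max-stale=1800"  # default for all routers

routers:
  my-router:
    cache:
      directive: "max-age=300"  # assumed override: this router caches for only 5 minutes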

Available Headers

Control caching behavior per-request with these headers:

  • Helicone-Cache-Enabled: true/false - Enable/disable caching
  • Cache-Control: "max-age=3600" - Override cache directive
  • Helicone-Cache-Seed: "custom-seed" - Set cache namespace
  • Helicone-Cache-Bucket-Max-Size: 5 - Override bucket size
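
For example, a single request can opt in to caching and stretch the TTL to 24 hours regardless of the configured defaults (the max-age value here is arbitrary):

curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Helicone-Cache-Enabled: true" \
  -H "Cache-Control: max-age=86400" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'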

Cache Response Headers

When caching is enabled, the gateway adds response headers to indicate cache status:

  • helicone-cache: HIT/MISS - Whether response was served from cache
  • helicone-cache-bucket-idx: 2 - Index of cache bucket used (0-based)
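
A repeated request served from the first bucket would therefore carry headers along these lines (illustrative excerpt):

HTTP/1.1 200 OK
content-type: application/json
helicone-cache: HIT
helicone-cache-bucket-idx: 0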

Storage Backend Options

Cache responses can be stored in different backends depending on your deployment needs:

In-memory storage is currently the only available option. Additional storage backends (Redis and Database) are coming soon for distributed caching and advanced analytics.

Strategy Selection Guide

  • Production APIs: 1-hour TTL, 1-3 buckets
  • Development/Testing: 24-hour TTL, 5-10 buckets
  • Creative applications: 30-minute TTL, 10+ buckets
  • High-traffic systems: short TTL (≤2 hours), 3-5 buckets
  • User-specific caching: seeds for namespace isolation
  • Single instance: in-memory storage
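
As one sketch, the creative-applications row translates into a config like the following; the max-stale value is an assumption, so tune it to your tolerance for stale responses:

cache-store:
  in-memory: {}

global:
  cache:
    directive: "max-age=1800, max-stale=900"  # 30-minute TTL (stale allowance assumed)
    buckets: 10  # store multiple responses per key for variety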

For complete configuration options and syntax, see the Configuration Reference.

Coming Soon

The following caching features are planned for future releases:

  • Redis Storage (v1): Distributed cache sharing across multiple router instances, with persistence across restarts
  • Database Storage (v1): Persistent cache storage with advanced analytics and compliance features