Why Prompt Caching
Reduce Token Costs
Cached prompts are processed at significantly reduced rates by providers (up to 90% savings)
Faster Processing
Providers skip re-processing cached prompt segments for faster response times
Automatic Optimization
Works out of the box with the OpenAI-compatible AI Gateway across all providers
OpenAI and Compatible Providers
Automatic caching for prompts over 1024 tokens. Use the `prompt_cache_key` parameter for better cache hit control.
Compatible providers: OpenAI, Grok, Groq, DeepSeek, Moonshot AI, Azure OpenAI
Quick Start
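A minimal sketch using the standard OpenAI SDK through the gateway. The base URL, model name, cache key value, and environment variable below are illustrative, and `prompt_cache_key` requires an `openai` SDK version recent enough to include it in its request types:

```typescript
import OpenAI from "openai";

// Point the standard OpenAI SDK at the Helicone AI Gateway.
// Base URL and env var name are illustrative; use your own configuration.
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

// Only a long, stable prefix (1024+ tokens) benefits from automatic caching.
const systemPrompt = "You are a support assistant. <long, stable instructions>";

const response = await client.chat.completions.create({
  model: "gpt-4o-mini", // illustrative model name
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "How do I reset my password?" },
  ],
  // Group requests that share the same prefix under one key
  // to improve cache hit rates.
  prompt_cache_key: "support-bot-v1",
});

// Recent OpenAI responses report how many prompt tokens came from cache.
console.log(response.usage?.prompt_tokens_details?.cached_tokens);
```

Keep the key stable across a workload: requests that share both the prompt prefix and the `prompt_cache_key` are more likely to hit the same cache.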
Pricing
OpenAI charges standard rates for cache writes and offers significant discounts for cache reads; exact pricing varies by model.

Helicone Model Registry
View supported models and their caching capabilities
OpenAI Prompt Caching Documentation
Official OpenAI prompt caching guide
Anthropic (Claude)
Anthropic provides advanced caching with cache control breakpoints (up to 4 per request) and TTL control.

Using OpenAI SDK with Helicone Types

The `@helicone/helpers` SDK extends OpenAI types to support Anthropic's cache control through the OpenAI-compatible interface:
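A minimal sketch follows. The imported type name and the exact placement of the `cache_control` field are assumptions, so consult the `@helicone/helpers` documentation for the actual API; the `cache_control: { type: "ephemeral" }` breakpoint itself is Anthropic's documented format.

```typescript
import OpenAI from "openai";
// Hypothetical type name; check @helicone/helpers for the actual export.
import type { HeliconeChatCreateParams } from "@helicone/helpers";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai", // illustrative gateway endpoint
  apiKey: process.env.HELICONE_API_KEY,
});

const params: HeliconeChatCreateParams = {
  model: "claude-sonnet-4", // routed to Anthropic by the gateway
  messages: [
    {
      role: "system",
      content: [
        {
          type: "text",
          text: "<long, stable instructions of 1024+ tokens>",
          // Anthropic cache breakpoint (up to 4 per request): everything
          // up to and including this block is cached.
          cache_control: { type: "ephemeral" },
        },
      ],
    },
    { role: "user", content: "What are the key risks in this contract?" },
  ],
};

// Cast because the base OpenAI types don't include cache_control.
const response = await client.chat.completions.create(params as any);
```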
Cache Key Mapping
Anthropic uses `user_id` as a cache key on their servers. When using the OpenAI-compatible AI Gateway, these parameters automatically map to Anthropic's `user_id`:

- `prompt_cache_key`
- `safety_identifier`
- `user`
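For example (a sketch reusing the client from above; values are illustrative), setting any one of these fields on an OpenAI-style request gives Anthropic a stable cache key:

```typescript
// Any one of these OpenAI-style fields is forwarded to Anthropic
// as user_id by the gateway.
const reply = await client.chat.completions.create({
  model: "claude-sonnet-4", // illustrative model name
  messages: [{ role: "user", content: "Hello" }],
  prompt_cache_key: "support-bot-v1", // or safety_identifier, or user
});
```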
Current Limitation: Anthropic cache control is enabled for caching messages only. Support for caching tools is coming soon.
Pricing Structure
Anthropic uses a simple multiplier-based pricing model for prompt caching.

| Operation | Multiplier | Example (Claude Sonnet @ $3/MTok) |
|---|---|---|
| Cache Read | 0.1× | $0.30/MTok |
| Cache Write (5 min) | 1.25× | $3.75/MTok |
| Cache Write (1 hour) | 2.0× | $6.00/MTok |
Key Points
- TTL Options: 5 minutes or 1 hour
- Providers: Available on Anthropic API, Vertex AI, and AWS Bedrock
- Limitation: Vertex AI and Bedrock only support 5-minute caching
- Minimum: 1024 tokens for most models
Calculation Example
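As an illustration (the token count is hypothetical), suppose you cache a 2,000-token system prompt on Claude Sonnet ($3/MTok input):

- Cache write (5-minute TTL): 0.002 MTok × $3.75/MTok = $0.0075
- Each cache read: 0.002 MTok × $0.30/MTok = $0.0006, a 90% saving over the $0.006 standard input cost
- The write premium ($0.0015 over the standard rate) is recovered on the first cache hit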
Learn More
Anthropic Prompt Caching Documentation
Google Gemini
Google uses a multiplier-plus-storage-cost model for context caching.

Pricing Structure

| Operation | Multiplier | Storage Cost |
|---|---|---|
| Cache Read | 0.25× | N/A |
| Cache Write | 1.0× | + Storage fee |

Storage fees by model:
- Gemini 2.5 Pro: $4.50/MTok/hour
- Gemini 2.5 Flash: $1.00/MTok/hour
- Gemini 2.5 Flash-Lite: $1.00/MTok/hour
Key Points
- TTL: 5 minutes only
- Cache Types: Implicit (automatic) and Explicit (manual)
- Minimum: 1024 tokens (Flash), 2048 tokens (Pro)
- Discount: 75% off input costs for cache reads
Calculation Example
For Gemini 2.5 Pro (≤200K tokens):
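As an illustration (the 100,000-token figure is hypothetical), suppose you cache a 100,000-token context with the 5-minute TTL:

- Cache write: 0.1 MTok × $1.25/MTok (1.0× multiplier) = $0.125
- Storage (5 minutes): 0.1 MTok × $4.50/MTok/hour × (5/60) hour = $0.0375
- Effective write cost: $0.1625, i.e. $1.625/MTok, which is the figure shown in the tiered table below
- Each cache read: 0.1 MTok × $0.31/MTok = $0.031, versus $0.125 at the standard input rate (75% off)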
Tiered Pricing

Gemini 2.5 Pro has different rates for larger contexts:

| Context Size | Input Price | Cache Read | Cache Write (5 min) |
|---|---|---|---|
| ≤200K tokens | $1.25/MTok | $0.31/MTok | $1.625/MTok |
| >200K tokens | $2.50/MTok | $0.625/MTok | $2.875/MTok |