Prompt caching allows you to cache frequently-used context (system prompts, examples, documents) and reuse it across multiple requests at significantly reduced costs.

Why Prompt Caching

Reduce Token Costs

Cached prompts are processed at significantly reduced rates by providers (up to 90% savings)

Faster Processing

Providers skip re-processing cached prompt segments for faster response times

Automatic Optimization

Works out-of-the-box with the OpenAI-compatible AI Gateway across all providers

OpenAI and Compatible Providers

Automatic caching for prompts over 1024 tokens. Use the prompt_cache_key parameter for better control over cache hits. Compatible providers: OpenAI, Grok, Groq, DeepSeek, Moonshot AI, Azure OpenAI

Quick Start

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content: "Very long system prompt that will be automatically cached..." // 1024+ tokens
    },
    {
      role: "user", 
      content: "What is machine learning?"
    }
  ],
  prompt_cache_key: `doc-analysis-${documentId}` // Optional: control cache keying (documentId is illustrative)
});
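
To confirm that a request actually hit the cache, you can inspect the usage object on the response. This is a minimal sketch assuming the AI Gateway passes OpenAI's prompt_tokens_details field through unchanged:

// Cached prompt tokens are reported in OpenAI's usage object.
// Assumes the gateway forwards usage.prompt_tokens_details unchanged.
const cachedTokens = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;

console.log(`Prompt tokens: ${response.usage?.prompt_tokens ?? 0}`);
console.log(`Cached prompt tokens: ${cachedTokens}`);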

Pricing

OpenAI charges standard rates for cache writes and offers significant discounts for cache reads. Exact pricing varies by model.

Anthropic (Claude)

Anthropic provides advanced caching with cache control breakpoints (up to 4 per request) and TTL control.

Using OpenAI SDK with Helicone Types

The @helicone/helpers SDK extends OpenAI types to support Anthropic’s cache control through the OpenAI-compatible interface:
npm install @helicone/helpers

import OpenAI from "openai";
import { HeliconeChatCreateParams } from "@helicone/helpers";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant...",
      cache_control: {
        type: "ephemeral",
        ttl: "1h"
      }
    },
    {
        role: "assistant",
        content: "Example assistant message.",
        cache_control: { type: "ephemeral" }
    },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "This content will be cached.",
          cache_control: {
            type: "ephemeral",
            ttl: "5m"
          }
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/image.jpg",
            detail: "low"
          },
          cache_control: { type: "ephemeral" }
        }
      ]
    }
  ],
  temperature: 0.7
} as HeliconeChatCreateParams);
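
Caching only pays off when the same prefix is sent again. As a rough sketch (the follow-up request below is illustrative), repeating the identical system prompt and cache_control settings in a later request lets Anthropic serve the prefix as a cache read instead of writing it again:

// Follow-up request: the system prompt and cache_control match the request
// above exactly, so the cached prefix should be served as a cache read
// (0.1x input price) rather than re-written.
const followUp = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant...", // must match the cached prefix exactly
      cache_control: { type: "ephemeral", ttl: "1h" }
    },
    {
      role: "user",
      content: "A different question this time."
    }
  ]
} as HeliconeChatCreateParams);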

Cache Key Mapping

Anthropic uses user_id as a cache key on their servers. When using the OpenAI-compatible AI Gateway, these parameters automatically map to Anthropic’s user_id:
  • prompt_cache_key
  • safety_identifier
  • user
const response = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [/* your messages */],
  prompt_cache_key: "doc-analysis-v1", // Maps to Anthropic's user_id for cache keying
  cache_control: {
    type: "ephemeral", 
    ttl: "1h"
  }
} as HeliconeChatCreateParams);
Current Limitation: Anthropic cache control is enabled for caching messages only. Support for caching tools is coming soon.

Pricing Structure

Anthropic uses a simple multiplier-based pricing model for prompt caching.
Operation              Multiplier   Example (Claude Sonnet @ $3/MTok)
Cache Read             0.1×         $0.30/MTok
Cache Write (5 min)    1.25×        $3.75/MTok
Cache Write (1 hour)   2.0×         $6.00/MTok

Key Points

  • TTL Options: 5 minutes or 1 hour
  • Providers: Available on Anthropic API, Vertex AI, and AWS Bedrock
  • Limitation: Vertex AI and Bedrock only support 5-minute caching
  • Minimum: 1024 tokens for most models

Calculation Example

Base input price: $3/MTok
5-min cache write: $3 × 1.25 = $3.75/MTok
1-hour cache write: $3 × 2.0 = $6.00/MTok
Cache read: $3 × 0.1 = $0.30/MTok
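
To get a feel for when caching pays off, the same arithmetic can be written as a small helper. This is an illustrative sketch using the multipliers above; the function is not part of any SDK:

// Rough Anthropic prompt-caching cost model (prices in $ per million tokens).
// Multipliers come from the table above; the helper itself is illustrative.
function anthropicCacheCost(
  basePricePerMTok: number,   // e.g. 3 for Claude Sonnet
  cachedTokens: number,       // size of the cached prefix, in tokens
  reads: number,              // how many times the prefix is reused
  writeMultiplier = 1.25      // 1.25 for 5-min TTL, 2.0 for 1-hour TTL
): { withCache: number; withoutCache: number } {
  const mTok = cachedTokens / 1_000_000;
  const withCache =
    basePricePerMTok * writeMultiplier * mTok +          // one cache write
    basePricePerMTok * 0.1 * mTok * reads;               // subsequent cache reads
  const withoutCache = basePricePerMTok * mTok * (reads + 1); // same prefix sent uncached every time
  return { withCache, withoutCache };
}

// Example: a 10K-token prefix reused 20 times on Claude Sonnet ($3/MTok).
console.log(anthropicCacheCost(3, 10_000, 20));
// ~ { withCache: 0.0975, withoutCache: 0.63 }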

Learn More

Anthropic Prompt Caching Documentation

Google Gemini

Google uses a multiplier plus storage cost model for context caching.

Pricing Structure

Operation     Multiplier   Storage Cost
Cache Read    0.25×        N/A
Cache Write   1.0×         + Storage fee
Storage Rates:
  • Gemini 2.5 Pro: $4.50/MTok/hour
  • Gemini 2.5 Flash: $1.00/MTok/hour
  • Gemini 2.5 Flash-Lite: $1.00/MTok/hour

Key Points

  • TTL: 5 minutes only
  • Cache Types: Implicit (automatic) and Explicit (manual)
  • Minimum: 1024 tokens (Flash), 2048 tokens (Pro)
  • Discount: 75% off input costs for cache reads

Calculation Example

For Gemini 2.5 Pro (≤200K tokens):
Base input price: $1.25/MTok
Storage rate: $4.50/MTok/hour

Cache write (5 min):
- Input cost: $1.25 × 1.0 = $1.25
- Storage cost: $4.50 × (5/60) = $0.375
- Total: $1.625/MTok

Cache read: $1.25 × 0.25 = $0.31/MTok

Tiered Pricing

Gemini 2.5 Pro has different rates for larger contexts:
Context Size   Input Price   Cache Read    Cache Write (5 min)
≤200K tokens   $1.25/MTok    $0.31/MTok    $1.625/MTok
>200K tokens   $2.50/MTok    $0.625/MTok   $2.875/MTok
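
The tiered numbers above can be reproduced with a small helper that combines the write multiplier, the hourly storage fee, and the tier split. This is an illustrative sketch, not part of any SDK:

// Rough Gemini 2.5 Pro context-caching write cost (prices in $ per million tokens).
// Rates follow the tables above; the helper itself is illustrative.
function geminiCacheWriteCost(
  cachedTokens: number,        // size of the cached context, in tokens
  ttlMinutes = 5,              // Gemini caching currently uses 5-minute TTLs
  storagePerMTokHour = 4.5     // Gemini 2.5 Pro storage rate
): number {
  // Tiered base input price for Gemini 2.5 Pro.
  const basePrice = cachedTokens <= 200_000 ? 1.25 : 2.5;
  const mTok = cachedTokens / 1_000_000;
  const writeCost = basePrice * 1.0 * mTok;                          // 1.0x multiplier on input
  const storageCost = storagePerMTokHour * (ttlMinutes / 60) * mTok; // pro-rated hourly storage fee
  return writeCost + storageCost;
}

// Example: 100K cached tokens held for 5 minutes.
console.log(geminiCacheWriteCost(100_000)); // ~ $0.1625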