Prompt caching allows you to cache frequently-used context (system prompts, examples, documents) and reuse it across multiple requests at significantly reduced costs.

Why Prompt Caching

Reduce Token Costs

Cached prompts are processed at significantly reduced rates by providers (up to 90% savings)

Faster Processing

Providers skip re-processing cached prompt segments for faster response times

Automatic Optimization

Works out-of-the-box with the OpenAI-compatible AI Gateway across all providers

OpenAI and Compatible Providers

Automatic caching for prompts over 1024 tokens. Use the prompt_cache_key parameter for better control over cache hits. Compatible providers: OpenAI, Grok, Groq, DeepSeek, Moonshot AI, Azure OpenAI

Quick Start

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content: "Very long system prompt that will be automatically cached..." // 1024+ tokens
    },
    {
      role: "user", 
      content: "What is machine learning?"
    }
  ],
  prompt_cache_key: `doc-analysis-${documentId}` // Optional: control cache keying (documentId is illustrative)
});
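
To confirm that a request actually hit the cache, you can inspect the usage object on the response. This is a minimal sketch assuming the AI Gateway passes OpenAI's prompt_tokens_details field through unchanged:

// Cached prompt tokens are reported in OpenAI's usage object.
// Assumes the gateway forwards usage.prompt_tokens_details unchanged.
const cachedTokens = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;

console.log(`Prompt tokens: ${response.usage?.prompt_tokens ?? 0}`);
console.log(`Cached prompt tokens: ${cachedTokens}`);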

Pricing

OpenAI charges standard rates for cache writes and offers significant discounts for cache reads. Exact pricing varies by model.

Anthropic (Claude)

Anthropic provides advanced caching with cache control breakpoints (up to 4 per request) and TTL control.

Using OpenAI SDK with Helicone Types

The @helicone/helpers SDK extends OpenAI types to support Anthropic’s cache control through the OpenAI-compatible interface:
npm install @helicone/helpers

import OpenAI from "openai";
import { HeliconeChatCreateParams } from "@helicone/helpers";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant...",
      cache_control: {
        type: "ephemeral",
        ttl: "1h"
      }
    },
    {
        role: "assistant",
        content: "Example assistant message.",
        cache_control: { type: "ephemeral" }
    },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "This content will be cached.",
          cache_control: {
            type: "ephemeral",
            ttl: "5m"
          }
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/image.jpg",
            detail: "low"
          },
          cache_control: { type: "ephemeral" }
        }
      ]
    }
  ],
  temperature: 0.7
} as HeliconeChatCreateParams);
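
Caching only pays off when the same prefix is sent again. As a rough sketch (the follow-up request below is illustrative), repeating the identical system prompt and cache_control settings in a later request lets Anthropic serve the prefix as a cache read instead of writing it again:

// Follow-up request: the system prompt and cache_control match the request
// above exactly, so the cached prefix should be served as a cache read
// (0.1x input price) rather than re-written.
const followUp = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant...", // must match the cached prefix exactly
      cache_control: { type: "ephemeral", ttl: "1h" }
    },
    {
      role: "user",
      content: "A different question this time."
    }
  ]
} as HeliconeChatCreateParams);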

Cache Key Mapping

Anthropic uses user_id as a cache key on their servers. When using the OpenAI-compatible AI Gateway, these parameters automatically map to Anthropic’s user_id:
  • prompt_cache_key
  • safety_identifier
  • user
const response = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [/* your messages */],
  prompt_cache_key: "doc-analysis-v1", // Maps to Anthropic's user_id for cache keying
  cache_control: {
    type: "ephemeral", 
    ttl: "1h"
  }
} as HeliconeChatCreateParams);
Current Limitation: Anthropic cache control is enabled for caching messages only. Support for caching tools is coming soon.

Pricing Structure

Anthropic uses a simple multiplier-based pricing model for prompt caching.
Operation              Multiplier   Example (Claude Sonnet @ $3/MTok)
Cache Read             0.1×         $0.30/MTok
Cache Write (5 min)    1.25×        $3.75/MTok
Cache Write (1 hour)   2.0×         $6.00/MTok

Key Points

  • TTL Options: 5 minutes or 1 hour
  • Providers: Available on Anthropic API, Vertex AI, and AWS Bedrock
  • Limitation: Vertex AI and Bedrock only support 5-minute caching
  • Minimum: 1024 tokens for most models

Calculation Example

Base input price: $3/MTok
5-min cache write: $3 × 1.25 = $3.75/MTok
1-hour cache write: $3 × 2.0 = $6.00/MTok
Cache read: $3 × 0.1 = $0.30/MTok
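
To get a feel for when caching pays off, the same arithmetic can be written as a small helper. This is an illustrative sketch using the multipliers above; the function is not part of any SDK:

// Rough Anthropic prompt-caching cost model (prices in $ per million tokens).
// Multipliers come from the table above; the helper itself is illustrative.
function anthropicCacheCost(
  basePricePerMTok: number,   // e.g. 3 for Claude Sonnet
  cachedTokens: number,       // size of the cached prefix, in tokens
  reads: number,              // how many times the prefix is reused
  writeMultiplier = 1.25      // 1.25 for 5-min TTL, 2.0 for 1-hour TTL
): { withCache: number; withoutCache: number } {
  const mTok = cachedTokens / 1_000_000;
  const withCache =
    basePricePerMTok * writeMultiplier * mTok +          // one cache write
    basePricePerMTok * 0.1 * mTok * reads;               // subsequent cache reads
  const withoutCache = basePricePerMTok * mTok * (reads + 1); // same prefix sent uncached every time
  return { withCache, withoutCache };
}

// Example: a 10K-token prefix reused 20 times on Claude Sonnet ($3/MTok).
console.log(anthropicCacheCost(3, 10_000, 20));
// ~ { withCache: 0.0975, withoutCache: 0.63 }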

Learn More

Anthropic Prompt Caching Documentation

Google Gemini

Google uses a multiplier plus storage cost model for context caching.

Pricing Structure

Operation     Multiplier   Storage Cost
Cache Read    0.25×        N/A
Cache Write   1.0×         + Storage fee
Storage Rates:
  • Gemini 2.5 Pro: $4.50/MTok/hour
  • Gemini 2.5 Flash: $1.00/MTok/hour
  • Gemini 2.5 Flash-Lite: $1.00/MTok/hour

Key Points

  • TTL: 5 minutes only
  • Cache Types: Implicit (automatic) and Explicit (manual)
  • Minimum: 1024 tokens (Flash), 2048 tokens (Pro)
  • Discount: 75% off input costs for cache reads

Calculation Example

For Gemini 2.5 Pro (≤200K tokens):
Base input price: $1.25/MTok
Storage rate: $4.50/MTok/hour

Cache write (5 min):
- Input cost: $1.25 × 1.0 = $1.25
- Storage cost: $4.50 × (5/60) = $0.375
- Total: $1.625/MTok

Cache read: $1.25 × 0.25 = $0.31/MTok

Tiered Pricing

Gemini 2.5 Pro has different rates for larger contexts:
Context Size   Input Price   Cache Read    Cache Write (5 min)
≤200K tokens   $1.25/MTok    $0.31/MTok    $1.625/MTok
>200K tokens   $2.50/MTok    $0.625/MTok   $2.875/MTok
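
The tiered numbers above can be reproduced with a small helper that combines the write multiplier, the hourly storage fee, and the tier split. This is an illustrative sketch, not part of any SDK:

// Rough Gemini 2.5 Pro context-caching write cost (prices in $ per million tokens).
// Rates follow the tables above; the helper itself is illustrative.
function geminiCacheWriteCost(
  cachedTokens: number,        // size of the cached context, in tokens
  ttlMinutes = 5,              // Gemini caching currently uses 5-minute TTLs
  storagePerMTokHour = 4.5     // Gemini 2.5 Pro storage rate
): number {
  // Tiered base input price for Gemini 2.5 Pro.
  const basePrice = cachedTokens <= 200_000 ? 1.25 : 2.5;
  const mTok = cachedTokens / 1_000_000;
  const writeCost = basePrice * 1.0 * mTok;                          // 1.0x multiplier on input
  const storageCost = storagePerMTokHour * (ttlMinutes / 60) * mTok; // pro-rated hourly storage fee
  return writeCost + storageCost;
}

// Example: 100K cached tokens held for 5 minutes.
console.log(geminiCacheWriteCost(100_000)); // ~ $0.1625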