# Prompt Caching

> Cache frequently-used context across LLM providers for reduced costs and faster responses

Prompt caching lets you cache frequently used context (system prompts, examples, documents) and reuse it across multiple requests at a significantly reduced cost.

## Why Prompt Caching

<CardGroup cols={3}>
  <Card title="Reduce Token Costs" icon="dollar-sign">
    Cached prompts are processed at significantly reduced rates by providers (up to 90% savings)
  </Card>

  <Card title="Faster Processing" icon="bolt">
    Providers skip re-processing cached prompt segments for faster response times
  </Card>

  <Card title="Automatic Optimization" icon="puzzle-piece">
    Works out of the box with the OpenAI-compatible AI Gateway across supported providers
  </Card>
</CardGroup>

***

## OpenAI and Compatible Providers

**Automatic caching** applies to prompts of 1024 tokens or more. Use the `prompt_cache_key` parameter for finer control over cache hits.

**Compatible providers:** OpenAI, Grok, Groq, DeepSeek, Moonshot AI, Azure OpenAI

### Quick Start

```typescript theme={null}
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

// Hypothetical identifier; replace with the ID of the document you are analyzing
const documentId = "doc-123";

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content: "Very long system prompt that will be automatically cached..." // 1024+ tokens
    },
    {
      role: "user",
      content: "What is machine learning?"
    }
  ],
  prompt_cache_key: `doc-analysis-${documentId}` // Optional: control caching keys
});
```
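
To confirm that caching is taking effect, one option is to inspect the `usage` object on the response. The sketch below assumes the gateway passes through OpenAI's `usage.prompt_tokens_details.cached_tokens` field unchanged; `UsageLike` and `cachedTokenRatio` are hypothetical names introduced for illustration.

```typescript
// Minimal shape of the usage fields this sketch reads (assumed to match
// OpenAI's chat completion response).
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Returns the fraction of prompt tokens served from cache, between 0 and 1.
function cachedTokenRatio(usage: UsageLike | undefined): number {
  if (!usage || usage.prompt_tokens === 0) return 0;
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return cached / usage.prompt_tokens;
}

// Example with a hand-written usage object; in practice pass `response.usage`.
const ratio = cachedTokenRatio({
  prompt_tokens: 2048,
  prompt_tokens_details: { cached_tokens: 1024 },
});
console.log(`Cached share of prompt: ${(ratio * 100).toFixed(1)}%`);
```

A ratio that stays at 0 across repeated requests with the same long prefix suggests the prompt is below the minimum length or the prefix is changing between calls.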

### Pricing

OpenAI charges standard rates for cache writes and offers significant discounts for cache reads. Exact pricing varies by model.

<CardGroup cols={2}>
  <Card title="Helicone Model Registry" href="https://helicone.ai/models">
    View supported models and their caching capabilities
  </Card>

  <Card title="OpenAI Prompt Caching Documentation" href="https://platform.openai.com/docs/guides/prompt-caching">
    Official OpenAI prompt caching guide
  </Card>
</CardGroup>

***

## Anthropic (Claude)

Anthropic provides advanced caching with **cache control breakpoints** (up to 4 per request) and TTL control.

### Using OpenAI SDK with Helicone Types

The `@helicone/helpers` SDK extends OpenAI types to support Anthropic's cache control through the OpenAI-compatible interface:

```bash theme={null}
npm install @helicone/helpers
```

```typescript theme={null}
import OpenAI from "openai";
import { HeliconeChatCreateParams } from "@helicone/helpers";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [
    {
      role: "system",
      content: "You are a helpful assistant...",
      cache_control: {
        type: "ephemeral",
        ttl: "1h"
      }
    },
    {
      role: "assistant",
      content: "Example assistant message.",
      cache_control: { type: "ephemeral" }
    },
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "This content will be cached.",
          cache_control: {
            type: "ephemeral",
            ttl: "5m"
          }
        },
        {
          type: "image_url",
          image_url: {
            url: "https://example.com/image.jpg",
            detail: "low"
          },
          cache_control: { type: "ephemeral" }
        }
      ]
    }
  ],
  temperature: 0.7
} as HeliconeChatCreateParams);
```
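
Because Anthropic allows at most 4 cache control breakpoints per request, it can help to count them before sending. The following is a minimal sketch, not part of `@helicone/helpers`: the simplified `Message` types and the `countCacheBreakpoints` and `assertWithinBreakpointLimit` helpers are hypothetical and only model the fields needed for the check.

```typescript
// Simplified message shapes (assumed); only the fields this check needs.
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };
type ContentPart = { type: string; cache_control?: CacheControl; [key: string]: unknown };
type Message = {
  role: string;
  content: string | ContentPart[];
  cache_control?: CacheControl;
};

const MAX_BREAKPOINTS = 4; // Limit stated in this guide

// Counts cache_control markers on messages and on individual content parts.
function countCacheBreakpoints(messages: Message[]): number {
  let count = 0;
  for (const message of messages) {
    if (message.cache_control) count += 1;
    if (Array.isArray(message.content)) {
      for (const part of message.content) {
        if (part.cache_control) count += 1;
      }
    }
  }
  return count;
}

// Throws if the request would exceed the breakpoint limit.
function assertWithinBreakpointLimit(messages: Message[]): void {
  const count = countCacheBreakpoints(messages);
  if (count > MAX_BREAKPOINTS) {
    throw new Error(`Found ${count} cache breakpoints; the limit is ${MAX_BREAKPOINTS}`);
  }
}

// Same structure as the request above: system, assistant, and two user parts.
const exampleMessages: Message[] = [
  { role: "system", content: "You are a helpful assistant...", cache_control: { type: "ephemeral", ttl: "1h" } },
  { role: "assistant", content: "Example assistant message.", cache_control: { type: "ephemeral" } },
  {
    role: "user",
    content: [
      { type: "text", text: "This content will be cached.", cache_control: { type: "ephemeral", ttl: "5m" } },
      { type: "image_url", cache_control: { type: "ephemeral" } },
    ],
  },
];

assertWithinBreakpointLimit(exampleMessages);
console.log(`Breakpoints used: ${countCacheBreakpoints(exampleMessages)}`);
```

The example request above uses all 4 available breakpoints, so adding another `cache_control` marker would make the check throw.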

### Cache Key Mapping

Anthropic uses `user_id` as a cache key on its servers. When you use the OpenAI-compatible AI Gateway, each of the following parameters maps automatically to Anthropic's `user_id`:

* `prompt_cache_key`
* `safety_identifier`
* `user`

```typescript theme={null}
const response = await client.chat.completions.create({
  model: "claude-3.5-haiku",
  messages: [/* your messages */],
  prompt_cache_key: "doc-analysis-v1", // Maps to Anthropic's user_id for cache keying
  cache_control: {
    type: "ephemeral", 
    ttl: "1h"
  }
} as HeliconeChatCreateParams);
```

<Note>
  **Current Limitation**: Anthropic cache control is currently enabled for caching messages only. Support for caching tools is coming soon.
</Note>

### Pricing Structure

Anthropic prices prompt caching with simple multipliers applied to the model's base input price.

| Operation            | Multiplier | Example (Claude Sonnet @ \$3/MTok) |
| -------------------- | ---------- | ---------------------------------- |
| Cache Read           | 0.1×       | \$0.30/MTok                        |
| Cache Write (5 min)  | 1.25×      | \$3.75/MTok                        |
| Cache Write (1 hour) | 2.0×       | \$6.00/MTok                        |

### Key Points

* **TTL Options**: 5 minutes or 1 hour
* **Providers**: Available on Anthropic API, Vertex AI, and AWS Bedrock
* **Limitation**: Vertex AI and Bedrock only support 5-minute caching
* **Minimum**: 1024 tokens for most models

### Calculation Example

```
Base input price: $3/MTok
5-min cache write: $3 × 1.25 = $3.75/MTok
1-hour cache write: $3 × 2.0 = $6.00/MTok
Cache read: $3 × 0.1 = $0.30/MTok
```
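
The multipliers also show when caching pays off. The sketch below compares the cost of resending the same prefix with and without caching. It assumes one cache write followed by cache hits on every later request (that is, each request arrives before the TTL expires); the function names are hypothetical.

```typescript
// Multipliers from the pricing table above.
const CACHE_READ_MULTIPLIER = 0.1;
const CACHE_WRITE_MULTIPLIER = { "5m": 1.25, "1h": 2.0 } as const;
type CacheTtl = keyof typeof CACHE_WRITE_MULTIPLIER;

const TOKENS_PER_MTOK = 1_000_000;

// Dollar cost of sending a cached prefix `requests` times:
// one cache write, then (requests - 1) cache reads.
function cachedPrefixCost(
  basePricePerMTok: number,
  prefixTokens: number,
  requests: number,
  ttl: CacheTtl,
): number {
  if (requests < 1) return 0;
  const basePrefixCost = basePricePerMTok * (prefixTokens / TOKENS_PER_MTOK);
  const writeCost = basePrefixCost * CACHE_WRITE_MULTIPLIER[ttl];
  const readCost = basePrefixCost * CACHE_READ_MULTIPLIER * (requests - 1);
  return writeCost + readCost;
}

// Dollar cost of the same traffic with no caching.
function uncachedPrefixCost(
  basePricePerMTok: number,
  prefixTokens: number,
  requests: number,
): number {
  return basePricePerMTok * (prefixTokens / TOKENS_PER_MTOK) * requests;
}

// Example: a 10,000-token prefix on a $3/MTok model, sent 10 times.
console.log(cachedPrefixCost(3, 10_000, 10, "5m").toFixed(4));
console.log(uncachedPrefixCost(3, 10_000, 10).toFixed(4));
```

Under these assumptions the example costs about \$0.0645 with caching versus \$0.30 without. The 5-minute cache is cheaper from the second request onward, and the 1-hour cache from the third.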

<Card title="Learn More" href="https://docs.claude.com/en/docs/build-with-claude/prompt-caching">
  Anthropic Prompt Caching Documentation
</Card>

***

## Google Gemini

Google prices context caching with a multiplier on the base input price plus a storage fee that scales with how long the cache is kept.

### Pricing Structure

| Operation   | Multiplier | Storage Cost  |
| ----------- | ---------- | ------------- |
| Cache Read  | 0.25×      | N/A           |
| Cache Write | 1.0×       | + Storage fee |

**Storage Rates:**

* Gemini 2.5 Pro: \$4.50/MTok/hour
* Gemini 2.5 Flash: \$1.00/MTok/hour
* Gemini 2.5 Flash-Lite: \$1.00/MTok/hour

### Key Points

* **TTL**: 5 minutes only
* **Cache Types**: Implicit (automatic) and Explicit (manual)
* **Minimum**: 1024 tokens (Flash), 2048 tokens (Pro)
* **Discount**: 75% off input costs for cache reads

### Calculation Example

For Gemini 2.5 Pro (≤200K tokens):

```
Base input price: $1.25/MTok
Storage rate: $4.50/MTok/hour

Cache write (5 min):
- Input cost: $1.25 × 1.0 = $1.25/MTok
- Storage cost: $4.50 × (5/60) = $0.375/MTok
- Total: $1.25 + $0.375 = $1.625/MTok

Cache read: $1.25 × 0.25 = $0.3125 ≈ $0.31/MTok
```
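
The same calculation can be written as a small helper. This is a sketch under the pricing model described above, with hypothetical function names; the storage fee is assumed to be prorated linearly by the minute.

```typescript
// Multipliers from the Gemini pricing table above.
const GEMINI_CACHE_READ_MULTIPLIER = 0.25;
const GEMINI_CACHE_WRITE_MULTIPLIER = 1.0;

// Cost per MTok of writing to the cache and storing it for `ttlMinutes`.
function geminiCacheWritePerMTok(
  inputPricePerMTok: number,
  storagePerMTokHour: number,
  ttlMinutes: number,
): number {
  const inputCost = inputPricePerMTok * GEMINI_CACHE_WRITE_MULTIPLIER;
  const storageCost = storagePerMTokHour * (ttlMinutes / 60); // prorated (assumed)
  return inputCost + storageCost;
}

// Cost per MTok of reading previously cached tokens.
function geminiCacheReadPerMTok(inputPricePerMTok: number): number {
  return inputPricePerMTok * GEMINI_CACHE_READ_MULTIPLIER;
}

// Gemini 2.5 Pro (≤200K tokens), 5-minute cache.
console.log(geminiCacheWritePerMTok(1.25, 4.5, 5).toFixed(3));
console.log(geminiCacheReadPerMTok(1.25).toFixed(4));
```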

### Tiered Pricing

Gemini 2.5 Pro charges higher rates for prompts larger than 200K tokens:

| Context Size | Input Price | Cache Read   | Cache Write (5 min) |
| ------------ | ----------- | ------------ | ------------------- |
| ≤200K tokens | \$1.25/MTok | \$0.31/MTok  | \$1.625/MTok        |
| >200K tokens | \$2.50/MTok | \$0.625/MTok | \$2.875/MTok        |
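
To see how the tiers interact with cache reads, the sketch below estimates the input cost of a single Gemini 2.5 Pro request. It assumes the tier is chosen by the total prompt size and that cached tokens are billed at 0.25× the tier's input price; both the helper names and that tier-selection rule are assumptions for illustration.

```typescript
const GEMINI_PRO_TIER_THRESHOLD_TOKENS = 200_000;
const GEMINI_PRO_CACHE_READ_MULTIPLIER = 0.25;

// Base input price in $/MTok for the tier matching the prompt size (table above).
function geminiProInputPricePerMTok(promptTokens: number): number {
  return promptTokens <= GEMINI_PRO_TIER_THRESHOLD_TOKENS ? 1.25 : 2.5;
}

// Estimated dollar cost of the input side of one request, where
// `cachedTokens` of the prompt are served from cache.
function estimateGeminiProInputCost(promptTokens: number, cachedTokens: number): number {
  if (cachedTokens < 0 || cachedTokens > promptTokens) {
    throw new RangeError("cachedTokens must be between 0 and promptTokens");
  }
  const price = geminiProInputPricePerMTok(promptTokens);
  const uncachedTokens = promptTokens - cachedTokens;
  const weightedTokens = uncachedTokens + cachedTokens * GEMINI_PRO_CACHE_READ_MULTIPLIER;
  return (weightedTokens * price) / 1_000_000;
}

// A 300K-token prompt with 250K tokens read from cache.
console.log(estimateGeminiProInputCost(300_000, 250_000).toFixed(5));
// The same prompt with no cache hits.
console.log(estimateGeminiProInputCost(300_000, 0).toFixed(5));
```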
