What This Does
- Estimates tokens for your request based on model and content
- Accounts for reserved output tokens (e.g., `max_tokens`, `max_output_tokens`)
- Applies the chosen strategy only when the estimated input exceeds the allowed context
Helicone uses provider-aware heuristics to estimate tokens, applying a best-effort approach across different request shapes.
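The estimation step can be sketched roughly as follows. This is an illustrative stand-in, not Helicone's implementation: a simple chars/4 heuristic replaces the provider-aware estimators, and the function names are invented for the example.

```python
# Sketch only: chars/4 is a common rough proxy for token count;
# Helicone's actual provider-aware estimators are more precise.
def estimate_tokens(messages):
    """Roughly estimate input tokens for a chat-style request body."""
    text = "".join(m.get("content", "") for m in messages)
    return max(1, len(text) // 4)

def exceeds_context(messages, context_window, reserved_output=0):
    """True when the estimated input plus reserved output tokens
    (e.g. max_tokens) would not fit in the model's context window."""
    return estimate_tokens(messages) + reserved_output > context_window

msgs = [{"role": "user", "content": "x" * 4000}]  # ~1000 tokens
print(exceeds_context(msgs, context_window=1024, reserved_output=100))  # True
```

A strategy only kicks in when this check fails; requests that fit are passed through unchanged.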
Strategies
- Truncate (`truncate`): Normalize and trim message content to reduce token count.
- Middle-out (`middle-out`): Preserve the beginning and end of messages while trimming middle content to fit the limit.
- Fallback (`fallback`): Switch to an alternate model when the request is too large. Provide multiple candidates in the request body `model` field as a comma-separated list (first is primary, second is fallback).
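The middle-out idea from the list above can be sketched like this. It is an assumption-laden illustration (Helicone's real trimming may operate on content spans, not whole messages): drop messages from the middle until the estimate fits, keeping the start and end of the conversation.

```python
def middle_out(messages, estimate, budget):
    """Drop messages from the middle until the estimate fits the budget,
    preserving the beginning and end of the conversation. A sketch of
    the strategy, not Helicone's actual algorithm."""
    msgs = list(messages)
    while len(msgs) > 2 and estimate(msgs) > budget:
        msgs.pop(len(msgs) // 2)  # remove a middle message
    return msgs

# ~100 tokens per message under a chars/4 heuristic
est = lambda msgs: sum(len(m["content"]) // 4 for m in msgs)
msgs = [{"role": "user", "content": "x" * 400} for _ in range(10)]
trimmed = middle_out(msgs, est, budget=500)
print(len(trimmed))  # 5
```

The first and last messages survive the trim, which matters because system prompts and the latest user turn usually carry the most signal.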
For `fallback`, Helicone picks the second candidate if needed. When under the limit, Helicone normalizes the model to the primary. If your body lacks a `model` field, set `Helicone-Model-Override`.

Quick Start

Add the `Helicone-Token-Limit-Exception-Handler` header to enable a strategy.
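For instance, a request's headers might look like the following. The `Helicone-Token-Limit-Exception-Handler` and `Helicone-Model-Override` headers come from this document; the auth header shapes and the model name are illustrative placeholders, so adapt them to your setup.

```python
# Sketch: enable the truncate strategy via request headers.
# Key/model values below are placeholders, not working credentials.
headers = {
    "Authorization": "Bearer <PROVIDER_API_KEY>",
    "Helicone-Auth": "Bearer <HELICONE_API_KEY>",
    "Helicone-Token-Limit-Exception-Handler": "truncate",
    # Optional: used for token estimation when the body lacks a model field
    "Helicone-Model-Override": "gpt-4o-mini",
}
print(headers["Helicone-Token-Limit-Exception-Handler"])  # truncate
```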
Configuration

Enable and control via headers:

- `Helicone-Token-Limit-Exception-Handler`: one of `truncate`, `middle-out`, `fallback`.
- `Helicone-Model-Override` (optional): used for token estimation and model selection when the request body doesn't include a `model` or you need to override it.

Fallback Model Selection
- Provide candidates in the body: `model: "primary, fallback"`
- Helicone chooses the fallback when input exceeds the allowed context
- When under the limit, Helicone normalizes the `model` to the primary
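The selection logic described by these bullets can be sketched as a small function. This mirrors the comma-separated `"primary, fallback"` convention above but is a simplified illustration, not Helicone's implementation.

```python
def resolve_model(model_field, over_limit):
    """Pick the fallback candidate when the input is over the limit,
    otherwise normalize to the primary (a sketch of the behavior
    described above)."""
    candidates = [m.strip() for m in model_field.split(",") if m.strip()]
    if over_limit and len(candidates) > 1:
        return candidates[1]
    return candidates[0]

print(resolve_model("gpt-4o, gpt-4o-mini", over_limit=True))   # gpt-4o-mini
print(resolve_model("gpt-4o, gpt-4o-mini", over_limit=False))  # gpt-4o
```

Note that even under the limit the `model` field is rewritten to the primary alone, so the provider never sees the comma-separated list.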
Notes
- Token estimation is heuristic and provider-aware; behavior is best-effort across request shapes.
- Allowed context accounts for requested completion tokens (e.g., `max_tokens`).
- Changes are applied before the provider call; your logged request reflects the applied strategy.
Need more help?
Additional questions or feedback? Reach out to [email protected] or schedule a call with us.