What This Does
- Estimates tokens for your request based on model and content
- Accounts for reserved output tokens (e.g., `max_tokens`, `max_output_tokens`)
- Applies the chosen strategy only when the estimated input exceeds the allowed context
Helicone uses provider-aware heuristics to estimate tokens, applying a best-effort approach across different request shapes.
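The estimation step can be sketched roughly as follows. This is an illustrative stand-in, not Helicone's implementation: a simple chars/4 heuristic replaces the provider-aware estimators, and the function names are invented for the example.

```python
# Sketch only: chars/4 is a common rough proxy for token count;
# Helicone's actual provider-aware estimators are more precise.
def estimate_tokens(messages):
    """Roughly estimate input tokens for a chat-style request body."""
    text = "".join(m.get("content", "") for m in messages)
    return max(1, len(text) // 4)

def exceeds_context(messages, context_window, reserved_output=0):
    """True when the estimated input plus reserved output tokens
    (e.g. max_tokens) would not fit in the model's context window."""
    return estimate_tokens(messages) + reserved_output > context_window

msgs = [{"role": "user", "content": "x" * 4000}]  # ~1000 tokens
print(exceeds_context(msgs, context_window=1024, reserved_output=100))  # True
```

A strategy only kicks in when this check fails; requests that fit are passed through unchanged.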
Strategies
- Truncate (`truncate`): Normalize and trim message content to reduce token count.
- Middle-out (`middle-out`): Preserve the beginning and end of messages while trimming middle content to fit the limit.
- Fallback (`fallback`): Switch to an alternate model when the request is too large. Provide multiple candidates in the request body `model` field as a comma-separated list (first is primary, second is fallback).
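The middle-out idea from the list above can be sketched like this. It is an assumption-laden illustration (Helicone's real trimming may operate on content spans, not whole messages): drop messages from the middle until the estimate fits, keeping the start and end of the conversation.

```python
def middle_out(messages, estimate, budget):
    """Drop messages from the middle until the estimate fits the budget,
    preserving the beginning and end of the conversation. A sketch of
    the strategy, not Helicone's actual algorithm."""
    msgs = list(messages)
    while len(msgs) > 2 and estimate(msgs) > budget:
        msgs.pop(len(msgs) // 2)  # remove a middle message
    return msgs

# ~100 tokens per message under a chars/4 heuristic
est = lambda msgs: sum(len(m["content"]) // 4 for m in msgs)
msgs = [{"role": "user", "content": "x" * 400} for _ in range(10)]
trimmed = middle_out(msgs, est, budget=500)
print(len(trimmed))  # 5
```

The first and last messages survive the trim, which matters because system prompts and the latest user turn usually carry the most signal.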
For `fallback`, Helicone picks the second candidate if needed. When under the limit, Helicone normalizes the model to the primary. If your body lacks a `model` field, set `Helicone-Model-Override`.

Quick Start

Add the `Helicone-Token-Limit-Exception-Handler` header to enable a strategy.
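For instance, a request's headers might look like the following. The `Helicone-Token-Limit-Exception-Handler` and `Helicone-Model-Override` headers come from this document; the auth header shapes and the model name are illustrative placeholders, so adapt them to your setup.

```python
# Sketch: enable the truncate strategy via request headers.
# Key/model values below are placeholders, not working credentials.
headers = {
    "Authorization": "Bearer <PROVIDER_API_KEY>",
    "Helicone-Auth": "Bearer <HELICONE_API_KEY>",
    "Helicone-Token-Limit-Exception-Handler": "truncate",
    # Optional: used for token estimation when the body lacks a model field
    "Helicone-Model-Override": "gpt-4o-mini",
}
print(headers["Helicone-Token-Limit-Exception-Handler"])  # truncate
```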
Configuration

Enable and control via headers:

- `Helicone-Token-Limit-Exception-Handler`: one of `truncate`, `middle-out`, `fallback`.
- `Helicone-Model-Override` (optional): used for token estimation and model selection when the request body doesn't include a `model` or you need to override it.

Fallback Model Selection
- Provide candidates in the body: `model: "primary, fallback"`
- Helicone chooses the fallback when input exceeds the allowed context
- When under the limit, Helicone normalizes the `model` to the primary
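The selection logic described by these bullets can be sketched as a small function. This mirrors the comma-separated `"primary, fallback"` convention above but is a simplified illustration, not Helicone's implementation.

```python
def resolve_model(model_field, over_limit):
    """Pick the fallback candidate when the input is over the limit,
    otherwise normalize to the primary (a sketch of the behavior
    described above)."""
    candidates = [m.strip() for m in model_field.split(",") if m.strip()]
    if over_limit and len(candidates) > 1:
        return candidates[1]
    return candidates[0]

print(resolve_model("gpt-4o, gpt-4o-mini", over_limit=True))   # gpt-4o-mini
print(resolve_model("gpt-4o, gpt-4o-mini", over_limit=False))  # gpt-4o
```

Note that even under the limit the `model` field is rewritten to the primary alone, so the provider never sees the comma-separated list.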
Notes
- Token estimation is heuristic and provider-aware; behavior is best-effort across request shapes.
- Allowed context accounts for requested completion tokens (e.g., `max_tokens`).
- Changes are applied before the provider call; your logged request reflects the applied strategy.
Need more help?
Additional questions or feedback? Reach out to [email protected] or schedule a call with us.