Documentation Index
Fetch the complete documentation index at https://docs.helicone.ai/llms.txt and use it to discover all available pages before exploring further.
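For example, a minimal sketch that downloads the index (assuming the requests package is installed):
import requests

# Fetch the documentation index and print the list of available pages
index = requests.get("https://docs.helicone.ai/llms.txt", timeout=10)
index.raise_for_status()
print(index.text)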
Introduction
LiteLLM is an open-source, self-hosted interface for calling many LLM APIs through a single OpenAI-style completion format.
Integration Steps
Set your Helicone API key in a .env file:
HELICONE_API_KEY=sk-helicone-...
Then install LiteLLM and python-dotenv:
pip install litellm python-dotenv
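To fail fast if the key is missing, you can verify the environment before making any calls; a quick sketch (load_dotenv reads the .env file into the process environment):
import os
from dotenv import load_dotenv

load_dotenv()  # read HELICONE_API_KEY from .env

# Raise immediately if the key is missing, rather than erroring on the first request
assert os.getenv("HELICONE_API_KEY"), "HELICONE_API_KEY is not set"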
Use LiteLLM with Helicone
Add the helicone/ prefix to any model name to log requests through Helicone:
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()
# Route through Helicone by adding "helicone/" prefix
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    api_key=os.getenv("HELICONE_API_KEY")
)
print(response.choices[0].message.content)
Complete Working Examples
Basic Completion
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()
# Simple completion
response = completion(
    model="helicone/gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a fun fact about space"}],
    api_key=os.getenv("HELICONE_API_KEY")
)
print(response.choices[0].message.content)
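LiteLLM also ships an async counterpart, acompletion, which accepts the same arguments; a sketch for use inside an async application:
import os
import asyncio
from litellm import acompletion
from dotenv import load_dotenv
load_dotenv()

async def main():
    # Same arguments as completion(), awaited instead of blocking
    response = await acompletion(
        model="helicone/gpt-4o-mini",
        messages=[{"role": "user", "content": "Tell me a fun fact about space"}],
        api_key=os.getenv("HELICONE_API_KEY"),
    )
    print(response.choices[0].message.content)

asyncio.run(main())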
Streaming Responses
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()
# Streaming example
response = completion(
    model="helicone/claude-4.5-sonnet",
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint"}],
    stream=True,
    api_key=os.getenv("HELICONE_API_KEY")
)
print("🤖 Assistant (streaming):")
for chunk in response:
    if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")
Custom Properties and Session Tracking
Add metadata to track and filter your requests:
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()
response = completion(
    model="helicone/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather like?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Session-Id": "session-abc-123",
        "Helicone-Session-Name": "Weather Assistant",
        "Helicone-User-Id": "user-789",
        "Helicone-Property-Environment": "production",
        "Helicone-Property-App-Version": "2.1.0",
        "Helicone-Property-Feature": "weather-query"
    }
)
print(response.choices[0].message.content)
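When every request in a session carries the same metadata, it can help to centralize it in one place; a minimal sketch (the track_completion helper and SESSION_METADATA defaults are hypothetical, not part of LiteLLM or Helicone):
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()

# Hypothetical per-session defaults, merged into every request's metadata
SESSION_METADATA = {
    "Helicone-Session-Id": "session-abc-123",
    "Helicone-User-Id": "user-789",
}

def track_completion(model, messages, **metadata):
    # Per-call metadata overrides the session defaults on key collisions
    return completion(
        model=model,
        messages=messages,
        api_key=os.getenv("HELICONE_API_KEY"),
        metadata={**SESSION_METADATA, **metadata},
    )

response = track_completion(
    "helicone/gpt-4o-mini",
    [{"role": "user", "content": "And tomorrow?"}],
)
print(response.choices[0].message.content)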
Provider Selection and Fallback
Helicone’s AI Gateway supports automatic failover between providers:
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()
# Automatic routing (cheapest provider)
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)
# Manual provider selection
response = completion(
    model="helicone/claude-4.5-sonnet/anthropic",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)
# Multiple-provider fallback chain:
# try OpenAI first, then Anthropic if it fails
response = completion(
    model="helicone/gpt-4o/openai,claude-4.5-sonnet/anthropic",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)
Advanced Features
Caching
Enable caching to reduce costs and latency for repeated requests:
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()
# Enable caching for this request
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Cache-Enabled": "true"
    }
)
print(response.choices[0].message.content)
# Subsequent identical requests are served from the cache
response2 = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Cache-Enabled": "true"
    }
)
print(response2.choices[0].message.content)
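Helicone's caching feature also documents headers for controlling cache duration and partitioning; the header names below are assumptions drawn from that feature and should be verified against the Caching page linked at the end of this guide:
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()

response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",    # assumed: cache entries expire after one hour
        "Helicone-Cache-Seed": "user-789",  # assumed: maintain a separate cache state per seed
    }
)
print(response.choices[0].message.content)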
Rate Limiting
Apply rate limiting policies to control request rates:
import os
from litellm import completion
from dotenv import load_dotenv
load_dotenv()
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Rate-Limit-Policy": "basic-100"
    }
)
print(response.choices[0].message.content)
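Requests that exceed the policy are rejected, typically with an HTTP 429; a minimal retry-with-backoff sketch (the broad except clause is a simplification: catch the specific rate-limit exception your LiteLLM version raises):
import os
import time
from litellm import completion
from dotenv import load_dotenv
load_dotenv()

def complete_with_retry(messages, retries=3):
    # Retry with exponential backoff when the gateway rejects a request
    for attempt in range(retries):
        try:
            return completion(
                model="helicone/gpt-4o",
                messages=messages,
                api_key=os.getenv("HELICONE_API_KEY"),
                metadata={"Helicone-Rate-Limit-Policy": "basic-100"},
            )
        except Exception:  # simplification: narrow this to the rate-limit error type
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...

response = complete_with_retry([{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)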
Learn More
AI Gateway Overview: learn about Helicone's AI Gateway features and capabilities.
Provider Routing: configure intelligent routing and automatic failover.
Model Registry: browse all available models and providers.
Custom Properties: add metadata to track and filter your requests.
Sessions: track multi-turn conversations and user sessions.
Rate Limiting: configure rate limits for your applications.
Caching: reduce costs and latency with intelligent caching.
LiteLLM Documentation: the official LiteLLM documentation.