Introduction

LiteLLM is an open-source library that provides a unified interface for calling LLM APIs across many providers.

Integration Steps

1. Create a Helicone account and generate an API key.

2. Set your Helicone API key as an environment variable:

HELICONE_API_KEY=sk-helicone-...

3. Install LiteLLM and python-dotenv:

pip install litellm python-dotenv

4. Use LiteLLM with Helicone

Add the helicone/ prefix to any model name to log requests with Helicone:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Route through Helicone by adding "helicone/" prefix
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

print(response.choices[0].message.content)

5. While you're here, why not give us a star on GitHub? It helps us a lot!

Complete Working Examples

Basic Completion

import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Simple completion
response = completion(
    model="helicone/gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a fun fact about space"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

print(response.choices[0].message.content)
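
The response object follows the OpenAI response format, so token usage is reported on it as well. A minimal sketch, assuming the usage field is populated the way the OpenAI SDK populates it:
# Inspect the token usage reported for this request
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")
print(f"Total tokens: {usage.total_tokens}")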

Streaming Responses

import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Streaming example
response = completion(
    model="helicone/claude-4.5-sonnet",
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint"}],
    stream=True,
    api_key=os.getenv("HELICONE_API_KEY")
)

print("🤖 Assistant (streaming):")
for chunk in response:
    if hasattr(chunk.choices[0].delta, 'content') and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print("\n")

Custom Properties and Session Tracking

Add metadata to track and filter your requests:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

response = completion(
    model="helicone/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather like?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Session-Id": "session-abc-123",
        "Helicone-Session-Name": "Weather Assistant",
        "Helicone-User-Id": "user-789",
        "Helicone-Property-Environment": "production",
        "Helicone-Property-App-Version": "2.1.0",
        "Helicone-Property-Feature": "weather-query"
    }
)

print(response.choices[0].message.content)
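
Reusing the same Helicone-Session-Id across related calls groups them into a single session in Helicone. A minimal sketch of a follow-up request in the same session, using only the metadata keys shown above (the conversation content is illustrative):
# Follow-up request in the same session: reuse the session metadata
followup = completion(
    model="helicone/gpt-4o-mini",
    messages=[
        {"role": "user", "content": "What's the weather like?"},
        {"role": "assistant", "content": response.choices[0].message.content},
        {"role": "user", "content": "And tomorrow?"}
    ],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Session-Id": "session-abc-123",
        "Helicone-Session-Name": "Weather Assistant",
        "Helicone-User-Id": "user-789"
    }
)

print(followup.choices[0].message.content)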

Provider Selection and Fallback

Helicone’s AI Gateway supports automatic failover between providers:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Automatic routing (cheapest provider)
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

# Manual provider selection
response = completion(
    model="helicone/claude-4.5-sonnet/anthropic",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)

# Multiple provider fallback chain
# Try OpenAI first, then Anthropic if it fails
response = completion(
    model="helicone/gpt-4o/openai,claude-4.5-sonnet/anthropic",
    messages=[{"role": "user", "content": "Hello!"}],
    api_key=os.getenv("HELICONE_API_KEY")
)
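
If every provider in the fallback chain fails, the call raises an exception like any other LiteLLM request. A minimal sketch of guarding the call, assuming you prefer to degrade gracefully rather than let the error propagate:
# Handle the case where all providers in the chain fail
try:
    response = completion(
        model="helicone/gpt-4o/openai,claude-4.5-sonnet/anthropic",
        messages=[{"role": "user", "content": "Hello!"}],
        api_key=os.getenv("HELICONE_API_KEY")
    )
    print(response.choices[0].message.content)
except Exception as e:
    # All providers failed (or another request error occurred)
    print(f"Request failed: {e}")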

Advanced Features

Caching

Enable caching to reduce costs and latency for repeated requests:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

# Enable caching for this request
response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Cache-Enabled": "true"
    }
)

print(response.choices[0].message.content)

# Subsequent identical requests will be served from cache
response2 = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Cache-Enabled": "true"
    }
)

print(response2.choices[0].message.content)
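
Cache hits normally return much faster than fresh completions. One way to confirm the cache is working is to time both calls; a small sketch using only the standard library (exact timings will vary):
import time

def timed_completion():
    # Identical request with caching enabled, so repeats can be served from cache
    start = time.perf_counter()
    result = completion(
        model="helicone/gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
        api_key=os.getenv("HELICONE_API_KEY"),
        metadata={"Helicone-Cache-Enabled": "true"}
    )
    return result, time.perf_counter() - start

first, first_seconds = timed_completion()
second, second_seconds = timed_completion()
print(f"First call:  {first_seconds:.2f}s")
print(f"Second call: {second_seconds:.2f}s (cached responses usually return faster)")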

Rate Limiting

Apply rate limiting policies to control request rates:
import os
from litellm import completion
from dotenv import load_dotenv

load_dotenv()

response = completion(
    model="helicone/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    api_key=os.getenv("HELICONE_API_KEY"),
    metadata={
        "Helicone-Rate-Limit-Policy": "basic-100"
    }
)

print(response.choices[0].message.content)
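
When a request exceeds the policy, it is rejected rather than queued. A minimal sketch of handling that case, assuming LiteLLM surfaces the rejection as its OpenAI-style RateLimitError (catch a generic exception if your version maps it differently):
from litellm.exceptions import RateLimitError

try:
    response = completion(
        model="helicone/gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        api_key=os.getenv("HELICONE_API_KEY"),
        metadata={"Helicone-Rate-Limit-Policy": "basic-100"}
    )
    print(response.choices[0].message.content)
except RateLimitError:
    # The rate limit policy rejected this request; back off and retry later
    print("Rate limit exceeded for policy 'basic-100'; retrying later")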