Retrying requests is a common best practice when dealing with overloaded servers or hitting rate limits. These issues typically manifest as HTTP status codes 429 (Too Many Requests) and 500 (Internal Server Error). For more information on error codes, see the OpenAI API error codes documentation.

Exponential Backoff

To effectively deal with retries, we use a strategy called exponential backoff. Exponential backoff involves increasing the wait time between retries exponentially, which helps to spread out the request load and gives the server a chance to recover. This is done by multiplying the wait time by a factor (default is 2) for each subsequent retry.
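The strategy above can be sketched in a few lines of Python. This is an illustrative client-side retry loop, not Helicone's implementation; the function and parameter names are ours, chosen to mirror the headers described later in this page.

```python
import time


def retry_with_backoff(fn, max_retries=3, factor=2.0,
                       min_timeout=1.0, max_timeout=30.0):
    """Call fn(), retrying on failure with exponentially growing waits.

    The wait starts at min_timeout seconds and is multiplied by
    `factor` after each failed attempt, capped at max_timeout.
    """
    wait = min_timeout
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            time.sleep(wait)
            wait = min(wait * factor, max_timeout)
```

With the default factor of 2, the waits grow 1s, 2s, 4s, ... until the cap, which spreads retried requests out instead of hammering an already overloaded server.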

Quick Start

To get started, set the Helicone-Retry-Enabled header to true:

curl https://oai.helicone.ai/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'Helicone-Auth: Bearer YOUR_API_KEY' \
  -H 'Helicone-Retry-Enabled: true' \
  -d '{
    "model": "text-davinci-003",
    "prompt": "How do I enable retries?"
  }'
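The same request can be built in Python. This is a minimal sketch: the headers and body mirror the curl example, and the commented-out URL assumes Helicone's OpenAI proxy endpoint; substitute your own key and endpoint.

```python
import json

# Headers for a completion request routed through Helicone, with retries
# enabled. YOUR_API_KEY is a placeholder.
headers = {
    "Content-Type": "application/json",
    "Helicone-Auth": "Bearer YOUR_API_KEY",
    "Helicone-Retry-Enabled": "true",
}

payload = json.dumps({
    "model": "text-davinci-003",
    "prompt": "How do I enable retries?",
})

# Send with any HTTP client, e.g.:
# requests.post("https://oai.helicone.ai/v1/completions",
#               headers=headers, data=payload)
```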

Advanced Usage

You can customize the behavior of the retries feature by setting additional headers in your request.

helicone-retry-num          Number of retries
helicone-retry-factor       Exponential backoff factor
helicone-retry-min-timeout  Minimum timeout (in milliseconds) between retries
helicone-retry-max-timeout  Maximum timeout (in milliseconds) between retries
Header values must be strings, for example "helicone-retry-num": "3".
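To see how these settings interact, the schedule of waits can be computed as below. Note this is our assumption of how the headers combine (the minimum timeout growing by the factor and capped at the maximum); the page does not spell out the exact formula Helicone uses.

```python
def backoff_schedule(num, factor, min_timeout_ms, max_timeout_ms):
    """Wait (in ms) before each retry: min_timeout grows by `factor`
    per attempt and is capped at max_timeout."""
    waits, wait = [], min_timeout_ms
    for _ in range(num):
        waits.append(min(wait, max_timeout_ms))
        wait *= factor
    return waits
```

For example, with "helicone-retry-num": "3", a factor of 2, a 1000 ms minimum, and a 3000 ms maximum, the waits are 1000 ms, 2000 ms, then 3000 ms (the third wait, 4000 ms, is clamped to the maximum).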