Introduction

The AI Gateway automatically retries transient errors from AI providers using configurable strategies, improving reliability without overwhelming providers with rapid successive requests.

Retries use smart failure detection with configurable maximum attempts and backoff timing, and can be configured either globally or per-router.

Why Use Retries

  • Improve reliability by automatically recovering from transient provider failures
  • Handle network issues by retrying requests that fail due to connectivity problems
  • Maintain user experience by transparently recovering from failures
  • Reduce manual intervention by automatically handling temporary service disruptions

Rate limit handling is automatic: when a provider returns a 429 status code, the AI Gateway temporarily removes that provider from the load balancing rotation. Retries are for other types of failures, such as 5xx errors or network issues.

Quick Start

1. Create your configuration

Create ai-gateway-config.yaml with basic retry configuration (3 retries with 50ms constant delay):

global:
  retries:
    strategy: "constant"
    delay: "50ms"
    max-retries: 3
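
The gateway also needs credentials for the providers it forwards requests to. A minimal sketch, assuming provider keys are read from standard environment variables such as OPENAI_API_KEY (check the provider configuration docs for the exact variable names), set before starting the gateway:

export OPENAI_API_KEY=<your-openai-api-key>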

2. Start the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml

3. Test retries

Send a request to a provider that might have transient failures. The gateway will automatically retry 5xx errors and network issues:

curl -X POST http://localhost:8080/ai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

✅ If the provider returns a 5xx error or the request hits a network issue, the gateway automatically retries up to 3 times with 50ms delays!

For complete configuration options and syntax, see the Configuration Reference.

Use Cases

Use case: Production API that needs high reliability and can tolerate slightly higher latency for better success rates.

global:
  retries:
    strategy: "constant"
    delay: "50ms"
    max-retries: 3

Total retry delay: roughly 150ms across 3 retries (about 50ms before each retry, plus jitter), not counting the time spent on the attempts themselves
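
Use case: Background or batch workloads that can tolerate much longer waits in exchange for a higher chance of eventual success. A sketch reusing the exponential strategy values shown later on this page; the exact delays are illustrative:

global:
  retries:
    strategy: "exponential"
    min-delay: "200ms"
    max-delay: "60s"
    max-retries: 5
    factor: 2.0

Total retry delay: roughly 6 seconds across 5 retries (200ms + 400ms + 800ms + 1.6s + 3.2s, plus jitter)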

How Retries Work

The AI Gateway automatically retries failed requests using the configured strategy with smart failure detection.

1. Request fails: the request fails with a retryable error (a 5xx server error or a network issue).

2. Wait period: the AI Gateway waits for the backoff period calculated from the configured strategy.

3. Retry request: the request is retried with the same parameters against the same provider.

4. Repeat or return: the process repeats until the request succeeds or the maximum number of retries is reached.
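
For example, with the Quick Start configuration above (constant strategy, 50ms delay, 3 retries): the initial attempt fails, the gateway waits roughly 50ms and retries, and if the third retry also fails the last error is returned to the client after roughly 150ms of added delay.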

Retry Strategies

Constant Strategy

Fixed delay between retry attempts with jitter to prevent thundering herd.

retries:
  strategy: "constant"
  delay: "50ms"
  max-retries: 3

Timing: ~50ms before each retry (roughly 50ms → 100ms → 150ms of cumulative delay) → fail

Exponential Strategy

Exponentially increasing delays with jitter and configurable bounds.

retries:
  strategy: "exponential"
  min-delay: "200ms"
  max-delay: "60s"
  max-retries: 5
  factor: 2.0

Timing: 200ms → 400ms → 800ms → 1.6s → 3.2s → fail
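
Assuming the delays follow the usual exponential backoff formula, the nth retry waits roughly min-delay × factor^(n-1), capped at max-delay and adjusted by jitter; that matches the 200ms, 400ms, 800ms, 1.6s, 3.2s progression above.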

What Gets Retried

  • 5xx Server Errors: ✅ retried (temporary provider issues). Examples: 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout
  • Network Transport Errors: ✅ retried (connection/network problems). Examples: connection refused, timeouts, DNS failures, TLS handshake errors
  • Stream Interruptions: ✅ retried (streaming response failures). Examples: stream ended unexpectedly, transport errors during streaming
  • 4xx Client Errors: ❌ not retried (request format/auth issues). Examples: 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 422 Unprocessable Entity
  • 429 Rate Limits: ❌ not retried (handled by load balancing). The provider is temporarily removed from rotation based on the Retry-After header
  • 2xx Success Responses: ❌ not retried (request succeeded). Examples: 200 OK, 201 Created, 202 Accepted
  • Auth/Config Errors: ❌ not retried (setup/configuration issues). Examples: invalid Helicone API keys, missing auth headers, provider not configured
  • Cache/Storage Errors: ❌ not retried (persistent storage issues). Examples: cache operation failures, malformed request/response bodies

Load Balancing Integration

Retries and load balancing work together to maximize reliability:

  • Per-request: When a request fails, retries attempt the same provider multiple times
  • Per-provider: If a provider keeps failing requests, health monitoring removes it from the load balancer
  • Result: New requests automatically go to healthy providers while failed requests still get retried

Jitter and Backoff

All retry strategies include automatic jitter to prevent thundering herd problems:

  • Constant strategy: ±25% random variation in delay timing
  • Exponential strategy: ±25% random variation at each backoff level
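
For example, with the 50ms constant delay from the Quick Start, each retry waits somewhere between roughly 37.5ms and 62.5ms rather than exactly 50ms.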

You can override global retry settings for specific routers by adding a retries section to individual router configurations.
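
A minimal sketch of a per-router override, assuming routers are defined under a top-level routers key and named something like my-router (see the Configuration Reference for the exact syntax):

routers:
  my-router:
    retries:
      strategy: "exponential"
      min-delay: "200ms"
      max-delay: "60s"
      max-retries: 5
      factor: 2.0

This router would use the exponential settings above, while any router without its own retries section keeps the global configuration.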