Ready to unlock the full power of the AI Gateway? This guide will walk you through creating custom routers with load balancing, caching, and rate limiting. You’ll go from basic routing to production-ready configurations.

Prerequisites: Make sure you’ve completed the main quickstart and have the gateway running with your API keys configured.

What Are Routers?

Think of routers as separate “virtual gateways” within your single AI Gateway deployment. Each router has its own:

  • URL endpoint - http://localhost:8080/router/{name}
  • Load balancing strategy - How requests are distributed across providers
  • Provider pool - Which LLM providers are available
  • Features - Caching, rate limiting, retries, and more

This lets you have different configurations for different use cases - all from one gateway deployment.
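
For example, once two routers are defined, your application picks between them just by changing the path segment in the base URL. A minimal sketch with the OpenAI Node SDK (the router names production and experiments here are only illustrative):

import { OpenAI } from "openai";

// Two clients pointed at two routers in the same gateway deployment.
// "production" and "experiments" are hypothetical router names.
const production = new OpenAI({
  baseURL: "http://localhost:8080/router/production",
  apiKey: "fake-api-key", // placeholder; the gateway holds the real provider keys
});

const experiments = new OpenAI({
  baseURL: "http://localhost:8080/router/experiments",
  apiKey: "fake-api-key",
});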

Create Your First Router

Step 1: Basic Router Setup

Let’s start with a basic router configuration. Create a file called ai-gateway-config.yaml:

routers:
  my-router:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic

What this does:

  • Creates a router named my-router
  • Available at http://localhost:8080/router/my-router
  • Uses latency-based load balancing between OpenAI and Anthropic
  • Automatically routes to whichever provider responds fastest
1. Save the configuration

Save the YAML above as ai-gateway-config.yaml in your current directory.

2. Restart the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml
3. Test your router

import { OpenAI } from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:8080/router/my-router",
  apiKey: "fake-api-key", // Required by SDK, but gateway handles real auth
});

const response = await openai.chat.completions.create({
  model: "openai/gpt-4o-mini",
  messages: [{ role: "user", content: "Hello from my custom router!" }],
});

console.log(response);
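
The gateway returns the standard OpenAI-compatible response shape, so you can pull out just the reply text as usual:

console.log(response.choices[0].message.content);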

🎉 Success! Your request was automatically load-balanced between OpenAI and Anthropic based on which responded faster.

Step 2: Add Intelligent Caching

Now let’s add caching to dramatically reduce costs and improve response times:

cache-store:
  in-memory: {}

routers:
  my-router:
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
    cache:
      directive: "max-age=3600"

What this adds:

  • Caches identical requests for 1 hour
  • Subsequent identical requests return instantly from cache
  • Can reduce costs by 90%+ for repeated requests
1. Update your configuration

Replace your ai-gateway-config.yaml with the configuration above.

2. Restart the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml
3. Test caching

Make the same request twice and notice the second one is much faster:

# First request - goes to provider
time curl -X POST http://localhost:8080/router/my-router/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "To be or not to be?"}]}'

# Second request - returns from cache instantly
time curl -X POST http://localhost:8080/router/my-router/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "To be or not to be?"}]}'

Step 3: Rate Limit per Environment

Real applications need different rate limits for different environments. Rate limiting in the AI Gateway works per-API-key, which requires authentication to identify users. Let’s set up Helicone authentication and create production and development routers with appropriate protections.

Authentication Required: Rate limiting is applied per-API-key, so you need Helicone authentication enabled to track and limit requests for different users.

First, get your Helicone API key:

  1. Go to Helicone Settings
  2. Click “Generate New Key”
  3. Copy the key (starts with sk-helicone-)
  4. Set it as an environment variable:
export HELICONE_CONTROL_PLANE_API_KEY="sk-helicone-your-api-key"

Now create the configuration with authentication and rate limits:

helicone:
  authentication: true
  observability: false # Set to true to enable observability

cache-store:
  in-memory: {}

routers:
  production:
    rate-limit:
      per-api-key:
        capacity: 1000
        refill-frequency: 1m # 1000 requests per minute
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
    cache:
      directive: "max-age=1800" # 30 minutes for production freshness

  development:
    rate-limit:
      per-api-key:
        capacity: 100
        refill-frequency: 1h # 100 requests per hour for cost safety
    load-balance:
      chat:
        strategy: latency
        providers:
          - openai
          - anthropic
    cache:
      directive: "max-age=7200" # 2 hours to reduce dev costs

What this creates:

Router        Endpoint               Rate Limit   Use Case
production    /router/production     1000/min     High-traffic customer requests
development   /router/development    100/hour     Cost-controlled development
1. Update configuration

Replace your ai-gateway-config.yaml with the multi-environment config above.

2. Restart the gateway

npx @helicone/ai-gateway@latest --config ai-gateway-config.yaml
3. Test your routers

Now requests require authentication. Test each environment:

# Production router
curl -X POST http://localhost:8080/router/production/chat/completions \
  -H "Authorization: Bearer sk-helicone-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Production test"}]}'

# Development router
curl -X POST http://localhost:8080/router/development/chat/completions \
  -H "Authorization: Bearer sk-helicone-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o-mini", "messages": [{"role": "user", "content": "Development test"}]}'

Step 4: Use in Your Applications

Just change the base URL to use different routers. Remember to use your Helicone API key for authentication:

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/router/production",
    api_key="sk-helicone-..."  # Your Helicone API key
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
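
The same pattern works with the Node SDK. One common approach is to select the router from an environment variable so identical code can target production or development (ROUTER_NAME and HELICONE_API_KEY are illustrative names, not settings the gateway requires):

import { OpenAI } from "openai";

// Pick the router at runtime; defaults to the cheaper development router.
const router = process.env.ROUTER_NAME ?? "development";

const client = new OpenAI({
  baseURL: `http://localhost:8080/router/${router}`,
  apiKey: process.env.HELICONE_API_KEY, // your sk-helicone-... key
});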

Key Concepts You’ve Learned

In this guide you worked with named routers, latency-based load balancing across providers, response caching with a cache store and cache directives, and per-API-key rate limiting for separate environments.

What’s Next?

You now have a solid foundation with custom routers! Here are the next steps to explore:

  • Deploy Your Router - learn how to deploy your router to production