Helicone leverages Cloudflare Workers, which run code on Cloudflare's global edge network, to provide a fast and reliable proxy for your LLM requests. Because each request is handled by the data center closest to your users, the proxy adds minimal latency.
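
In practice, sending traffic through the proxy usually amounts to a base URL change plus an authentication header. The snippet below is a minimal sketch that assumes the official openai Node SDK, Helicone's OpenAI-compatible gateway URL, and a Helicone-Auth header carrying your Helicone API key; check the Helicone docs for the exact values for your setup.

```typescript
// Minimal sketch: send OpenAI traffic through Helicone's edge proxy.
// The gateway URL and Helicone-Auth header are assumptions based on
// Helicone's typical setup; verify them against the current docs.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1", // proxy endpoint instead of api.openai.com
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini", // any chat model; shown here only as an example
  messages: [{ role: "user", content: "Hello from the nearest edge location!" }],
});

console.log(completion.choices[0].message.content);
```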

How Cloudflare Workers Minimize Latency

Cloudflare Workers are serverless functions that run on Cloudflare's global edge network. Your requests are processed at the edge, which shortens the distance data has to travel and significantly lowers latency. Workers run in V8 isolates, which are lightweight and have extremely fast startup times, eliminating cold starts and keeping response times low for your applications.
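
To make the pattern concrete, here is an illustrative Worker; it is a sketch of the general edge-proxy pattern, not Helicone's actual source. The fetch handler runs in a V8 isolate at the Cloudflare location nearest the caller and forwards the request to the upstream provider from there.

```typescript
// Illustrative edge-proxy sketch (not Helicone's actual implementation).
// The handler runs at the Cloudflare data center closest to the caller,
// so there is no cold start and no detour to a centralized server before
// the request is forwarded upstream.
export default {
  async fetch(request: Request): Promise<Response> {
    const started = Date.now();

    // Rewrite the incoming URL to point at the upstream LLM provider.
    const url = new URL(request.url);
    url.hostname = "api.openai.com";

    // Forward the request (method, headers, body) as-is from the edge.
    const upstream = await fetch(new Request(url.toString(), request));

    // Return the upstream response, annotating the time spent at the edge.
    const response = new Response(upstream.body, upstream);
    response.headers.set("x-edge-proxy-ms", String(Date.now() - started));
    return response;
  },
};
```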

Benchmarking Helicone’s Proxy Service

To demonstrate the negligible latency introduced by Helicone’s proxy, we conducted the following experiment:

  • We interleaved 500 requests with unique prompts to both OpenAI and Helicone.
  • Both endpoints received the same request within the same 1-second window, and we varied which endpoint was called first for each request.
  • We maximized the prompt context window to make these requests as large as possible.
  • We used the text-ada-001 model.
  • We logged the roundtrip latency for both sets of requests (a sketch of the measurement loop follows this list).

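The sketch below shows one way such an interleaved measurement could be implemented. The endpoints, auth, and prompt handling are hypothetical stand-ins rather than the actual benchmark harness behind the numbers reported below.

```typescript
// Hypothetical sketch of the interleaved latency measurement; endpoints,
// auth, and prompt handling are illustrative stand-ins, not the real harness.
const OPENAI = "https://api.openai.com";
const HELICONE = "https://oai.helicone.ai"; // assumed proxy gateway URL

async function timedRequest(baseURL: string, prompt: string): Promise<number> {
  const started = performance.now();
  await fetch(`${baseURL}/v1/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      // A real run through the proxy would also send its own auth header.
    },
    body: JSON.stringify({ model: "text-ada-001", prompt, max_tokens: 16 }),
  });
  return (performance.now() - started) / 1000; // roundtrip latency in seconds
}

async function benchmark(prompts: string[]) {
  const direct: number[] = [];
  const proxied: number[] = [];
  for (const [i, prompt] of prompts.entries()) {
    // Issue both requests back to back (within the same narrow window) and
    // alternate which endpoint is called first to avoid ordering bias.
    const heliconeFirst = i % 2 === 1;
    const first = timedRequest(heliconeFirst ? HELICONE : OPENAI, prompt);
    const second = timedRequest(heliconeFirst ? OPENAI : HELICONE, prompt);
    const [a, b] = await Promise.all([first, second]);
    proxied.push(heliconeFirst ? a : b);
    direct.push(heliconeFirst ? b : a);
  }
  return { direct, proxied };
}
```
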
Results

Statistic            OpenAI (s)   Helicone (s)
Mean                 2.21         2.21
Median               2.87         2.90
Standard Deviation   1.12         1.12
Min                  0.14         0.14
Max                  3.56         3.76
p10                  0.52         0.52
p90                  3.27         3.29

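For reference, these statistics are standard summaries of the per-request roundtrip times. The sketch below shows one way to compute them from a hypothetical array of latency samples; percentile definitions vary slightly at the tails.

```typescript
// Summary statistics over an array of roundtrip latencies (in seconds).
// Percentiles use the nearest-rank method; other definitions interpolate
// and can differ slightly at the tails.
function summarize(samples: number[]) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  const variance =
    sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / sorted.length;
  const percentile = (p: number) =>
    sorted[Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1)];

  return {
    mean,
    median: percentile(50),
    stdDev: Math.sqrt(variance),
    min: sorted[0],
    max: sorted[sorted.length - 1],
    p10: percentile(10),
    p90: percentile(90),
  };
}
```
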
The metrics show that Helicone’s latency closely matches that of direct requests to OpenAI. The slight differences at the right tail indicate a minimal overhead introduced by Helicone, which is negligible in most practical applications. This demonstrates that using Helicone’s proxy does not significantly impact the performance of your LLM requests.

FAQ