Helicone smoothly integrates streaming functionality and offers benefits that you can’t find with the standard OpenAI package!

When streaming, OpenAI doesn’t currently return usage statistics such as prompt and completion token counts. Helicone overcomes this limitation by estimating them with the gpt3-tokenizer package, which works with all tokenized OpenAI GPT models.

All you have to do is import helicone, or import openai through the Helicone package, and the rest of your code works as it did before:

from helicone.openai_proxy import openai

The following examples work with or without Helicone!

Streaming mode with synchronous requests

In this mode, the request is made synchronously, but the response is streamed.

for chunk in openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[{
        'role': 'user',
        'content': "Hello World!"
    }],
    stream=True
):
    # Each streamed chunk carries an incremental "delta" of the response
    content = chunk["choices"][0].get("delta", {}).get("content")
    if content is not None:
        print(content, end='')

Streaming mode with asynchronous requests

In this mode, the request is made asynchronously and the response is streamed. You’ll need to use the await keyword when calling openai.ChatCompletion.acreate, and an async for loop to iterate over the response.

# Run this inside an async function (coroutine)
async for chunk in await openai.ChatCompletion.acreate(
    model='gpt-3.5-turbo',
    messages=[{
        'role': 'user',
        'content': "Hello World!"
    }],
    stream=True
):
    content = chunk["choices"][0].get("delta", {}).get("content")
    if content is not None:
        print(content, end='')

Enhanced Streaming Support

Helicone now provides significantly improved streaming functionality with several key updates:

Stream Fixes and Improvements

We’ve made several improvements to our stream handling across different LLM providers:

  • Better handling of stream interruptions and reconnections
  • Enhanced error handling for streaming responses
  • Improved compatibility with different LLM provider streaming formats
  • More reliable token counting for streamed content
  • Accurate timing calculations for streamed responses

New Streaming Methods

The HeliconeManualLogger class now includes enhanced methods for working with streams:

  • logStream: Logs a streaming operation with full control over stream handling
  • logSingleStream: Simplified method for logging a single ReadableStream
  • logSingleRequest: Logs a single request with a response body (see the sketch below)
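For example, logSingleRequest can record a non-streaming call alongside your streamed ones. The snippet below is a minimal sketch rather than the definitive API: it assumes logSingleRequest accepts the request body and the stringified response body; check the @helicone/helpers reference for the exact signature and any optional header arguments.

import OpenAI from "openai";
import { HeliconeManualLogger } from "@helicone/helpers";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const helicone = new HeliconeManualLogger({ apiKey: process.env.HELICONE_API_KEY! });

async function generateAndLog(prompt: string) {
  const requestBody = {
    model: "gpt-4-turbo",
    messages: [{ role: "user" as const, content: prompt }],
  };

  // Non-streaming call: the full response body is available immediately
  const response = await openai.chat.completions.create(requestBody);

  // Assumed usage: pass the request body and the serialized response body
  await helicone.logSingleRequest(requestBody, JSON.stringify(response));

  return response.choices[0].message.content;
}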

Asynchronous Stream Parser

Our new asynchronous stream parser significantly improves performance when working with streamed responses:

  • Processes stream chunks asynchronously for reduced latency
  • Provides more reliable token counting for streamed responses
  • Accurately captures time-to-first-token metrics
  • Efficiently handles multiple concurrent streams

Using the Enhanced Streaming Features

OpenAI Streaming Example

import OpenAI from "openai";
import { HeliconeManualLogger } from "@helicone/helpers";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const helicone = new HeliconeManualLogger({
  apiKey: process.env.HELICONE_API_KEY!,
  headers: {
    "Helicone-Property-Environment": "production",
  },
});

async function generateStreamingResponse(prompt: string, userId: string) {
  const requestBody = {
    model: "gpt-4-turbo",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  };

  const response = await openai.chat.completions.create(requestBody);

  // For OpenAI's Node.js SDK, we can use the logSingleStream method
  const stream = response.toReadableStream();
  const [streamForUser, streamForLogging] = stream.tee();

  helicone.logSingleStream(requestBody, streamForLogging, {
    "Helicone-User-Id": userId,
  });

  return streamForUser;
}
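The returned streamForUser is a standard web ReadableStream, so it can be handed straight back to the caller. A usage sketch in a Next.js route handler (the route shape and request fields here are illustrative):

export async function POST(request: Request) {
  const { prompt, userId } = await request.json();

  // Stream the model output to the client while Helicone logs the tee'd copy
  const stream = await generateStreamingResponse(prompt, userId);
  return new Response(stream);
}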

Together AI Streaming Example

import Together from "together-ai";
import { HeliconeManualLogger } from "@helicone/helpers";

const together = new Together({ apiKey: process.env.TOGETHER_API_KEY });

const helicone = new HeliconeManualLogger({
  apiKey: process.env.HELICONE_API_KEY!,
  headers: {
    "Helicone-Property-Environment": "production",
  },
});

export async function generateWithTogetherAI(prompt: string, userId: string) {
  const body = {
    model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  };

  const response = await together.chat.completions.create(body);

  // Create two copies of the stream
  const [stream1, stream2] = response.tee();

  // Log the stream with Helicone
  helicone.logStream(
    body,
    async (resultRecorder) => {
      resultRecorder.attachStream(stream2.toReadableStream());
      return stream1;
    },
    { "Helicone-User-Id": userId }
  );

  return new Response(stream1.toReadableStream());
}

Anthropic Streaming Example

import Anthropic from "@anthropic-ai/sdk";
import { HeliconeManualLogger } from "@helicone/helpers";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const helicone = new HeliconeManualLogger({
  apiKey: process.env.HELICONE_API_KEY!,
  headers: {
    "Helicone-Property-Environment": "production",
  },
});

async function generateWithAnthropic(prompt: string, userId: string) {
  const requestBody = {
    model: "claude-3-opus-20240229",
    max_tokens: 1024, // required by the Anthropic Messages API
    messages: [{ role: "user", content: prompt }],
    stream: true,
  };

  const response = await anthropic.messages.create(requestBody);
  const stream = response.toReadableStream();
  const [userStream, loggingStream] = stream.tee();

  helicone.logSingleStream(requestBody, loggingStream, {
    "Helicone-User-Id": userId,
  });

  return userStream;
}

Calculating Costs with Streaming

For information on how to accurately calculate costs when using streaming features, please refer to our streaming usage guide.

You can enable accurate cost calculation by either:

  1. Including stream_options: { include_usage: true } in your request
  2. Adding the helicone-stream-usage: true header to your request

This ensures that token usage is properly tracked even when using streaming responses.
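For example, option 1 with the OpenAI SDK looks like the following minimal sketch (reusing the openai client from the earlier example; the model and prompt are placeholders):

const response = await openai.chat.completions.create({
  model: "gpt-4-turbo",
  messages: [{ role: "user", content: "Hello World!" }],
  stream: true,
  // Option 1: ask the provider to append a final usage chunk to the stream
  stream_options: { include_usage: true },
});

// Option 2 (alternative): when sending requests through the Helicone proxy,
// include the "helicone-stream-usage: true" header on the request instead.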

Vercel App Router Integration

When using Next.js App Router with Vercel, you can use the after function to log streaming responses without blocking the response to the client:

import { HeliconeManualLogger } from "@helicone/helpers";
import { after } from "next/server";
import Together from "together-ai";

export async function POST(request: Request) {
  const { prompt } = await request.json();

  const together = new Together({ apiKey: process.env.TOGETHER_API_KEY });
  const helicone = new HeliconeManualLogger({
    apiKey: process.env.HELICONE_API_KEY!,
  });

  // Create a streaming request
  const requestBody = {
    model: "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  };

  const response = await together.chat.completions.create(requestBody);
  const [stream1, stream2] = response.tee();

  // Log the stream after sending the response to the client
  after(helicone.logSingleStream(requestBody, stream2.toReadableStream()));

  return new Response(stream1.toReadableStream());
}

This approach ensures that logging doesn’t delay the response to the user, providing the best possible experience while still capturing all the necessary data.
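On the client, the streamed Response can then be consumed incrementally as chunks arrive. A minimal sketch (the /api/chat path and request body are illustrative):

const res = await fetch("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: "Hello World!" }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each value is a raw chunk of the streamed model output
  console.log(decoder.decode(value, { stream: true }));
}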

Learn More

For a comprehensive guide on using the Manual Logger with streaming functionality, check out our Manual Logger with Streaming cookbook.