OpenAI Non-Streaming

Non-streaming requests are requests to the OpenAI API in which the entire response is delivered in a single payload rather than as a series of streamed chunks.

For non-streaming requests, OpenAI includes a "usage" object in the response that reports the number of prompt tokens, completion tokens, and total tokens used.

Here is an example of how the "usage" object might look in a response:

"usage": {
	"prompt_tokens": 11,
	"completion_tokens": 9,
	"total_tokens": 20
}

We capture these token counts and estimate the cost by applying OpenAI's pricing for the model named in the response body.
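A minimal sketch of that estimate, assuming prompt and completion tokens are priced separately per model: cost = prompt_tokens × prompt price + completion_tokens × completion price, looked up by the "model" field of the response body. The model name "example-model", its prices, and the function estimateCost are hypothetical placeholders, not OpenAI's actual rates or Helicone's implementation.

```typescript
// Token counts as they appear in the "usage" object of a non-streaming response.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// The parts of a non-streaming response body this sketch reads.
interface CompletionResponse {
  model: string;
  usage?: Usage;
}

// Prices in USD per one million tokens.
interface ModelPrice {
  promptPerMillion: number;
  completionPerMillion: number;
}

// HYPOTHETICAL pricing table: the model name and prices are placeholders, not OpenAI's rates.
const PRICES = new Map<string, ModelPrice>([
  ["example-model", { promptPerMillion: 1, completionPerMillion: 2 }],
]);

// Returns the estimated cost in USD, or null when the response has no usage data
// or names a model missing from the table (no estimate is better than a wrong one).
function estimateCost(response: CompletionResponse): number | null {
  const price = PRICES.get(response.model);
  if (price === undefined || response.usage === undefined) return null;
  const { prompt_tokens, completion_tokens } = response.usage;
  return (
    (prompt_tokens * price.promptPerMillion + completion_tokens * price.completionPerMillion) /
    1_000_000
  );
}
```

With the example usage above (11 prompt tokens, 9 completion tokens) and these placeholder prices, estimateCost returns (11 × 1 + 9 × 2) / 1,000,000 = 0.000029 USD. A Map is used instead of a plain object so that model names such as "constructor" cannot accidentally match inherited object properties.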

OpenAI Streaming

Unlike non-streaming requests, OpenAI streaming requests do not return the "usage" object. We therefore count tokens ourselves with third-party tokenizers such as tiktoken. The calculation is more complex for chat messages, because the API adds hidden formatting tokens around each message that the user never sees. To estimate prompt and completion tokens accurately, we reverse-engineered how these hidden tokens are counted.
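The sketch below shows the general shape of this counting, not Helicone's reverse-engineered values. The tokenizer is passed in as a function (in practice a tiktoken-style encoder for the request's model). The overhead constants (3 tokens per message, 1 extra per "name", 3 to prime the assistant's reply) are an assumption taken from the heuristic OpenAI has published for some gpt-3.5-turbo and gpt-4 snapshots; they differ between models. Completion tokens are counted by concatenating the choices[0].delta.content pieces of the streamed chunks and tokenizing the result once. All function and constant names are hypothetical.

```typescript
// A chat message as sent in the request body.
interface ChatMessage {
  role: string;
  content: string;
  name?: string;
}

// The subset of a streamed chat chunk used here: each chunk carries a text delta.
interface StreamChunk {
  choices: { delta: { content?: string | null } }[];
}

// Stand-in for a real tokenizer (e.g. a tiktoken encoder for the request's model).
type CountTokens = (text: string) => number;

// ASSUMPTION: hidden-token overheads from OpenAI's published heuristic for some chat models.
const TOKENS_PER_MESSAGE = 3; // formatting tokens wrapped around every message
const TOKENS_PER_NAME = 1; // extra token when a message carries a "name"
const REPLY_PRIMING_TOKENS = 3; // tokens that open the assistant's reply

function countPromptTokens(messages: ChatMessage[], countTokens: CountTokens): number {
  let total = 0;
  for (const message of messages) {
    // Visible fields are tokenized; the per-message wrapper is added on top.
    total += TOKENS_PER_MESSAGE;
    total += countTokens(message.role);
    total += countTokens(message.content);
    if (message.name !== undefined) {
      total += countTokens(message.name) + TOKENS_PER_NAME;
    }
  }
  return total + REPLY_PRIMING_TOKENS;
}

function countCompletionTokens(chunks: StreamChunk[], countTokens: CountTokens): number {
  // Reassemble the full completion text from the streamed deltas, skipping
  // chunks with no choices or no content, then tokenize it once.
  const text = chunks.map((chunk) => chunk.choices[0]?.delta.content ?? "").join("");
  return countTokens(text);
}
```

Tokenizing the reassembled text once, rather than summing per-chunk counts, avoids miscounting when a chunk boundary splits what the tokenizer would treat as a single token.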

Anthropic Requests

For Anthropic requests, there is no supported way to count tokens in TypeScript, so we count them with a separate Python server instead. For more discussion and details on this topic, see our comments in this thread: https://github.com/anthropics/anthropic-sdk-typescript/issues/16
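As a rough illustration of that split, the TypeScript side can send the text to the Python service over HTTP and read back a count. Everything about the service contract here is hypothetical: the /count_tokens path, the { text } request body, the { count } response, and the function name countTokensViaService are illustrative and are not Helicone's actual API.

```typescript
// HYPOTHETICAL request/response contract for a tokenizer service written in Python.
interface CountTokensRequest {
  text: string;
}
interface CountTokensResponse {
  count: number;
}

// Minimal fetch-shaped dependency, so production code can pass the built-in fetch
// and tests can pass a stub.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function countTokensViaService(
  text: string,
  baseUrl: string,
  fetchImpl: FetchLike,
): Promise<number> {
  const payload: CountTokensRequest = { text };
  const res = await fetchImpl(`${baseUrl}/count_tokens`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) {
    throw new Error(`tokenizer service returned HTTP ${res.status}`);
  }
  // Validate the shape instead of trusting the remote service.
  const data = (await res.json()) as Partial<CountTokensResponse>;
  if (typeof data.count !== "number") {
    throw new Error("tokenizer service response is missing a numeric count");
  }
  return data.count;
}
```

In Node 18 or later, where fetch is built in, globalThis.fetch can be passed as fetchImpl. Injecting the fetch function keeps the client testable without a running Python server.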

Developer

For a detailed look at how we calculate LLM costs, see: https://github.com/Helicone/helicone/tree/main/costs

Please note that these methods reflect our current understanding and may change as APIs and token-counting methods evolve.

Questions?

Questions or feedback? Reach out to help@helicone.ai or schedule a call with us.