What are prompt experiments?

Helicone’s prompt experiments are designed to help you tune your LLM prompt, test it on production data, and verify your iterations with quantifiable data.

In Helicone, you can define and run LLM-as-a-judge or custom evaluators, then compare the results between prompts side-by-side to determine the best one for production.

Why experiment with prompts?

  • Prevent regressions: Prompt engineering is iterative, and engineers want to make sure each prompt change doesn't break cases that already work.
  • Keep up with model updates: LLMs are sensitive to prompt changes. As new models with more advanced capabilities come out, prompts that worked previously can become less effective.
  • Quick feedback loop: Instead of waiting for user feedback, get feedback immediately and adjust the prompt before pushing to production.
Here's the typical workflow:

  1. Create a new experiment in Helicone
  2. Create new prompt variations
  3. Add input rows
  4. Create and run custom evaluators
  5. Push changes to production

Quick Start

1. Create a new experiment

There are a few ways to do this:

Method 1: Start with a prompt

If you have an existing prompt in Helicone, head to the Experiment tab. Click Start from a prompt, then choose the desired prompt version.

Method 2: Start with a request

If you don’t have an existing prompt, we recommend choosing this method or starting from scratch (method 3).

Head to the Requests tab. Open the desired request, then click on the experiments icon. You should see an experiment being generated for the prompt associated with this request.

Method 3: Start from scratch

Head to the Experiment tab, then click on Start from scratch. A helper prompt will be generated for you; you can edit it by clicking on the cell.

2. Create new prompt variations

To create a new prompt, click Add column and select a prompt that you want to fork from.

Keep in mind that you can only fork from an existing prompt in the Experiment.

Add prompt variables

Type {{ input_name }} to add an input variable to your prompt. Input variables will appear in the Inputs column.
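
As an illustration of what these placeholders do, here is a minimal TypeScript sketch of the substitution idea. The `fillTemplate` helper and the variable names are hypothetical; in Helicone, the substitution happens for you when an input row is applied to the prompt.

```typescript
// Hypothetical helper, for illustration only: replaces {{ variable }}
// placeholders in a prompt template with values from an input row.
function fillTemplate(template: string, inputs: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    inputs[name] ?? match // leave unknown variables untouched
  );
}

// A prompt template with three input variables.
const prompt =
  "Summarize the following {{ document_type }} in a {{ tone }} tone:\n{{ content }}";

console.log(
  fillTemplate(prompt, {
    document_type: "support ticket",
    tone: "friendly",
    content: "My order arrived damaged and nobody has replied to my emails.",
  })
);
```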

3. Add input rows

The next step is to import golden datasets or your production request data into Helicone. There are four ways:

  • Add manual inputs: Manually enter values for each input variable you defined.
  • Select an input set: Select production request data that matches the same prompt ID.
  • Random prod: Randomly select any number of production requests. We've written about why we recommend this approach.
  • Add from a dataset (coming soon): Soon, you'll be able to use datasets created in Helicone to test your prompt.
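
To make the relationship between input rows and prompt variables concrete, here is a small, hypothetical example of what a few manually added rows could contain for the `document_type`, `tone`, and `content` variables used in the earlier sketch; the values are made up.

```typescript
// Hypothetical input rows: each row supplies a value for every prompt variable.
const inputRows: Array<Record<string, string>> = [
  {
    document_type: "support ticket",
    tone: "friendly",
    content: "My order arrived damaged and nobody has replied to my emails.",
  },
  {
    document_type: "bug report",
    tone: "neutral",
    content: "The app crashes whenever I try to upload a PNG larger than 5 MB.",
  },
];
```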

4. Create and run custom evaluators

  1. Toggle on `Show scores`.

  2. Under the dropdown, select `Create new custom evaluators`.

  3. Create the evaluator. On the side panel, you can create a new evaluator. We currently support LLM-as-a-judge; Python and TypeScript support is coming soon. You can add as many evaluators as you like and run them all at the same time. A conceptual sketch of LLM-as-a-judge scoring follows these steps.

  4. Run the evaluators. Once you add an evaluator, you will see a warning prompting you to re-run evaluation. Click Run Evaluators to see the scores on the graph.

  5. (Optional) View the score breakdown. To see scores broken down by input, click on an evaluator (humor in this example). Individual scores will appear in each cell; cells that score above average get a green indicator, and cells below average get a red indicator.
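
For context on what an LLM-as-a-judge evaluator does conceptually, here is a rough TypeScript sketch of the pattern: a judge model scores an output against a rubric and returns a number. This is only an illustration of the technique, not Helicone's implementation; inside an experiment, Helicone defines and runs the evaluator for you. The sketch assumes the official `openai` SDK and an `OPENAI_API_KEY` in the environment.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set

// Hypothetical LLM-as-a-judge: ask a model to score an output from 1 to 10
// against a plain-language rubric, then parse the score.
async function judgeOutput(output: string, rubric: string): Promise<number> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "You are an evaluator. Reply with a single integer from 1 to 10.",
      },
      {
        role: "user",
        content: `Rubric: ${rubric}\n\nOutput to evaluate:\n${output}`,
      },
    ],
  });
  return Number(response.choices[0].message.content?.trim());
}
```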

5. Push changes to production

The more prompts you create, the more data points you will see on the scores graph. Keep in mind that prompt engineering is an iterative process; the more inputs you test, the more edge cases your new prompts will cover. Once you are happy with a prompt, copy it into your code for production.
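
Once the winning prompt is in your code, it helps to keep routing production calls through Helicone so the resulting requests are available for your next round of experiments. Below is a minimal sketch using the OpenAI SDK with Helicone's proxy integration; the model name and prompt text are placeholders for whatever you settled on.

```typescript
import OpenAI from "openai";

// Route requests through Helicone so production data keeps flowing in
// for future experiments.
const openai = new OpenAI({
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// The prompt that won your experiment, with its variables filled in.
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "user",
      content:
        "Summarize the following support ticket in a friendly tone:\nMy order arrived damaged...",
    },
  ],
});

console.log(completion.choices[0].message.content);
```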