Eval Scores

When running evaluation frameworks to measure model performance, you need visibility into how well your AI applications are performing across different metrics. Scores let you report evaluation results from any framework to Helicone, providing centralized observability for accuracy, hallucination rates, helpfulness, and custom metrics.

Helicone doesn't run evaluations for you; it is not an evaluation framework. Instead, it gives you one place to report and analyze results from any framework (such as RAGAS, LangSmith, or custom evaluations), so all your evaluation metrics are observable together.

Why use Scores

Centralize evaluation results: Report scores from any evaluation framework for unified monitoring and analysis
Track model performance over time: Visualize how accuracy, hallucination rates, and other metrics evolve
Compare experiments side-by-side: Evaluate different prompts, models, or configurations with consistent metrics

Quick Start

Run your evaluation

Use your evaluation framework or custom logic to assess model responses and generate scores (integers or booleans) for metrics like accuracy, helpfulness, or safety.
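As a minimal sketch of what "custom logic" could look like, the function below scores a single response with two simple checks: exact-match accuracy against a reference answer and a keyword-based safety flag. The function name, the blocklist, and the metric names are illustrative assumptions, not part of Helicone or any evaluation framework. The only constraint taken from this page is that each score ends up as an integer or a boolean.

```python
def evaluate_response(response_text: str, reference: str, blocklist: set[str]) -> dict:
    """Score one model response with simple, illustrative checks (hypothetical helper)."""
    normalized = response_text.strip().lower()

    # Exact-match accuracy on a 0-100 integer scale
    accuracy = 100 if normalized == reference.strip().lower() else 0

    # Boolean safety flag: True when no blocklisted word appears in the response
    words = set(normalized.split())
    is_safe = words.isdisjoint(blocklist)

    return {"accuracy": accuracy, "is_safe": is_safe}


scores = evaluate_response("Paris", "paris", blocklist={"password"})
print(scores)  # {'accuracy': 100, 'is_safe': True}
```

A real evaluation would usually replace these checks with a framework metric or an LLM-as-judge call; the returned dictionary keeps the same shape either way.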

Report scores to Helicone

Send evaluation results using the Helicone API:

// Get the request ID from the response headers of an LLM call made through Helicone
const requestId = response.headers.get("helicone-id");

// Report evaluation scores
await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    scores: {
      "accuracy": 92,        // Integer values required
      "hallucination": 8,    // Converted to integers (0.08 * 100)
      "helpfulness": 85,
      "is_safe": true        // Booleans supported
    }
  })
});
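For Python callers, the same request can be sketched with only the standard library. The helper names `build_score_request` and `report_scores`, the 10-second timeout, and the placeholder key are assumptions made for illustration; the URL, headers, and body shape are copied from the example above. Building the request separately from sending it makes the payload easy to inspect before anything goes over the network.

```python
import json
import urllib.request


def build_score_request(request_id: str, scores: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the scores endpoint shown above."""
    return urllib.request.Request(
        url=f"https://api.helicone.ai/v1/request/{request_id}/score",
        data=json.dumps({"scores": scores}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def report_scores(request_id: str, scores: dict, api_key: str) -> int:
    """Send the scores and return the HTTP status code (urlopen raises HTTPError on 4xx/5xx)."""
    request = build_score_request(request_id, scores, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Inspect the request without sending it (placeholder ID and key)
req = build_score_request("your-request-id-here", {"accuracy": 92, "is_safe": True}, "sk-placeholder")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling `report_scores(...)` with a real request ID and API key performs the actual POST.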

Alternative: Add scores via dashboard

View score analytics

Analyze evaluation results in the Helicone dashboard to track performance trends, compare experiments, and identify areas for improvement.

Scores are processed with a 10-minute delay by default to allow for analytics aggregation.

API Format

Request Structure

The scores API expects this exact format:

Field	Type	Description	Required	Example
`scores`	`object`	Key-value pairs of evaluation metrics	✅ Yes	`{"accuracy": 92}`

Score Values

Type	Description	Example
`integer`	Numeric scores (no decimals)	`92`, `85`, `0`
`boolean`	Pass/fail or true/false metrics	`true`, `false`

Float values like 0.92 are rejected. Convert to integers: 0.92 → 92
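Because many frameworks emit fractions between 0 and 1, a small conversion step before reporting can help. The sketch below is one possible helper (the name `to_helicone_scores` is hypothetical) and assumes every float is a fraction on a 0-1 scale, as in the 0.92 → 92 example. Integers and booleans pass through unchanged, and anything else is rejected locally rather than by the API.

```python
def to_helicone_scores(raw: dict) -> dict:
    """Convert evaluation outputs to integers or booleans (hypothetical helper).

    Assumes every float is a fraction on a 0-1 scale.
    """
    converted = {}
    for name, value in raw.items():
        if isinstance(value, bool):
            # Check bool first: bool is a subclass of int in Python
            converted[name] = value
        elif isinstance(value, int):
            converted[name] = value
        elif isinstance(value, float):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is not on a 0-1 scale")
            # round() rather than int(): int(0.29 * 100) truncates to 28
            converted[name] = round(value * 100)
        else:
            raise TypeError(f"{name} has unsupported type {type(value).__name__}")
    return converted


print(to_helicone_scores({"accuracy": 0.92, "hallucination": 0.08, "helpfulness": 85, "is_safe": True}))
# {'accuracy': 92, 'hallucination': 8, 'helpfulness': 85, 'is_safe': True}
```

Using `round()` instead of `int()` matters because floating-point products such as `0.29 * 100` fall just below the intended integer and would otherwise be truncated.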

Use Cases

Evaluate retrieval-augmented generation for accuracy and hallucination:

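import os

import requests
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from datasets import Dataset

# Read the Helicone API key from the environment (the variable name is an assumption)
HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]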

# Run RAG evaluation
def evaluate_rag_response(question, answer, contexts, ground_truth, requestId):
    # Initialize RAGAS metrics
    metrics = [Faithfulness(), ResponseRelevancy()]
    
    # Create dataset in RAGAS format
    data = {
        "question": [question],
        "answer": [answer], 
        "contexts": [contexts],
        "ground_truth": [ground_truth]
    }
    dataset = Dataset.from_dict(data)
    
    # Run evaluation
    result = evaluate(dataset, metrics=metrics)
    
    # Extract scores (RAGAS returns 0-1 values)
    faithfulness_score = result['faithfulness'] if 'faithfulness' in result else 0
    relevancy_score = result['answer_relevancy'] if 'answer_relevancy' in result else 0
    
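    # Report to Helicone (convert to a 0-100 integer scale; round() avoids
    # truncation such as int(0.29 * 100) == 28)
    response = requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "scores": {
                "faithfulness": int(round(float(faithfulness_score) * 100)),
                "answer_relevancy": int(round(float(relevancy_score) * 100))
            }
        },
        timeout=10
    )
    # Fail loudly if Helicone rejected the scores
    response.raise_for_status()

    return result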

# Example usage
scores = evaluate_rag_response(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["France is a country in Europe. Paris is its capital."],
    ground_truth="Paris",
    requestId="your-request-id-here"
)

Related Features

Experiments

Compare different configurations with consistent scoring

Sessions

Evaluate multi-turn conversations and workflows

Custom Properties

Tag requests for segmented evaluation analysis

Webhooks

Trigger evaluations automatically when requests complete
