When running evaluation frameworks to measure model performance, you need visibility into how well your AI applications are performing across different metrics. Scores let you report evaluation results from any framework to Helicone, providing centralized observability for accuracy, hallucination rates, helpfulness, and custom metrics.
Helicone doesn’t run evaluations for you - we’re not an evaluation framework. Instead, we provide a centralized location to report and analyze evaluation results from any framework (like RAGAS, LangSmith, or custom evaluations), giving you unified observability across all your evaluation metrics.
Run your evaluation
Use your evaluation framework or custom logic to assess model responses and generate scores (integers or booleans) for metrics like accuracy, helpfulness, or safety.
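As a rough sketch, an evaluator might compare a response to a reference answer and emit one integer and one boolean score. The token-overlap metric below is purely illustrative and not part of any particular framework:

```typescript
// Illustrative evaluator: score one model response against a reference answer.
// The token-overlap metric is a stand-in for whatever your framework computes.
type Scores = Record<string, number | boolean>;

function evaluateResponse(response: string, reference: string): Scores {
  const responseTokens = new Set(response.toLowerCase().split(/\s+/));
  const referenceTokens = reference.toLowerCase().split(/\s+/);
  const overlap =
    referenceTokens.filter((t) => responseTokens.has(t)).length /
    referenceTokens.length;

  return {
    accuracy: Math.round(overlap * 100),               // integer score, 0-100
    exact_match: response.trim() === reference.trim(), // boolean score
  };
}
```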
Report scores to Helicone
Send evaluation results using the Helicone API:
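A minimal sketch of the call, assuming the scores endpoint is `POST /v1/request/{requestId}/score` and your key is in a `HELICONE_API_KEY` environment variable (check the API reference for the exact route and payload). The request ID is the one Helicone assigned when the request was logged, returned in the `helicone-id` response header:

```typescript
// Minimal sketch: report evaluation scores for a previously logged request.
// The endpoint path and auth scheme are assumptions - confirm them in the API reference.
async function reportScores(
  requestId: string,
  scores: Record<string, number | boolean>
): Promise<void> {
  const res = await fetch(
    `https://api.helicone.ai/v1/request/${requestId}/score`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.HELICONE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ scores }),
    }
  );
  if (!res.ok) {
    throw new Error(`Failed to report scores: ${res.status}`);
  }
}

// Example: report the evaluation results for one request.
await reportScores("request-id-from-helicone", { accuracy: 92, exact_match: true });
```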
Alternative: Add scores via dashboard
You can also add scores directly in the Helicone dashboard on the request details page. This is useful for manual evaluation or quick testing.
View score analytics
Analyze evaluation results in the Helicone dashboard to track performance trends, compare experiments, and identify areas for improvement.
Scores are processed with a 10-minute delay by default to allow for analytics aggregation.
The scores API expects this exact format:
| Field | Type | Description | Required | Example |
|---|---|---|---|---|
| `scores` | object | Key-value pairs of evaluation metrics | ✅ Yes | `{"accuracy": 92}` |
| Type | Description | Example |
|---|---|---|
| integer | Numeric scores (no decimals) | `92`, `85`, `0` |
| boolean | Pass/fail or true/false metrics | `true`, `false` |
Float values like `0.92` are rejected. Convert floats to integers before sending, for example `0.92` → `92`.
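If your framework emits floats (most 0-to-1 metrics do), scale and round them before reporting. Scaling to 0-100 as shown below is a convention, not a requirement:

```typescript
// Convert float metrics (typically 0-1) into the integer format the scores API accepts.
const rawScores = { faithfulness: 0.92, answer_relevancy: 0.85 };

const scores = Object.fromEntries(
  Object.entries(rawScores).map(([metric, value]) => [metric, Math.round(value * 100)])
);
// => { faithfulness: 92, answer_relevancy: 85 }
```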
Evaluate retrieval-augmented generation for accuracy and hallucination:
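A rough sketch of what this could look like, using a naive grounding check in place of a real faithfulness metric; the metric names are illustrative, and the result can be sent with the `reportScores` helper sketched above:

```typescript
// Illustrative RAG evaluation: estimate how well the answer is grounded in the
// retrieved context, then shape the result into Helicone score fields.
function evaluateRagResponse(answer: string, retrievedContext: string[]) {
  const context = retrievedContext.join(" ").toLowerCase();
  const sentences = answer.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);

  // Fraction of answer sentences with at least one long token found in the context.
  const groundedCount = sentences.filter((s) =>
    s.toLowerCase().split(/\s+/).some((t) => t.length > 4 && context.includes(t))
  ).length;
  const grounded = sentences.length > 0 ? groundedCount / sentences.length : 0;

  return {
    faithfulness: Math.round(grounded * 100), // integer: higher means better grounded
    hallucination: grounded < 0.5,            // boolean: flag weakly grounded answers
  };
}
```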
Evaluate code generation for correctness, style, and functionality:
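One possible shape for this, with stand-in checks where a real compiler, linter, or test runner would go:

```typescript
// Illustrative code-generation evaluation. The style and test checks are
// stand-ins - run your actual compiler, linter, and test suite here.
function evaluateGeneratedCode(
  code: string,
  testResults: { passed: number; total: number }
) {
  const hasTodos = /\bTODO\b/.test(code); // crude style check
  const passRate = testResults.total > 0 ? testResults.passed / testResults.total : 0;

  return {
    tests_passed: testResults.passed,           // integer
    test_pass_rate: Math.round(passRate * 100), // integer, 0-100
    functional: passRate === 1,                 // boolean: all tests pass
    style_clean: !hasTodos,                     // boolean
  };
}
```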
Evaluate model outputs for helpfulness, safety, and alignment:
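A sketch of an LLM-as-judge flow; `callJudgeModel` is a stub you would replace with a real call to your grader model or rubric:

```typescript
// Illustrative LLM-as-judge evaluation. Replace callJudgeModel with a real
// grader (another model call, a rubric, human review, etc.).
async function callJudgeModel(
  prompt: string,
  output: string
): Promise<{ helpfulness: number; safe: boolean; aligned: boolean }> {
  // Stub verdict; in practice, prompt a grader model and parse its response.
  return { helpfulness: 0.8, safe: true, aligned: true };
}

async function evaluateOutput(prompt: string, output: string) {
  const verdict = await callJudgeModel(prompt, output);
  return {
    helpfulness: Math.round(verdict.helpfulness * 100), // integer, 0-100
    safety_pass: verdict.safe,                          // boolean
    alignment_pass: verdict.aligned,                    // boolean
  };
}
```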
Compare different configurations with consistent scoring
Evaluate multi-turn conversations and workflows
Tag requests for segmented evaluation analysis
Trigger evaluations automatically when requests complete