# Eval Scores

When running evaluation frameworks to measure model performance, you need visibility into how well your AI applications are performing across different metrics. Scores let you report evaluation results from any framework to Helicone, providing centralized observability for accuracy, hallucination rates, helpfulness, and custom metrics.

Helicone is not an evaluation framework and doesn't run evaluations for you. Instead, we provide a centralized location to report and analyze evaluation results from any framework (like RAGAS, LangSmith, or custom evaluations), giving you unified observability across all your evaluation metrics.

## Why use Scores

* **Centralize evaluation results**: Report scores from any evaluation framework for unified monitoring and analysis
* **Track model performance over time**: Visualize how accuracy, hallucination rates, and other metrics evolve
* **Compare experiments side-by-side**: Evaluate different prompts, models, or configurations with consistent metrics

## Quick Start

Use your evaluation framework or custom logic to assess model responses and generate scores (integers or booleans) for metrics like accuracy, helpfulness, or safety.
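
As a rough sketch of this step, the helper below normalizes raw metric values into integers and booleans before they are reported. The function name `to_integer_scores` and the choice to scale values between 0 and 1 to a 0-100 range are illustrative assumptions, not part of the Helicone API.

```python
from typing import Dict, Union

RawScore = Union[bool, int, float]
Score = Union[bool, int]


def to_integer_scores(raw: Dict[str, RawScore]) -> Dict[str, Score]:
    """Convert raw metric values into integers or booleans.

    Hypothetical helper: floats in [0, 1] are treated as fractions and
    scaled to 0-100; any other float is rounded to an integer.
    """
    converted: Dict[str, Score] = {}
    for name, value in raw.items():
        if isinstance(value, bool):
            # bool is a subclass of int, so check it first and keep it as-is
            converted[name] = value
        elif isinstance(value, int):
            converted[name] = value
        elif isinstance(value, float):
            in_unit_range = 0.0 <= value <= 1.0
            converted[name] = round(value * 100) if in_unit_range else round(value)
        else:
            raise TypeError(
                f"Unsupported score type for {name!r}: {type(value).__name__}"
            )
    return converted


raw_scores = {"accuracy": 0.92, "hallucination": 0.08, "helpfulness": 85, "is_safe": True}
print(to_integer_scores(raw_scores))
```

Note that a float of exactly `1.0` is ambiguous (it could mean 1 or 100%), so if your framework mixes scales it may be safer to convert each metric explicitly.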

Send evaluation results using the Helicone API:

```typescript
// Get the request ID from response headers
const requestId = response.headers.get("helicone-id");

// Report evaluation scores
await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${HELICONE_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    scores: {
      "accuracy": 92,      // Integer values required
      "hallucination": 8,  // Converted to an integer (0.08 * 100)
      "helpfulness": 85,
      "is_safe": true      // Booleans supported
    }
  })
});
```

You can also add scores directly in the Helicone dashboard on the request details page. This is useful for manual evaluation or quick testing.

Analyze evaluation results in the Helicone dashboard to track performance trends, compare experiments, and identify areas for improvement.

Scores are processed with a **10-minute delay** by default for analytics aggregation.

## API Format

### Request Structure

The scores API expects this exact format:

| Field    | Type     | Description                           | Required | Example            |
| -------- | -------- | ------------------------------------- | -------- | ------------------ |
| `scores` | `object` | Key-value pairs of evaluation metrics | ✅ Yes    | `{"accuracy": 92}` |

### Score Values

| Type      | Description                     | Example         |
| --------- | ------------------------------- | --------------- |
| `integer` | Numeric scores (no decimals)    | `92`, `85`, `0` |
| `boolean` | Pass/fail or true/false metrics | `true`, `false` |

Float values like `0.92` are rejected.
Convert to integers: `0.92` → `92`

## Use Cases

Evaluate retrieval-augmented generation for accuracy and hallucination:

```python
import os

import requests
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy

# Read the API key from the environment
HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

# Run RAG evaluation
def evaluate_rag_response(question, answer, contexts, ground_truth, requestId):
    # Initialize RAGAS metrics
    metrics = [Faithfulness(), ResponseRelevancy()]

    # Create dataset in RAGAS format
    data = {
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],
        "ground_truth": [ground_truth]
    }
    dataset = Dataset.from_dict(data)

    # Run evaluation
    result = evaluate(dataset, metrics=metrics)

    # Extract scores (RAGAS returns 0-1 values)
    faithfulness_score = result['faithfulness'] if 'faithfulness' in result else 0
    relevancy_score = result['answer_relevancy'] if 'answer_relevancy' in result else 0

    # Report to Helicone (convert to 0-100 scale; round() avoids the
    # truncation that int() would cause, e.g. int(0.29 * 100) == 28)
    response = requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "scores": {
                "faithfulness": round(faithfulness_score * 100),
                "answer_relevancy": round(relevancy_score * 100)
            }
        }
    )
    # Surface failed score submissions instead of ignoring them
    response.raise_for_status()

    return result

# Example usage
result = evaluate_rag_response(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    contexts=["France is a country in Europe. Paris is its capital."],
    ground_truth="Paris",
    requestId="your-request-id-here"
)
```

```typescript
// RAG evaluation with custom metrics
async function evaluateRAGResponse(
  question: string,
  answer: string,
  contexts: string[],
  requestId: string
) {
  // Custom evaluation logic (placeholders for your own functions,
  // assumed here to return values between 0 and 1)
  const scores = {
    relevance: calculateRelevance(answer, question),
    groundedness: checkGroundedness(answer, contexts),
    completeness: measureCompleteness(answer, question),
    hallucination: detectHallucination(answer, contexts)
  };

  // Convert 0-1 values to integers, since float values are rejected
  const integerScores = Object.fromEntries(
    Object.entries(scores).map(([name, value]) => [name, Math.round(value * 100)])
  );

  // Report to Helicone
  await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${HELICONE_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ scores: integerScores })
  });

  // Alert on poor performance
  if (scores.hallucination > 0.2) {
    console.warn("High hallucination detected:", scores);
  }

  return scores;
}
```

Evaluate code generation for correctness, style, and functionality:

```typescript
// Evaluate generated code quality
async function evaluateCodeGeneration(
  prompt: string,
  generatedCode: string,
  requestId: string
) {
  // The helper functions below are placeholders for your own tooling
  const scores = {
    // Syntax validity
    syntax_valid: await validateSyntax(generatedCode) ? 1 : 0,

    // Test pass rate (0-1 fraction)
    test_pass_rate: await runTests(generatedCode),

    // Code quality metrics
    complexity: calculateCyclomaticComplexity(generatedCode),
    readability: assessReadability(generatedCode),

    // Security checks
    security_score: await runSecurityScan(generatedCode),

    // Performance benchmarks
    performance: await benchmarkCode(generatedCode)
  };

  // Report comprehensive evaluation
  await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${HELICONE_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      scores: {
        // Round every metric so no decimal values are sent
        // (assumes these metrics are already on an integer-friendly scale)
        ...Object.fromEntries(
          Object.entries(scores).map(([name, value]) => [name, Math.round(value)])
        ),
        // The test pass rate is a 0-1 fraction: convert it to 0-100
        test_pass_rate: Math.round(scores.test_pass_rate * 100)
      }
    })
  });

  return scores;
}
```

Evaluate model outputs for helpfulness, safety, and alignment:

```python
import os

import requests

# Read the API key from the environment
HELICONE_API_KEY = os.environ["HELICONE_API_KEY"]

# Multi-dimensional evaluation for chatbots
async def evaluate_chat_response(user_query, assistant_response, requestId):
    # Use LLM as judge for subjective metrics
    eval_prompt = f"""
    Rate the following assistant response on these criteria (0-1):
    - Helpfulness: How well does it address the user's question?
    - Safety: Is the response safe and appropriate?
    - Accuracy: Is the information correct?
    - Clarity: Is the response clear and well-structured?

    User: {user_query}
    Assistant: {assistant_response}
    """

    # Get evaluation from judge model
    # (llm_judge is a placeholder for your own judge call)
    eval_scores = await llm_judge(eval_prompt)

    # Add objective metrics
    # (calculate_reading_level is a placeholder for your own implementation)
    scores = {
        **eval_scores,
        "response_length": len(assistant_response),
        "reading_level": calculate_reading_level(assistant_response),
        "contains_refusal": "I cannot" in assistant_response or "I won't" in assistant_response
    }

    # Report all scores (convert decimals to integers)
    def to_integer(value):
        if isinstance(value, float):
            # 0-1 ratings become 0-100; any other float is rounded
            return round(value * 100) if 0 <= value <= 1 else round(value)
        return value  # Integers and booleans pass through unchanged

    integer_scores = {key: to_integer(value) for key, value in scores.items()}

    response = requests.post(
        f"https://api.helicone.ai/v1/request/{requestId}/score",
        headers={
            "Authorization": f"Bearer {HELICONE_API_KEY}",
            "Content-Type": "application/json"
        },
        json={"scores": integer_scores}
    )

    return scores
```

## Related Features

* Compare different configurations with consistent scoring
* Evaluate multi-turn conversations and workflows
* Tag requests for segmented evaluation analysis
* Trigger evaluations automatically when requests complete