> ## Documentation Index
> Fetch the complete documentation index at: https://docs.helicone.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Eval Scores

When running evaluation frameworks to measure model performance, you need visibility into how well your AI applications are performing across different metrics. Scores let you report evaluation results from any framework to Helicone, providing centralized observability for accuracy, hallucination rates, helpfulness, and custom metrics.

<Note>
  Helicone doesn't run evaluations for you - we're not an evaluation framework. Instead, we provide a centralized location to report and analyze evaluation results from any framework (like RAGAS, LangSmith, or custom evaluations), giving you unified observability across all your evaluation metrics.
</Note>

## Why use Scores

* **Centralize evaluation results**: Report scores from any evaluation framework for unified monitoring and analysis
* **Track model performance over time**: Visualize how accuracy, hallucination rates, and other metrics evolve
* **Compare experiments side-by-side**: Evaluate different prompts, models, or configurations with consistent metrics

## Quick Start

<Steps>
  <Step title="Run your evaluation">
    Use your evaluation framework or custom logic to assess model responses and generate scores (integers or booleans) for metrics like accuracy, helpfulness, or safety.
  </Step>

  <Step title="Report scores to Helicone">
    Send evaluation results using the Helicone API:

    ```typescript theme={null}
    // Get the request ID from response headers
    const requestId = response.headers.get("helicone-id");

    // Report evaluation scores
    await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${HELICONE_API_KEY}`,
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        scores: {
          "accuracy": 92,        // Integer values required
          "hallucination": 8,    // Converted to integers (0.08 * 100)
          "helpfulness": 85,
          "is_safe": true        // Booleans supported
        }
      })
    });
    ```

    <Accordion title="Alternative: Add scores via dashboard">
      You can also add scores directly in the Helicone dashboard on the request details page. This is useful for manual evaluation or quick testing.
    </Accordion>
  </Step>

  <Step title="View score analytics">
    Analyze evaluation results in the Helicone dashboard to track performance trends, compare experiments, and identify areas for improvement.
  </Step>
</Steps>

<Warning>
  Scores are processed with a **10 minute delay** by default for analytics aggregation.
</Warning>

## API Format

### Request Structure

The scores API expects this exact format:

| Field    | Type     | Description                           | Required | Example            |
| -------- | -------- | ------------------------------------- | -------- | ------------------ |
| `scores` | `object` | Key-value pairs of evaluation metrics | ✅ Yes    | `{"accuracy": 92}` |

### Score Values

| Type      | Description                     | Example         |
| --------- | ------------------------------- | --------------- |
| `integer` | Numeric scores (no decimals)    | `92`, `85`, `0` |
| `boolean` | Pass/fail or true/false metrics | `true`, `false` |

<Warning>
  Float values like `0.92` are rejected. Convert to integers: `0.92` → `92`
</Warning>

## Use Cases

<Tabs>
  <Tab title="RAG Accuracy Evaluation">
    Evaluate retrieval-augmented generation for accuracy and hallucination:

    <CodeGroup>
      ```python Python theme={null}
      import requests
      from ragas import evaluate
      from ragas.metrics import Faithfulness, ResponseRelevancy
      from datasets import Dataset

      # Run RAG evaluation
      def evaluate_rag_response(question, answer, contexts, ground_truth, requestId):
          # Initialize RAGAS metrics
          metrics = [Faithfulness(), ResponseRelevancy()]
          
          # Create dataset in RAGAS format
          data = {
              "question": [question],
              "answer": [answer], 
              "contexts": [contexts],
              "ground_truth": [ground_truth]
          }
          dataset = Dataset.from_dict(data)
          
          # Run evaluation
          result = evaluate(dataset, metrics=metrics)
          
          # Extract scores (RAGAS returns 0-1 values)
          faithfulness_score = result['faithfulness'] if 'faithfulness' in result else 0
          relevancy_score = result['answer_relevancy'] if 'answer_relevancy' in result else 0
          
          # Report to Helicone (convert to 0-100 scale)
          response = requests.post(
              f"https://api.helicone.ai/v1/request/{requestId}/score",
              headers={
                  "Authorization": f"Bearer {HELICONE_API_KEY}",
                  "Content-Type": "application/json"
              },
              json={
                  "scores": {
                      "faithfulness": int(faithfulness_score * 100),
                      "answer_relevancy": int(relevancy_score * 100)
                  }
              }
          )
          
          return result

      # Example usage
      scores = evaluate_rag_response(
          question="What is the capital of France?",
          answer="The capital of France is Paris.",
          contexts=["France is a country in Europe. Paris is its capital."],
          ground_truth="Paris",
          requestId="your-request-id-here"
      )
      ```

      ```typescript TypeScript theme={null}
      // RAG evaluation with custom metrics
      async function evaluateRAGResponse(
        question: string,
        answer: string,
        contexts: string[],
        requestId: string
      ) {
        // Custom evaluation logic
        const scores = {
          relevance: calculateRelevance(answer, question),
          groundedness: checkGroundedness(answer, contexts),
          completeness: measureCompleteness(answer, question),
          hallucination: detectHallucination(answer, contexts)
        };
        
        // Report to Helicone
        await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
          method: "POST",
          headers: {
            "Authorization": `Bearer ${HELICONE_API_KEY}`,
            "Content-Type": "application/json"
          },
          body: JSON.stringify({ scores })
        });
        
        // Alert on poor performance
        if (scores.hallucination > 0.2) {
          console.warn("High hallucination detected:", scores);
        }
        
        return scores;
      }
      ```
    </CodeGroup>
  </Tab>

  <Tab title="Code Generation Quality">
    Evaluate code generation for correctness, style, and functionality:

    ```typescript theme={null}
    // Evaluate generated code quality
    async function evaluateCodeGeneration(
      prompt: string,
      generatedCode: string,
      requestId: string
    ) {
      const scores = {
        // Syntax validity
        syntax_valid: await validateSyntax(generatedCode) ? 1.0 : 0.0,
        
        // Test pass rate
        test_pass_rate: await runTests(generatedCode),
        
        // Code quality metrics
        complexity: calculateCyclomaticComplexity(generatedCode),
        readability: assessReadability(generatedCode),
        
        // Security checks
        security_score: await runSecurityScan(generatedCode),
        
        // Performance benchmarks
        performance: await benchmarkCode(generatedCode)
      };
      
      // Report comprehensive evaluation
      await fetch(`https://api.helicone.ai/v1/request/${requestId}/score`, {
        method: "POST",
        headers: {
          "Authorization": `Bearer ${HELICONE_API_KEY}`,
          "Content-Type": "application/json"
        },
        body: JSON.stringify({
          scores: {
            ...scores,
            // Convert any decimal scores to integers
            test_pass_rate: Math.round(scores.test_pass_rate * 100)
          }
        })
      });
      
      return scores;
    }
    ```
  </Tab>

  <Tab title="Helpfulness & Safety">
    Evaluate model outputs for helpfulness, safety, and alignment:

    ```python theme={null}
    # Multi-dimensional evaluation for chatbots
    async def evaluate_chat_response(user_query, assistant_response, requestId):
        # Use LLM as judge for subjective metrics
        eval_prompt = f"""
        Rate the following assistant response on these criteria (0-1):
        - Helpfulness: How well does it address the user's question?
        - Safety: Is the response safe and appropriate?
        - Accuracy: Is the information correct?
        - Clarity: Is the response clear and well-structured?
        
        User: {user_query}
        Assistant: {assistant_response}
        """
        
        # Get evaluation from judge model
        eval_scores = await llm_judge(eval_prompt)
        
        # Add objective metrics
        scores = {
            **eval_scores,
            "response_length": len(assistant_response),
            "reading_level": calculate_reading_level(assistant_response),
            "contains_refusal": "I cannot" in assistant_response or "I won't" in assistant_response
        }
        
        # Report all scores (convert decimals to integers)
        integer_scores = {
            key: int(value * 100) if isinstance(value, float) and 0 <= value <= 1 else value
            for key, value in scores.items()
        }
        
        response = requests.post(
            f"https://api.helicone.ai/v1/request/{requestId}/score",
            headers={
                "Authorization": f"Bearer {HELICONE_API_KEY}",
                "Content-Type": "application/json"
            },
            json={"scores": integer_scores}
        )
        
        return scores
    ```
  </Tab>
</Tabs>

## Related Features

<CardGroup cols={2}>
  <Card title="Experiments" icon="flask" href="/features/experiments">
    Compare different configurations with consistent scoring
  </Card>

  <Card title="Sessions" icon="link" href="/features/sessions">
    Evaluate multi-turn conversations and workflows
  </Card>

  <Card title="Custom Properties" icon="tag" href="/features/advanced-usage/custom-properties">
    Tag requests for segmented evaluation analysis
  </Card>

  <Card title="Webhooks" icon="webhook" href="/features/webhooks">
    Trigger evaluations automatically when requests complete
  </Card>
</CardGroup>

<Snippet file="questions-section.mdx" />
