When building AI applications that need domain-specific knowledge or consistent formatting, generic models often fall short. Fine-tuning lets you train models on your curated datasets from Helicone, creating specialized AI that understands your unique requirements and consistently delivers the exact outputs your application needs.

Why use Fine-tuning

  • Create domain experts: Train models on your specific data to excel at specialized tasks like legal analysis or medical coding
  • Ensure consistent outputs: Get reliable formatting and structure that matches your exact specifications every time
  • Reduce costs and latency: Smaller fine-tuned models often outperform larger generic models at specific tasks
[Image: Helicone's dataset curation interface for preparing fine-tuning data, showing request filtering and evaluation]

Quick Start

1. Curate your dataset

Filter and select high-quality request-response pairs from your Helicone logs:
// Navigate to Datasets in Helicone dashboard
// Apply filters to find ideal training examples:
// - Status: 200 (successful requests only)
// - Model: gpt-4o-mini (or your preferred base model)
// - Custom properties for specific use cases
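The same filtering logic can be sketched programmatically over exported request logs. The record fields below are illustrative, not a fixed Helicone export schema:

```python
# Select successful gpt-4o-mini requests as fine-tuning candidates.
# Field names here are illustrative, not an official export schema.
logs = [
    {"status": 200, "model": "gpt-4o-mini", "properties": {"UseCase": "support"}},
    {"status": 500, "model": "gpt-4o-mini", "properties": {}},
    {"status": 200, "model": "gpt-4", "properties": {"UseCase": "support"}},
]

candidates = [
    r for r in logs
    if r["status"] == 200                                  # successful requests only
    and r["model"] == "gpt-4o-mini"                        # preferred base model
    and r["properties"].get("UseCase") == "support"        # custom property filter
]

print(len(candidates))  # 1
```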
2. Export for fine-tuning

Export your curated dataset in the format your fine-tuning platform requires:
# Export as JSONL for OpenAI
helicone datasets export --format openai --id dataset-123

# Export for other platforms
helicone datasets export --format anthropic --id dataset-123
helicone datasets export --format cohere --id dataset-123
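OpenAI's fine-tuning format expects one JSON object per line, each with a "messages" array of chat turns. A quick structural sanity check before uploading might look like this (the filename and helper are just examples):

```python
import json

# Minimal structural check for an OpenAI-style chat fine-tuning JSONL file:
# every line must be a JSON object with a non-empty "messages" list, and
# each message needs "role" and "content" keys.
def validate_jsonl(path):
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: invalid JSON")
                continue
            messages = example.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {i}: missing or empty 'messages' list")
                continue
            for m in messages:
                if "role" not in m or "content" not in m:
                    errors.append(f"line {i}: message missing role/content")
    return errors

# errors = validate_jsonl("helicone-dataset.jsonl")
```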
3. Create fine-tuning job

Use your exported dataset with your chosen platform:
# OpenAI fine-tuning example (openai-python >= 1.0)
from openai import OpenAI

client = OpenAI()

# Upload training file
file = client.files.create(
  file=open("helicone-dataset.jsonl", "rb"),
  purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
  training_file=file.id,
  model="gpt-4o-mini"
)

Configuration Options

Basic Settings

Core options for dataset curation and export:
| Setting         | Type    | Description                          | Default  | Example     |
|-----------------|---------|--------------------------------------|----------|-------------|
| minScore        | number  | Minimum score for including requests | 0        | 0.8         |
| maxTokens       | number  | Maximum tokens per example           | 4096     |             |
| format          | string  | Export format for platform           | "openai" | "anthropic" |
| includeMetadata | boolean | Include Helicone metadata            | false    | true        |

Advanced Settings

| Setting         | Type    | Description                   | Default | Example                            |
|-----------------|---------|-------------------------------|---------|------------------------------------|
| validationSplit | number  | Percentage for validation set | 0.2     | 0.15                               |
| deduplication   | boolean | Remove duplicate examples     | true    | false                              |
| sampling        | object  | Sampling configuration        | {}      | {"method": "random", "size": 1000} |
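For instance, a validationSplit of 0.2 holds out 20% of examples for evaluation. A minimal sketch of applying such a split locally (the helper below is hypothetical, not a Helicone API):

```python
import random

# Hypothetical helper: hold out `split` fraction of examples for validation.
def train_val_split(examples, split=0.2, seed=42):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # deterministic, non-destructive shuffle
    n_val = int(len(shuffled) * split)
    return shuffled[n_val:], shuffled[:n_val]

train, val = train_val_split(list(range(1000)), split=0.2)
print(len(train), len(val))  # 800 200
```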

Detailed Explanations

Use Cases

Fine-tune a model on your best support conversations for consistent, brand-aligned responses:
// 1. Tag high-quality support conversations
const response = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: supportConversation
  },
  {
    headers: {
      "Helicone-Property-Quality": "excellent",
      "Helicone-Property-UseCase": "support",
      "Helicone-Property-Resolved": "true"
    }
  }
);

// 2. Create dataset from tagged conversations
const dataset = await helicone.datasets.create({
  name: "support-agent-training",
  filters: {
    properties: {
      Quality: "excellent",
      UseCase: "support",
      Resolved: "true"
    },
    minScore: 0.9
  }
});

// 3. Fine-tune on curated examples
const fineTunedModel = await trainModel(dataset);

Understanding Fine-tuning

When to Use Fine-tuning vs RAG

Fine-tuning and RAG (Retrieval-Augmented Generation) solve different problems.

What fine-tuning is best for:
  • Teaching consistent behavior and output formats
  • Improving performance on specific tasks
  • Reducing latency and costs with smaller models
  • Encoding domain knowledge into the model
What RAG is best for:
  • Working with frequently changing information
  • Handling large knowledge bases
  • Maintaining source attribution
  • Quick updates without retraining
// ✅ Good fine-tuning use case
// Teaching a model to always format API responses consistently
const apiResponseExample = {
  input: "Get user data for ID 123",
  output: {
    status: "success",
    data: { id: 123, name: "John" },
    metadata: { timestamp: "2024-01-01T00:00:00Z" }
  }
};

// ❌ Poor fine-tuning use case  
// Trying to teach current product inventory (use RAG instead)
const inventoryExample = {
  input: "What's in stock?",
  output: "SKU-123: 45 units, SKU-456: 0 units" // This changes daily!
};

Dataset Quality Guidelines

The quality of your fine-tuning dataset determines model performance.

Key principles:
  • Diversity: Include varied examples covering edge cases
  • Consistency: Ensure similar inputs have similar outputs
  • Quality: Only include high-quality, verified examples
  • Quantity: Start with 50-100 examples minimum
Quality indicators in Helicone:
// High-quality dataset criteria
const qualityDataset = {
  filters: {
    // Performance indicators
    responseTime: { max: 3000 },
    tokenCount: { min: 50, max: 2000 },
    
    // Quality signals
    scores: { 
      userFeedback: { min: 4 },
      accuracy: { min: 0.9 }
    },
    
    // Consistency checks
    status: "success",
    errorRate: { max: 0.01 }
  }
};
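Exact duplicates dilute training signal, which is why the deduplication export setting defaults to true. A minimal local pass with the same effect keys each example on a canonical encoding of its messages:

```python
import json

# Drop exact-duplicate examples, keyed on a canonical JSON encoding
# of the messages (sort_keys makes key order irrelevant).
def deduplicate(examples):
    seen = set()
    unique = []
    for ex in examples:
        key = json.dumps(ex["messages"], sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

examples = [
    {"messages": [{"role": "user", "content": "refund policy?"}]},
    {"messages": [{"role": "user", "content": "refund policy?"}]},  # exact duplicate
    {"messages": [{"role": "user", "content": "shipping times?"}]},
]
print(len(deduplicate(examples)))  # 2
```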

Fine-tuning Workflow Best Practices

Data preparation checklist:
  • Review examples for quality and consistency
  • Remove PII and sensitive information
  • Balance dataset across different use cases
  • Validate format matches platform requirements
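For the PII step in the checklist above, a minimal redaction pass over message content might look like this. The patterns only cover emails and US-style phone numbers; real PII scrubbing needs a much broader approach (names, addresses, account IDs, etc.):

```python
import re

# Very minimal PII redaction: masks emails and US-style phone numbers only.
# Real datasets need broader coverage (names, addresses, IDs, etc.).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text):
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane@example.com or 555-123-4567"))
# Contact [EMAIL] or [PHONE]
```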
Monitoring fine-tuned models:
// Track fine-tuned model performance
const response = await openai.chat.completions.create(
  {
    model: "ft:gpt-4o-mini:org-name:model-name:id",
    messages: [{ role: "user", content: prompt }]
  },
  {
    headers: {
      "Helicone-Property-Model-Type": "fine-tuned",
      "Helicone-Property-Base-Model": "gpt-4o-mini",
      "Helicone-Property-Training-Job": "ftjob-123"
    }
  }
);

// Compare against base model performance