Datasets & Experiments

Datasets are collections of test cases with known inputs and expected outputs. Use them to systematically evaluate your agent against a fixed set of scenarios and catch regressions before they reach production.

When to Use Datasets

  • Regression testing — Verify your agent still handles known scenarios correctly after changes
  • Model comparison — Run the same test cases against different models to compare quality and cost
  • Prompt iteration — Measure the impact of prompt changes against a consistent benchmark
  • CI/CD gates — Block deploys when evaluation scores drop below thresholds

Creating a Dataset

Create datasets in the dashboard under your project:

Dashboard → Project → Datasets → Create Dataset

Each dataset contains items with:

| Field            | Required | Description                                       |
|------------------|----------|---------------------------------------------------|
| `input`          | Yes      | The input to send to your agent                   |
| `expectedOutput` | No       | The reference output for similarity comparison    |
| `metadata`       | No       | Additional context (tags, categories, difficulty) |

Example: Customer Support Dataset

```json
[
  {
    "input": "How do I reset my password?",
    "expectedOutput": "Go to Settings > Security > Reset Password. You'll receive a confirmation email.",
    "metadata": { "category": "account", "difficulty": "easy" }
  },
  {
    "input": "I was charged twice for my subscription",
    "expectedOutput": "I'll look into the duplicate charge. Can you provide your account email and the date of the charges?",
    "metadata": { "category": "billing", "difficulty": "medium" }
  },
  {
    "input": "Compare the Pro and Team plans for a 15-person engineering team",
    "expectedOutput": null,
    "metadata": { "category": "sales", "difficulty": "hard" }
  }
]
```
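
Before uploading a file like this, it can be worth checking that every item matches the field table above. The sketch below is illustrative only: `validate_items` is a hypothetical helper, not part of any 2Signal SDK.

```python
import json

# Field names from the table above; validate_items is a hypothetical
# pre-upload check, not part of the 2Signal SDK.
REQUIRED = {"input"}
OPTIONAL = {"expectedOutput", "metadata"}

def validate_items(items):
    """Return (index, problem) pairs; an empty list means all items are valid."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED - item.keys()
        unknown = item.keys() - REQUIRED - OPTIONAL
        if missing:
            problems.append((i, f"missing required fields: {sorted(missing)}"))
        if unknown:
            problems.append((i, f"unrecognized fields: {sorted(unknown)}"))
    return problems

items = json.loads("""
[
  {"input": "How do I reset my password?",
   "expectedOutput": "Go to Settings > Security > Reset Password.",
   "metadata": {"category": "account", "difficulty": "easy"}},
  {"input": "Compare the Pro and Team plans",
   "expectedOutput": null,
   "metadata": {"category": "sales", "difficulty": "hard"}}
]
""")
print(validate_items(items))  # → []
```

Note that `expectedOutput: null` passes: the field is optional, and the third example above shows it is legitimately absent for open-ended items.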

Running Experiments

An experiment runs your agent against every item in a dataset and evaluates the results. Each experiment produces a set of scores you can compare against previous runs.

Via the Dashboard

  1. Go to Datasets → Select dataset → Run Experiment
  2. Choose which evaluators to run
  3. Review results — each item shows its scores alongside the input, output, and expected output

Via the CLI

```bash
# Run all evaluators against a dataset
2signal eval run --project my-agent --dataset golden-tests

# Run specific evaluators
2signal eval run --project my-agent --dataset golden-tests \
  --evaluators contains,llm_judge

# Fail if average score drops below threshold
2signal eval run --project my-agent --dataset golden-tests \
  --fail-below 0.85
```
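
For the CI/CD gate mentioned earlier, a wrapper script can compose this invocation and fail the pipeline on a bad run. This sketch assumes `--fail-below` causes the CLI to exit nonzero when the threshold is missed; confirm that behavior for your setup.

```python
def build_eval_command(project, dataset, fail_below=None, evaluators=None):
    """Compose the `2signal eval run` invocation shown above."""
    cmd = ["2signal", "eval", "run", "--project", project, "--dataset", dataset]
    if evaluators:
        cmd += ["--evaluators", ",".join(evaluators)]
    if fail_below is not None:
        cmd += ["--fail-below", str(fail_below)]
    return cmd

cmd = build_eval_command("my-agent", "golden-tests", fail_below=0.85)
print(" ".join(cmd))
# In CI you would run: subprocess.run(cmd, check=True)
# check=True raises CalledProcessError on a nonzero exit, failing the build
# (assumes --fail-below sets the exit code on threshold misses).
```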

Via the API

```
POST /api/v1/scores
{
  "trace_id": "trace-from-dataset-run",
  "evaluator_name": "custom_accuracy",
  "value": 0.95,
  "label": "pass",
  "reasoning": "Response matches expected output with minor formatting differences"
}
```
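
From Python, the request above can be built with the standard library. The base URL and bearer-token auth header here are assumptions (substitute whatever your 2Signal instance uses); only the path and body fields come from the example above.

```python
import json
import urllib.request

API_BASE = "https://api.2signal.example"  # assumption: your instance's base URL

def build_score_request(api_key, **score):
    """Build (but don't send) the POST /api/v1/scores request shown above."""
    return urllib.request.Request(
        f"{API_BASE}/api/v1/scores",
        data=json.dumps(score).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumption: bearer-token auth
        },
        method="POST",
    )

req = build_score_request(
    "YOUR_API_KEY",
    trace_id="trace-from-dataset-run",
    evaluator_name="custom_accuracy",
    value=0.95,
    label="pass",
    reasoning="Response matches expected output with minor formatting differences",
)
# urllib.request.urlopen(req) would submit it.
print(req.get_method(), req.full_url)
```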

Evaluators for Datasets

Some evaluators are especially useful with datasets:

| Evaluator   | Best For                         | Needs expectedOutput? |
|-------------|----------------------------------|-----------------------|
| Similarity  | Comparing against golden outputs | Yes                   |
| Contains    | Checking for required content    | No                    |
| JSON Schema | Validating output structure      | No                    |
| LLM Judge   | Semantic quality assessment      | No (but helps)        |
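
To see why a Contains-style evaluator needs no `expectedOutput`, note that it only checks the agent's output for required phrases. The function below is a sketch of that idea, not 2Signal's actual implementation.

```python
def contains_score(output, required_phrases, case_sensitive=False):
    """Fraction of required phrases found in the output (1.0 = all present)."""
    if not case_sensitive:
        output = output.lower()
        required_phrases = [p.lower() for p in required_phrases]
    hits = sum(1 for p in required_phrases if p in output)
    return hits / len(required_phrases)

print(contains_score(
    "Go to Settings > Security > Reset Password.",
    ["settings", "reset password"],
))  # → 1.0
```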

Comparing Experiments

After running multiple experiments — for example, after changing a prompt or switching models — compare the results side by side in the dashboard. Look for:

  • Score distribution changes — Did the average go up or down?
  • Per-item regressions — Which specific test cases got worse?
  • Cost/latency tradeoffs — Did you save money at the expense of quality?
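
If you export per-item scores from two experiments, spotting regressions is a simple diff. This is an illustrative sketch; the item IDs and score dicts below are made up for the example.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Items whose score dropped by more than `tolerance` between two runs."""
    drops = {}
    for item in baseline.keys() & candidate.keys():
        drop = round(baseline[item] - candidate[item], 4)
        if drop > tolerance:
            drops[item] = drop
    return sorted(drops.items())

baseline  = {"reset-password": 0.95, "double-charge": 0.80, "plan-compare": 0.70}
candidate = {"reset-password": 0.95, "double-charge": 0.60, "plan-compare": 0.75}
print(find_regressions(baseline, candidate))  # → [('double-charge', 0.2)]
```

An overall average can hide exactly this pattern: here the mean barely moves while one billing case regresses sharply.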

Best Practices

  • Start small — 20–50 test cases covering your most important scenarios
  • Include edge cases — The inputs that have caused real production failures
  • Version your datasets — As your agent evolves, your test cases should too
  • Use metadata — Tag items by category and difficulty so you can analyze results by segment
  • Run on every deploy — Integrate with CI/CD to make experiments automatic
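
The metadata tags pay off at analysis time: exported results can be grouped by any tag. A minimal sketch, assuming each exported result carries a `score` and its item's `metadata` (the exact export shape may differ in your setup):

```python
from collections import defaultdict

def average_by_segment(results, key="category"):
    """Average score per metadata segment, e.g. per category or difficulty."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["metadata"].get(key, "unknown")].append(r["score"])
    return {seg: sum(s) / len(s) for seg, s in buckets.items()}

results = [
    {"score": 0.9, "metadata": {"category": "account", "difficulty": "easy"}},
    {"score": 0.7, "metadata": {"category": "billing", "difficulty": "medium"}},
    {"score": 0.5, "metadata": {"category": "billing", "difficulty": "hard"}},
]
print(average_by_segment(results))  # → {'account': 0.9, 'billing': 0.6}
```

Passing `key="difficulty"` instead would surface whether hard items are dragging the average down.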
