# Datasets & Experiments
Datasets are collections of test cases with known inputs and expected outputs. Use them to systematically evaluate your agent against a fixed set of scenarios and catch regressions before they reach production.
## When to Use Datasets
- Regression testing — Verify your agent still handles known scenarios correctly after changes
- Model comparison — Run the same test cases against different models to compare quality and cost
- Prompt iteration — Measure the impact of prompt changes against a consistent benchmark
- CI/CD gates — Block deploys when evaluation scores drop below thresholds
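At its core, a CI/CD gate is just a threshold check on the experiment's scores. A minimal illustrative sketch of that logic (this is not the 2signal CLI, and the score values are made up):

```python
def passes_gate(scores: list[float], threshold: float) -> bool:
    """True when the average evaluation score meets the CI threshold."""
    return sum(scores) / len(scores) >= threshold

# Illustrative per-item scores from an experiment run
baseline = [0.9, 0.8, 0.95, 0.7]    # average 0.8375
print(passes_gate(baseline, 0.85))  # → False: below the 0.85 gate
```

In a real pipeline, a failed gate would translate into a non-zero exit code that blocks the deploy step.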
## Creating a Dataset
Create datasets in the dashboard under your project:
Dashboard → Project → Datasets → Create Dataset
Each dataset contains items with:
| Field | Required | Description |
|---|---|---|
| `input` | Yes | The input to send to your agent |
| `expectedOutput` | No | The reference output for similarity comparison |
| `metadata` | No | Additional context (tags, categories, difficulty) |
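A quick local check against these rules can catch malformed items before they reach a dataset. A sketch assuming items have been parsed from JSON into dicts (the function name is ours, not part of the product):

```python
def validate_item(item: dict) -> list[str]:
    """Return a list of problems with a dataset item (empty list = valid)."""
    problems = []
    if not item.get("input"):
        problems.append("input is required")
    if "expectedOutput" in item and not isinstance(item["expectedOutput"], (str, type(None))):
        problems.append("expectedOutput must be a string or null")
    if "metadata" in item and not isinstance(item["metadata"], dict):
        problems.append("metadata must be an object")
    return problems

print(validate_item({"input": "How do I reset my password?"}))  # → []
```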
### Example: Customer Support Dataset
```json
[
  {
    "input": "How do I reset my password?",
    "expectedOutput": "Go to Settings > Security > Reset Password. You'll receive a confirmation email.",
    "metadata": { "category": "account", "difficulty": "easy" }
  },
  {
    "input": "I was charged twice for my subscription",
    "expectedOutput": "I'll look into the duplicate charge. Can you provide your account email and the date of the charges?",
    "metadata": { "category": "billing", "difficulty": "medium" }
  },
  {
    "input": "Compare the Pro and Team plans for a 15-person engineering team",
    "expectedOutput": null,
    "metadata": { "category": "sales", "difficulty": "hard" }
  }
]
```

## Running Experiments
An experiment runs your agent against every item in a dataset and evaluates the results. Each experiment produces a set of scores you can compare against previous runs.
### Via the Dashboard
- Go to Datasets → Select dataset → Run Experiment
- Choose which evaluators to run
- Review results — each item shows its scores alongside the input, output, and expected output
### Via the CLI
```bash
# Run all evaluators against a dataset
2signal eval run --project my-agent --dataset golden-tests

# Run specific evaluators
2signal eval run --project my-agent --dataset golden-tests \
  --evaluators contains,llm_judge

# Fail if average score drops below threshold
2signal eval run --project my-agent --dataset golden-tests \
  --fail-below 0.85
```

### Via the API
`POST /api/v1/scores`

```json
{
  "trace_id": "trace-from-dataset-run",
  "evaluator_name": "custom_accuracy",
  "value": 0.95,
  "label": "pass",
  "reasoning": "Response matches expected output with minor formatting differences"
}
```

## Evaluators for Datasets
Some evaluators are especially useful with datasets:
| Evaluator | Best For | Needs `expectedOutput`? |
|---|---|---|
| Similarity | Comparing against golden outputs | Yes |
| Contains | Checking for required content | No |
| JSON Schema | Validating output structure | No |
| LLM Judge | Semantic quality assessment | No (but helps) |
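To build intuition for the first two rows, here is what a Contains-style and a crude Similarity-style check could look like locally. This is an illustrative approximation only, not the hosted evaluators' implementation (which, for similarity, would typically use embeddings rather than string matching):

```python
import difflib

def contains(output: str, required: list[str]) -> float:
    """Fraction of required phrases that appear in the output (case-insensitive)."""
    hits = sum(phrase.lower() in output.lower() for phrase in required)
    return hits / len(required)

def similarity(output: str, expected: str) -> float:
    """Rough string similarity in [0, 1] via difflib; a stand-in for semantic similarity."""
    return difflib.SequenceMatcher(None, output, expected).ratio()

print(contains("Go to Settings > Security > Reset Password.", ["reset password"]))  # → 1.0
```

Note how `contains` needs only a list of required phrases, while `similarity` cannot run without an `expectedOutput` to compare against, matching the table above.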
## Comparing Experiments
After running multiple experiments — for example, after changing a prompt or switching models — compare the results side by side in the dashboard. Look for:
- Score distribution changes — Did the average go up or down?
- Per-item regressions — Which specific test cases got worse?
- Cost/latency tradeoffs — Did you save money at the expense of quality?
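The per-item regression check can also be reproduced from exported scores. A sketch assuming each experiment's results are available as a dict mapping item id to score (this export shape is hypothetical, for illustration):

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list[str]:
    """Item ids whose score dropped by more than `tolerance` between two runs."""
    return [
        item_id
        for item_id, old_score in baseline.items()
        if item_id in candidate and candidate[item_id] < old_score - tolerance
    ]

baseline = {"item-1": 0.9, "item-2": 0.8, "item-3": 0.7}
candidate = {"item-1": 0.95, "item-2": 0.6, "item-3": 0.7}
print(find_regressions(baseline, candidate))  # → ['item-2']
```

A small `tolerance` is useful with LLM-judged scores, which can vary slightly between runs even with no real regression.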
## Best Practices
- Start small — 20–50 test cases covering your most important scenarios
- Include edge cases — The inputs that have caused real production failures
- Version your datasets — As your agent evolves, your test cases should too
- Use metadata — Tag items by category and difficulty so you can analyze results by segment
- Run on every deploy — Integrate with CI/CD to make experiments automatic
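As a concrete example of the last point, a deploy gate might look like this in GitHub Actions. Everything here except the `2signal eval run` command from this guide is a placeholder, and it assumes the CLI exits non-zero when `--fail-below` trips:

```yaml
name: eval-gate
on: [push]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install the 2signal CLI here; the distribution method is not
      # covered by this guide, so this step is left as a placeholder.
      # A non-zero exit from --fail-below fails the job and blocks the deploy.
      - run: 2signal eval run --project my-agent --dataset golden-tests --fail-below 0.85
```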