# Datasets & Experiments
Datasets are collections of test cases with known inputs and expected outputs. Use them to systematically evaluate your agent against a fixed set of scenarios and catch regressions before they reach production.
## When to Use Datasets
- Regression testing — Verify your agent still handles known scenarios correctly after changes
- Model comparison — Run the same test cases against different models to compare quality and cost
- Prompt iteration — Measure the impact of prompt changes against a consistent benchmark
- CI/CD gates — Block deploys when evaluation scores drop below thresholds
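At its core, a CI/CD gate is just a threshold check on the experiment's scores. A minimal illustrative sketch of that logic (this is not the 2signal CLI, and the score values are made up):

```python
def passes_gate(scores: list[float], threshold: float) -> bool:
    """True when the average evaluation score meets the CI threshold."""
    return sum(scores) / len(scores) >= threshold

# Illustrative per-item scores from an experiment run
baseline = [0.9, 0.8, 0.95, 0.7]    # average 0.8375
print(passes_gate(baseline, 0.85))  # → False: below the 0.85 gate
```

In a real pipeline, a failed gate would translate into a non-zero exit code that blocks the deploy step.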
## Creating a Dataset
Create datasets in the dashboard under your project:
Dashboard → Project → Datasets → Create Dataset
Each dataset contains items with:
| Field | Required | Description |
|---|---|---|
| `input` | Yes | The input to send to your agent |
| `expectedOutput` | No | The reference output for similarity comparison |
| `metadata` | No | Additional context (tags, categories, difficulty) |
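A quick local check against these rules can catch malformed items before they reach a dataset. A sketch assuming items have been parsed from JSON into dicts (the function name is ours, not part of the product):

```python
def validate_item(item: dict) -> list[str]:
    """Return a list of problems with a dataset item (empty list = valid)."""
    problems = []
    if not item.get("input"):
        problems.append("input is required")
    if "expectedOutput" in item and not isinstance(item["expectedOutput"], (str, type(None))):
        problems.append("expectedOutput must be a string or null")
    if "metadata" in item and not isinstance(item["metadata"], dict):
        problems.append("metadata must be an object")
    return problems

print(validate_item({"input": "How do I reset my password?"}))  # → []
```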
### Example: Customer Support Dataset
```json
[
  {
    "input": "How do I reset my password?",
    "expectedOutput": "Go to Settings > Security > Reset Password. You'll receive a confirmation email.",
    "metadata": { "category": "account", "difficulty": "easy" }
  },
  {
    "input": "I was charged twice for my subscription",
    "expectedOutput": "I'll look into the duplicate charge. Can you provide your account email and the date of the charges?",
    "metadata": { "category": "billing", "difficulty": "medium" }
  },
  {
    "input": "Compare the Pro and Team plans for a 15-person engineering team",
    "expectedOutput": null,
    "metadata": { "category": "sales", "difficulty": "hard" }
  }
]
```

## Running Experiments
An experiment runs your agent against every item in a dataset and evaluates the results. Each experiment produces a set of scores you can compare against previous runs.
### Via the Dashboard
- Go to Datasets → Select dataset → Run Experiment
- Choose which evaluators to run
- Review results — each item shows its scores alongside the input, output, and expected output
### Via the CLI
```bash
# Run all evaluators against a dataset
2signal eval run --project my-agent --dataset golden-tests

# Run specific evaluators
2signal eval run --project my-agent --dataset golden-tests \
  --evaluators contains,llm_judge

# Fail if average score drops below threshold
2signal eval run --project my-agent --dataset golden-tests \
  --fail-below 0.85
```

### Via the API
`POST /api/v1/scores`

```json
{
  "trace_id": "trace-from-dataset-run",
  "evaluator_name": "custom_accuracy",
  "value": 0.95,
  "label": "pass",
  "reasoning": "Response matches expected output with minor formatting differences"
}
```

## Evaluators for Datasets
Some evaluators are especially useful with datasets:
| Evaluator | Best For | Needs `expectedOutput`? |
|---|---|---|
| Similarity | Comparing against golden outputs | Yes |
| Contains | Checking for required content | No |
| JSON Schema | Validating output structure | No |
| LLM Judge | Semantic quality assessment | No (but helps) |
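To build intuition for the first two rows, here is what a Contains-style and a crude Similarity-style check could look like locally. This is an illustrative approximation only, not the hosted evaluators' implementation (which, for similarity, would typically use embeddings rather than string matching):

```python
import difflib

def contains(output: str, required: list[str]) -> float:
    """Fraction of required phrases that appear in the output (case-insensitive)."""
    hits = sum(phrase.lower() in output.lower() for phrase in required)
    return hits / len(required)

def similarity(output: str, expected: str) -> float:
    """Rough string similarity in [0, 1] via difflib; a stand-in for semantic similarity."""
    return difflib.SequenceMatcher(None, output, expected).ratio()

print(contains("Go to Settings > Security > Reset Password.", ["reset password"]))  # → 1.0
```

Note how `contains` needs only a list of required phrases, while `similarity` cannot run without an `expectedOutput` to compare against, matching the table above.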
## Comparing Experiments
After running multiple experiments — for example, after changing a prompt or switching models — compare the results side by side in the dashboard. Look for:
- Score distribution changes — Did the average go up or down?
- Per-item regressions — Which specific test cases got worse?
- Cost/latency tradeoffs — Did you save money at the expense of quality?
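The per-item regression check can also be reproduced from exported scores. A sketch assuming each experiment's results are available as a dict mapping item id to score (this export shape is hypothetical, for illustration):

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list[str]:
    """Item ids whose score dropped by more than `tolerance` between two runs."""
    return [
        item_id
        for item_id, old_score in baseline.items()
        if item_id in candidate and candidate[item_id] < old_score - tolerance
    ]

baseline = {"item-1": 0.9, "item-2": 0.8, "item-3": 0.7}
candidate = {"item-1": 0.95, "item-2": 0.6, "item-3": 0.7}
print(find_regressions(baseline, candidate))  # → ['item-2']
```

A small `tolerance` is useful with LLM-judged scores, which can vary slightly between runs even with no real regression.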
## Best Practices
- Start small — 20–50 test cases covering your most important scenarios
- Include edge cases — The inputs that have caused real production failures
- Version your datasets — As your agent evolves, your test cases should too
- Use metadata — Tag items by category and difficulty so you can analyze results by segment
- Run on every deploy — Integrate with CI/CD to make experiments automatic
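As a concrete example of the last point, a deploy gate might look like this in GitHub Actions. Everything here except the `2signal eval run` command from this guide is a placeholder, and it assumes the CLI exits non-zero when `--fail-below` trips:

```yaml
name: eval-gate
on: [push]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install the 2signal CLI here; the distribution method is not
      # covered by this guide, so this step is left as a placeholder.
      # A non-zero exit from --fail-below fails the job and blocks the deploy.
      - run: 2signal eval run --project my-agent --dataset golden-tests --fail-below 0.85
```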