Dataset Evaluation
Datasets are curated collections of inputs and expected outputs. Run evaluators against them to measure agent quality before deploying.
Step 1: Create a dataset
Via the dashboard: Go to your project's Datasets page, create a new dataset, and add items. Each item has:
- `input` — the test input (string or JSON)
- `expectedOutput` — what the agent should produce (used by Similarity, Contains, etc.)
- `metadata` — optional tags for filtering
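As a sketch, a dataset item can be modeled as a plain mapping with the fields above. The `make_item` helper below is illustrative only, not part of the twosignal SDK:

```python
# Illustrative only: a dataset item as a plain dict, mirroring the
# fields described above. Not an official twosignal API.
def make_item(text, expected_output, metadata=None):
    """Build a dataset item; `metadata` holds optional tags for filtering."""
    item = {"input": text, "expectedOutput": expected_output}
    if metadata:
        item["metadata"] = metadata
    return item

item = make_item(
    "What's your return policy?",
    "Items can be returned within 30 days of purchase.",
    metadata={"category": "happy-path"},
)
```

Tagging each item with a category in `metadata` pays off in Step 2, where you balance the dataset across test-case categories.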
Step 2: Design your test cases
Recommended test case categories:
| Category | Count | Examples |
|---|---|---|
| Happy path | 10-15 | Common, well-formed user queries |
| Edge cases | 5-10 | Very long inputs, empty inputs, special characters |
| Failure modes | 5-10 | Out-of-scope requests, adversarial inputs |
| Multi-language | 2-5 | If your agent handles multiple languages |
| Format-specific | 5-10 | Inputs that should trigger structured output |
A good regression dataset contains at least 25-50 items.
Step 3: Run evaluations
Via the dashboard: Select your dataset, choose evaluators, and click Run Evaluation. Results show per-item scores, aggregate pass rate, and average scores.
Via the CLI:
```shell
twosignal eval run \
  --dataset regression-tests \
  --evaluators helpfulness,format-check,latency-sla \
  --output results.json
```
Step 4: Interpret results
Example output:
| Item | helpfulness | format-check | latency-sla | Overall |
|---|---|---|---|---|
| "What's your return policy?" | 0.90 | 1.0 | 1.0 | PASS |
| "Tell me about..." (vague) | 0.60 | 1.0 | 0.85 | WARN |
| "" (empty input) | 0.30 | 0.0 | 1.0 | FAIL |
Focus on items that fail. These are your regression candidates — the inputs most likely to break when you change your agent.
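One way to reproduce the PASS/WARN/FAIL rollup shown above is a worst-score threshold rule. The 0.8 and 0.5 cutoffs below are assumptions for illustration, not documented twosignal behavior:

```python
def overall_status(scores, warn_at=0.8, fail_at=0.5):
    """Collapse per-evaluator scores into PASS/WARN/FAIL by the worst score.
    Thresholds are illustrative; tune them to your own quality bar."""
    worst = min(scores.values())
    if worst < fail_at:
        return "FAIL"
    if worst < warn_at:
        return "WARN"
    return "PASS"
```

Under these assumed thresholds, the rule reproduces the three rows in the example table: the vague query's 0.60 helpfulness score lands in WARN, and the empty input's 0.0 format-check score lands in FAIL.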
Step 5: Track baselines over time
Each eval run is saved with a timestamp. Compare runs to detect quality drift:
- Before a deploy: run your dataset, record the aggregate score as baseline
- After a change: run again and compare
- If scores drop: investigate which test cases regressed
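The before/after comparison can be sketched as a small diff over per-evaluator aggregate scores. This helper is illustrative, not part of the twosignal CLI, and the 0.02 tolerance is an arbitrary example:

```python
def score_drift(baseline, current, tolerance=0.02):
    """Return evaluators whose aggregate score dropped by more than
    `tolerance` between a baseline run and the current run."""
    return {
        name: round(baseline[name] - current.get(name, 0.0), 4)
        for name in baseline
        if baseline[name] - current.get(name, 0.0) > tolerance
    }

baseline = {"helpfulness": 0.85, "format-check": 0.98}
current = {"helpfulness": 0.79, "format-check": 0.99}
regressions = score_drift(baseline, current)
```

Here `regressions` flags only `helpfulness` (down 0.06); the small `format-check` improvement is ignored.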
Step 6: Integrate with CI/CD
Automate dataset evaluations in your deployment pipeline. See the CI/CD Evaluation cookbook for a complete walkthrough.
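As a rough sketch, a CI gate can shell out to the CLI from Step 3 and block the deploy when the pass rate drops. Both the `passRate` field in `results.json` and the gating helper below are assumptions for illustration; the CI/CD Evaluation cookbook documents the supported approach:

```python
import json
import subprocess

def passes_gate(results, min_pass_rate=0.9):
    """Decide whether a deploy should proceed.
    Assumes the results file exposes a top-level "passRate" field,
    which is an illustrative guess at the output format."""
    return results.get("passRate", 0.0) >= min_pass_rate

def run_eval_gate(dataset="regression-tests"):
    # Shell out to the CLI (same flags as Step 3), then gate on results.
    subprocess.run(
        ["twosignal", "eval", "run",
         "--dataset", dataset,
         "--evaluators", "helpfulness,format-check,latency-sla",
         "--output", "results.json"],
        check=True,
    )
    with open("results.json") as f:
        return passes_gate(json.load(f))
```

Keeping the decision logic (`passes_gate`) separate from the CLI invocation makes the threshold easy to test without running an evaluation.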
Tips
- Start small (20 items) and grow your dataset as you discover real-world failure modes
- Include items from production traces that caused issues — these are your best test cases
- Update expected outputs when your agent's behavior intentionally changes
- Run the full dataset weekly even if you're not deploying — model provider changes can cause regressions
See the Datasets guide for full reference.