Testing

Dataset Evaluation

Datasets are curated collections of inputs and expected outputs. Run evaluators against them to measure agent quality before deploying.

Step 1: Create a dataset

Via the dashboard: Go to your project's Datasets page, create a new dataset, and add items. Each item has:

  • input — the test input (string or JSON)
  • expectedOutput — what the agent should produce (used by Similarity, Contains, etc.)
  • metadata — optional tags for filtering
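A dataset item can be sketched as a small JSON object. The field names below match the list above; the values and the category tag convention are invented examples, not a required schema:

```python
import json

# A minimal sketch of dataset items. Field names (input, expectedOutput,
# metadata) come from the docs above; the values are hypothetical.
items = [
    {
        "input": "What's your return policy?",
        "expectedOutput": "Returns are accepted within 30 days of purchase.",
        "metadata": {"category": "happy-path"},
    },
    {
        # input may also be structured JSON rather than a plain string
        "input": {"question": "Refund status?", "orderId": "A-123"},
        "expectedOutput": "Your refund is being processed.",
        "metadata": {"category": "edge-case"},
    },
]

# Serialize for upload or review; every item must at least carry an input.
payload = json.dumps(items, indent=2)
for item in items:
    assert "input" in item
```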

Step 2: Design your test cases

Recommended test case categories:

Category         Count   Examples
Happy path       10-15   Common, well-formed user queries
Edge cases       5-10    Very long inputs, empty inputs, special characters
Failure modes    5-10    Out-of-scope requests, adversarial inputs
Multi-language   2-5     If your agent handles multiple languages
Format-specific  5-10    Inputs that should trigger structured output

Aim for at least 25-50 items in a regression dataset.
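One way to keep a dataset honest against the category targets above is a small coverage check. This sketch assumes each item tags its category under `metadata["category"]` (a convention, not a required schema), and uses the minimum counts from the table:

```python
from collections import Counter

# Recommended minimum counts per category, taken from the table above.
recommended = {
    "happy-path": 10,
    "edge-case": 5,
    "failure-mode": 5,
    "format-specific": 5,
}

def coverage_gaps(items, minimums=recommended):
    """Return categories whose item count falls below the recommended minimum.

    Assumes each item stores its category in metadata["category"];
    adapt the accessor if your dataset uses different tags.
    """
    counts = Counter(i.get("metadata", {}).get("category") for i in items)
    return {cat: need - counts[cat] for cat, need in minimums.items()
            if counts[cat] < need}

# Hypothetical dataset: plenty of happy-path items, too few edge cases,
# and no failure-mode or format-specific items at all.
items = [{"metadata": {"category": "happy-path"}}] * 12 \
      + [{"metadata": {"category": "edge-case"}}] * 3
gaps = coverage_gaps(items)
# gaps -> {"edge-case": 2, "failure-mode": 5, "format-specific": 5}
```

Run a check like this whenever you add items, so gaps surface before an eval run rather than after.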

Step 3: Run evaluations

Via the dashboard: Select your dataset, choose evaluators, and click Run Evaluation. Results show per-item scores, aggregate pass rate, and average scores.

Via the CLI:

twosignal eval run \
  --dataset regression-tests \
  --evaluators helpfulness,format-check,latency-sla \
  --output results.json

Step 4: Interpret results

Example output:

Item                            helpfulness   format-check   latency-sla   Overall
"What's your return policy?"    0.90          1.0            1.0           PASS
"Tell me about..." (vague)      0.60          1.0            0.85          WARN
"" (empty input)                0.30          0.0            1.0           FAIL

Focus on items that fail. These are your regression candidates — the inputs most likely to break when you change your agent.
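Triage can be scripted against the exported results. The record shape below (a list of per-item objects with a `scores` mapping and a `status`) is an assumption mirroring the table above, not the documented schema of `results.json`; adapt the field names to your actual output:

```python
import json

# Hypothetical results payload shaped like the example table above.
results = json.loads("""
[
  {"input": "What's your return policy?",
   "scores": {"helpfulness": 0.90, "format-check": 1.0, "latency-sla": 1.0},
   "status": "PASS"},
  {"input": "",
   "scores": {"helpfulness": 0.30, "format-check": 0.0, "latency-sla": 1.0},
   "status": "FAIL"}
]
""")

# Collect regression candidates: failed items, with their weakest evaluator.
failures = [r for r in results if r["status"] == "FAIL"]
for r in failures:
    worst = min(r["scores"], key=r["scores"].get)
    print(f"{r['input']!r}: weakest on {worst} ({r['scores'][worst]})")
```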

Step 5: Track baselines over time

Each eval run is saved with a timestamp. Compare runs to detect quality drift:

  • Before a deploy: run your dataset, record the aggregate score as baseline
  • After a change: run again and compare
  • If scores drop: investigate which test cases regressed
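The before/after comparison above can be reduced to a simple drift check. This sketch assumes each run has been summarized into a mapping of evaluator name to aggregate score (how you load those summaries is up to you), and uses a small tolerance so run-to-run noise is not flagged as a regression:

```python
# Hypothetical aggregate scores for a baseline run and a post-change run.
baseline = {"helpfulness": 0.88, "format-check": 0.97, "latency-sla": 0.95}
current  = {"helpfulness": 0.81, "format-check": 0.97, "latency-sla": 0.96}

TOLERANCE = 0.02  # allow small run-to-run noise before flagging a drop

# Keep only evaluators whose score dropped by more than the tolerance.
regressions = {
    name: (baseline[name], score)
    for name, score in current.items()
    if baseline[name] - score > TOLERANCE
}
# helpfulness dropped 0.88 -> 0.81, beyond tolerance, so it is flagged.
```

A drop on a single evaluator points you straight at which test cases to re-inspect.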

Step 6: Integrate with CI/CD

Automate dataset evaluations in your deployment pipeline. See the CI/CD Evaluation cookbook for a complete walkthrough.

Tips

  • Start small (20 items) and grow your dataset as you discover real-world failure modes
  • Include items from production traces that caused issues — these are your best test cases
  • Update expected outputs when your agent's behavior intentionally changes
  • Run the full dataset weekly even if you're not deploying — model provider changes can cause regressions

See the Datasets guide for full reference.
