Dataset Evaluation
Datasets are curated collections of inputs and expected outputs. Run evaluators against them to measure agent quality before deploying.
Step 1: Create a dataset
Via the dashboard: Go to your project's Datasets page, create a new dataset, and add items. Each item has:
- `input` — the test input (string or JSON)
- `expectedOutput` — what the agent should produce (used by Similarity, Contains, etc.)
- `metadata` — optional tags for filtering
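As a sketch, a dataset item can be modeled as a plain mapping with the fields above. The `make_item` helper below is illustrative only, not part of the twosignal SDK:

```python
# Illustrative only: a dataset item as a plain dict, mirroring the
# fields described above. Not an official twosignal API.
def make_item(text, expected_output, metadata=None):
    """Build a dataset item; `metadata` holds optional tags for filtering."""
    item = {"input": text, "expectedOutput": expected_output}
    if metadata:
        item["metadata"] = metadata
    return item

item = make_item(
    "What's your return policy?",
    "Items can be returned within 30 days of purchase.",
    metadata={"category": "happy-path"},
)
```

Tagging each item with a category in `metadata` pays off in Step 2, where you balance the dataset across test-case categories.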
Step 2: Design your test cases
Recommended test case categories:
| Category | Count | Examples |
|---|---|---|
| Happy path | 10-15 | Common, well-formed user queries |
| Edge cases | 5-10 | Very long inputs, empty inputs, special characters |
| Failure modes | 5-10 | Out-of-scope requests, adversarial inputs |
| Multi-language | 2-5 | If your agent handles multiple languages |
| Format-specific | 5-10 | Inputs that should trigger structured output |
A good regression dataset contains at least 25-50 items.
Step 3: Run evaluations
Via the dashboard: Select your dataset, choose evaluators, and click Run Evaluation. Results show per-item scores, aggregate pass rate, and average scores.
Via the CLI:
```shell
twosignal eval run \
  --dataset regression-tests \
  --evaluators helpfulness,format-check,latency-sla \
  --output results.json
```
Step 4: Interpret results
Example output:
| Item | helpfulness | format-check | latency-sla | Overall |
|---|---|---|---|---|
| "What's your return policy?" | 0.90 | 1.0 | 1.0 | PASS |
| "Tell me about..." (vague) | 0.60 | 1.0 | 0.85 | WARN |
| "" (empty input) | 0.30 | 0.0 | 1.0 | FAIL |
Focus on items that fail. These are your regression candidates — the inputs most likely to break when you change your agent.
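One way to reproduce the PASS/WARN/FAIL rollup shown above is a worst-score threshold rule. The 0.8 and 0.5 cutoffs below are assumptions for illustration, not documented twosignal behavior:

```python
def overall_status(scores, warn_at=0.8, fail_at=0.5):
    """Collapse per-evaluator scores into PASS/WARN/FAIL by the worst score.
    Thresholds are illustrative; tune them to your own quality bar."""
    worst = min(scores.values())
    if worst < fail_at:
        return "FAIL"
    if worst < warn_at:
        return "WARN"
    return "PASS"
```

Under these assumed thresholds, the rule reproduces the three rows in the example table: the vague query's 0.60 helpfulness score lands in WARN, and the empty input's 0.0 format-check score lands in FAIL.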
Step 5: Track baselines over time
Each eval run is saved with a timestamp. Compare runs to detect quality drift:
- Before a deploy: run your dataset, record the aggregate score as baseline
- After a change: run again and compare
- If scores drop: investigate which test cases regressed
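The before/after comparison can be sketched as a small diff over per-evaluator aggregate scores. This helper is illustrative, not part of the twosignal CLI, and the 0.02 tolerance is an arbitrary example:

```python
def score_drift(baseline, current, tolerance=0.02):
    """Return evaluators whose aggregate score dropped by more than
    `tolerance` between a baseline run and the current run."""
    return {
        name: round(baseline[name] - current.get(name, 0.0), 4)
        for name in baseline
        if baseline[name] - current.get(name, 0.0) > tolerance
    }

baseline = {"helpfulness": 0.85, "format-check": 0.98}
current = {"helpfulness": 0.79, "format-check": 0.99}
regressions = score_drift(baseline, current)
```

Here `regressions` flags only `helpfulness` (down 0.06); the small `format-check` improvement is ignored.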
Step 6: Integrate with CI/CD
Automate dataset evaluations in your deployment pipeline. See the CI/CD Evaluation cookbook for a complete walkthrough.
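As a rough sketch, a CI gate can shell out to the CLI from Step 3 and block the deploy when the pass rate drops. Both the `passRate` field in `results.json` and the gating helper below are assumptions for illustration; the CI/CD Evaluation cookbook documents the supported approach:

```python
import json
import subprocess

def passes_gate(results, min_pass_rate=0.9):
    """Decide whether a deploy should proceed.
    Assumes the results file exposes a top-level "passRate" field,
    which is an illustrative guess at the output format."""
    return results.get("passRate", 0.0) >= min_pass_rate

def run_eval_gate(dataset="regression-tests"):
    # Shell out to the CLI (same flags as Step 3), then gate on results.
    subprocess.run(
        ["twosignal", "eval", "run",
         "--dataset", dataset,
         "--evaluators", "helpfulness,format-check,latency-sla",
         "--output", "results.json"],
        check=True,
    )
    with open("results.json") as f:
        return passes_gate(json.load(f))
```

Keeping the decision logic (`passes_gate`) separate from the CLI invocation makes the threshold easy to test without running an evaluation.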
Tips
- Start small (20 items) and grow your dataset as you discover real-world failure modes
- Include items from production traces that caused issues — these are your best test cases
- Update expected outputs when your agent's behavior intentionally changes
- Run the full dataset weekly even if you're not deploying — model provider changes can cause regressions
See the Datasets guide for full reference.