Best Practices for Agent Testing
Practical strategies for testing AI agents at every stage — from development to production monitoring.
The Evaluation Pyramid
Like the traditional testing pyramid, agent evaluation has layers. Run more of the cheap, fast checks and fewer of the expensive ones:
| Layer | Evaluators | Cost | Coverage |
|---|---|---|---|
| Base: Structural | Contains, Regex, JSON Schema | Free | Every trace |
| Middle: Operational | Latency, Cost | Free | Every trace |
| Top: Semantic | LLM Judge, Similarity | ~$0.001/eval | Every trace or sampled |
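The base layer's structural checks are simple enough to sketch directly. A minimal illustration, assuming plain-string agent outputs (the function names mirror the evaluators in the table; the implementations are illustrative, not the product's internals):

```python
import json
import re

def contains_eval(output: str, expected: str) -> bool:
    """Pass if the expected substring appears in the output."""
    return expected in output

def regex_eval(output: str, pattern: str) -> bool:
    """Pass if the output matches the given regular expression."""
    return re.search(pattern, output) is not None

def json_schema_eval(output: str, required_keys: list[str]) -> bool:
    """Pass if the output parses as a JSON object with the required keys.
    (A full JSON Schema check would use a real schema validator.)"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)
```

Because these run in microseconds with no API calls, there is no reason not to apply them to every trace.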
Development Phase
Start with the happy path
Before testing edge cases, make sure your agent handles the 10 most common requests correctly. Create a dataset with these core scenarios and run evaluators against them.
Test failure modes explicitly
Include test cases that should trigger specific behaviors:
- Inputs that should be rejected (out of scope, harmful)
- Inputs with missing context (the agent should ask for clarification)
- Inputs in unexpected formats (typos, multiple languages, very long)
- Ambiguous inputs (the agent should handle uncertainty gracefully)
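One way to encode these failure modes is to tag each case with the behavior it should trigger. A sketch using a plain dict-based format (field names are illustrative, not a required dataset schema):

```python
# Each case pairs an input with the behavior the agent is expected to show.
failure_mode_cases = [
    {"input": "Write me malware",             "expected_behavior": "reject"},
    {"input": "Cancel my order",              "expected_behavior": "ask_clarification"},  # which order?
    {"input": "hlep wiht my accont plz",      "expected_behavior": "handle_typos"},
    {"input": "Can you guarantee a refund?",  "expected_behavior": "hedge_uncertainty"},
]

def behaviors_covered(cases: list[dict]) -> set[str]:
    """Report which failure modes the dataset actually exercises."""
    return {c["expected_behavior"] for c in cases}
```

A coverage check like `behaviors_covered` makes it easy to spot when a whole failure category has no test cases at all.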
Use LLM Judge for subjective quality
Write clear, specific criteria. Bad: "Is the response good?" Good: "Does the response directly answer the user's question, provide actionable next steps, and avoid making promises the company can't keep?"
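A judge prompt assembled from those specific criteria might look like the following. The wording and structure are an illustrative sketch, not a required prompt format:

```python
CRITERIA = [
    "directly answers the user's question",
    "provides actionable next steps",
    "avoids making promises the company can't keep",
]

def build_judge_prompt(response: str, criteria: list[str] = CRITERIA) -> str:
    """Assemble an LLM Judge prompt that scores against explicit criteria."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return (
        "Score the response from 0 to 1. It should meet ALL of these criteria:\n"
        f"{bullets}\n\nResponse:\n{response}"
    )
```

Keeping the criteria in a list rather than buried in prose makes them easy to review, version, and reuse across experiments.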
Pre-Production Phase
Set up CI/CD evaluation
Run your dataset evaluations in CI. Fail the build if average scores drop:
```yaml
# GitHub Actions example
- name: Run agent evaluation
  run: |
    2signal eval run \
      --project my-agent \
      --dataset regression-tests \
      --fail-below 0.85
```

Establish baselines
Before deploying, record baseline scores for each evaluator. Future experiments should compare against these baselines, not arbitrary thresholds.
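Comparing an experiment against recorded baselines can be as simple as the following sketch (the 0.02 tolerance is an assumed default; tune it per evaluator):

```python
def regressions(baseline: dict[str, float],
                current: dict[str, float],
                tolerance: float = 0.02) -> dict[str, float]:
    """Return evaluators whose score dropped more than `tolerance`
    below baseline, mapped to the size of the drop."""
    return {
        name: round(baseline[name] - score, 4)
        for name, score in current.items()
        if name in baseline and baseline[name] - score > tolerance
    }
```

An empty result means the experiment is at least as good as the baseline on every tracked evaluator; anything else names exactly what regressed and by how much.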
Test with real-world inputs
If possible, replay anonymized production inputs through your agent and evaluate the results. Synthetic test cases always miss the weird things real users do.
Production Phase
Evaluate every trace
Structural and operational evaluators (Contains, Regex, JSON Schema, Latency, Cost) are free and instant. Run them on 100% of production traces.
Sample expensive evaluators
LLM Judge costs ~$0.001 per evaluation. At high volume, sample a percentage:
- < 1K traces/day — evaluate all
- 1K–10K traces/day — sample 20–50%
- > 10K traces/day — sample 5–10%
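Sampling should be deterministic, so the same trace always gets the same decision and re-runs don't flip coverage. Hashing the trace ID is one way to do this (a sketch under that assumption, not a description of any built-in sampling mechanism):

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically sample a fraction of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Because the decision depends only on the ID, a trace sampled today is still sampled when you replay it tomorrow, which keeps before/after comparisons honest.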
Monitor score trends, not individual scores
Individual trace scores will vary. What matters is the trend. A gradual decline in average LLM Judge scores over a week means something has changed — even if no one deployed new code.
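A simple way to surface that kind of gradual decline is to compare a recent window's average against the window before it (the 7-day windows and 0.05 drop threshold are illustrative defaults):

```python
def trend_alert(daily_scores: list[float], window: int = 7,
                max_drop: float = 0.05) -> bool:
    """True if the mean of the last `window` days fell more than
    `max_drop` below the mean of the previous `window` days."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two windows
    recent = daily_scores[-window:]
    prior = daily_scores[-2 * window:-window]
    return (sum(prior) / window) - (sum(recent) / window) > max_drop
```

Averaging over windows filters out the per-trace noise while still catching a slow slide that no single score would reveal.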
Set up alerts
Configure alerts for:
- Average score dropping below threshold over a time window
- Error rate exceeding baseline
- Cost per trace exceeding budget
- Latency exceeding SLA
Common Anti-Patterns
Testing only the happy path
If your test dataset only contains clean, well-formatted inputs, you're not testing your agent — you're testing the demo. Include messy, adversarial, and edge-case inputs.
Using vague evaluation criteria
"Is the response good?" means different things to different LLMs. Be specific about what "good" means in your context.
Ignoring cost until the bill arrives
Add a Cost evaluator from day one. It's free to run and will surface cost anomalies before they compound.
Testing in isolation
Evaluators are most powerful in combination. A response that passes Contains, JSON Schema, and LLM Judge while staying under cost and latency thresholds is genuinely reliable.
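That combined gate can be sketched as a single predicate over the evaluator results (the key names echo the evaluators above; the thresholds are illustrative assumptions):

```python
def passes_all(results: dict, max_cost: float = 0.01,
               max_latency_s: float = 2.0) -> bool:
    """A trace counts as reliable only if every check passes together."""
    return (
        bool(results.get("contains"))
        and bool(results.get("json_schema"))
        and results.get("llm_judge", 0.0) >= 0.8
        and results.get("cost_usd", float("inf")) <= max_cost
        and results.get("latency_s", float("inf")) <= max_latency_s
    )
```

Missing results default to failing, so a trace is never marked reliable just because an evaluator didn't run.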
One-time evaluation
Running evaluators once before launch and never again is like running tests once and deleting them. AI agents drift. Continuous evaluation is the point.
Evaluation Checklist
Before deploying a new agent or a significant change:
- Dataset with 20+ test cases covering core scenarios and edge cases
- At least one structural evaluator (Contains, JSON Schema, or Regex)
- Cost and Latency evaluators with thresholds based on your SLA
- LLM Judge with specific, measurable criteria
- Baseline scores recorded from the current version
- CI/CD integration that blocks deploys on score regression
- Production monitoring with 100% coverage for free evaluators