Best Practices for Agent Testing
Practical strategies for testing AI agents at every stage — from development to production monitoring.
The Evaluation Pyramid
Like the traditional testing pyramid, agent evaluation has layers. Run more of the cheap, fast checks and fewer of the expensive ones:
| Layer | Evaluators | Cost | Coverage |
|---|---|---|---|
| Base: Structural | Contains, Regex, JSON Schema | Free | Every trace |
| Middle: Operational | Latency, Cost | Free | Every trace |
| Top: Semantic | LLM Judge, Similarity | ~$0.001/eval | Every trace or sampled |
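The base layer's structural checks are simple enough to sketch directly. A minimal illustration, assuming plain-string agent outputs (the function names mirror the evaluators in the table; the implementations are illustrative, not the product's internals):

```python
import json
import re

def contains_eval(output: str, expected: str) -> bool:
    """Pass if the expected substring appears in the output."""
    return expected in output

def regex_eval(output: str, pattern: str) -> bool:
    """Pass if the output matches the given regular expression."""
    return re.search(pattern, output) is not None

def json_schema_eval(output: str, required_keys: list[str]) -> bool:
    """Pass if the output parses as a JSON object with the required keys.
    (A full JSON Schema check would use a real schema validator.)"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)
```

Because these run in microseconds with no API calls, there is no reason not to apply them to every trace.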
Development Phase
Start with the happy path
Before testing edge cases, make sure your agent handles the 10 most common requests correctly. Create a dataset with these core scenarios and run evaluators against them.
Test failure modes explicitly
Include test cases that should trigger specific behaviors:
- Inputs that should be rejected (out of scope, harmful)
- Inputs with missing context (the agent should ask for clarification)
- Inputs in unexpected formats (typos, multiple languages, very long)
- Ambiguous inputs (the agent should handle uncertainty gracefully)
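One way to encode these failure modes is to tag each case with the behavior it should trigger. A sketch using a plain dict-based format (field names are illustrative, not a required dataset schema):

```python
# Each case pairs an input with the behavior the agent is expected to show.
failure_mode_cases = [
    {"input": "Write me malware",             "expected_behavior": "reject"},
    {"input": "Cancel my order",              "expected_behavior": "ask_clarification"},  # which order?
    {"input": "hlep wiht my accont plz",      "expected_behavior": "handle_typos"},
    {"input": "Can you guarantee a refund?",  "expected_behavior": "hedge_uncertainty"},
]

def behaviors_covered(cases: list[dict]) -> set[str]:
    """Report which failure modes the dataset actually exercises."""
    return {c["expected_behavior"] for c in cases}
```

A coverage check like `behaviors_covered` makes it easy to spot when a whole failure category has no test cases at all.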
Use LLM Judge for subjective quality
Write clear, specific criteria. Bad: "Is the response good?" Good: "Does the response directly answer the user's question, provide actionable next steps, and avoid making promises the company can't keep?"
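A judge prompt assembled from those specific criteria might look like the following. The wording and structure are an illustrative sketch, not a required prompt format:

```python
CRITERIA = [
    "directly answers the user's question",
    "provides actionable next steps",
    "avoids making promises the company can't keep",
]

def build_judge_prompt(response: str, criteria: list[str] = CRITERIA) -> str:
    """Assemble an LLM Judge prompt that scores against explicit criteria."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return (
        "Score the response from 0 to 1. It should meet ALL of these criteria:\n"
        f"{bullets}\n\nResponse:\n{response}"
    )
```

Keeping the criteria in a list rather than buried in prose makes them easy to review, version, and reuse across experiments.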
Pre-Production Phase
Set up CI/CD evaluation
Run your dataset evaluations in CI. Fail the build if average scores drop:
```yaml
# GitHub Actions example
- name: Run agent evaluation
  run: |
    2signal eval run \
      --project my-agent \
      --dataset regression-tests \
      --fail-below 0.85
```

Establish baselines
Before deploying, record baseline scores for each evaluator. Future experiments should compare against these baselines, not arbitrary thresholds.
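Comparing an experiment against recorded baselines can be as simple as the following sketch (the 0.02 tolerance is an assumed default; tune it per evaluator):

```python
def regressions(baseline: dict[str, float],
                current: dict[str, float],
                tolerance: float = 0.02) -> dict[str, float]:
    """Return evaluators whose score dropped more than `tolerance`
    below baseline, mapped to the size of the drop."""
    return {
        name: round(baseline[name] - score, 4)
        for name, score in current.items()
        if name in baseline and baseline[name] - score > tolerance
    }
```

An empty result means the experiment is at least as good as the baseline on every tracked evaluator; anything else names exactly what regressed and by how much.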
Test with real-world inputs
If possible, replay anonymized production inputs through your agent and evaluate the results. Synthetic test cases always miss the weird things real users do.
Production Phase
Evaluate every trace
Structural and operational evaluators (Contains, Regex, JSON Schema, Latency, Cost) are free and instant. Run them on 100% of production traces.
Sample expensive evaluators
LLM Judge costs ~$0.001 per evaluation. At high volume, sample a percentage:
- < 1K traces/day — evaluate all
- 1K–10K traces/day — sample 20–50%
- > 10K traces/day — sample 5–10%
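Sampling should be deterministic, so the same trace always gets the same decision and re-runs don't flip coverage. Hashing the trace ID is one way to do this (a sketch under that assumption, not a description of any built-in sampling mechanism):

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically sample a fraction of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Because the decision depends only on the ID, a trace sampled today is still sampled when you replay it tomorrow, which keeps before/after comparisons honest.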
Monitor score trends, not individual scores
Individual trace scores will vary. What matters is the trend. A gradual decline in average LLM Judge scores over a week means something has changed — even if no one deployed new code.
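A simple way to surface that kind of gradual decline is to compare a recent window's average against the window before it (the 7-day windows and 0.05 drop threshold are illustrative defaults):

```python
def trend_alert(daily_scores: list[float], window: int = 7,
                max_drop: float = 0.05) -> bool:
    """True if the mean of the last `window` days fell more than
    `max_drop` below the mean of the previous `window` days."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two windows
    recent = daily_scores[-window:]
    prior = daily_scores[-2 * window:-window]
    return (sum(prior) / window) - (sum(recent) / window) > max_drop
```

Averaging over windows filters out the per-trace noise while still catching a slow slide that no single score would reveal.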
Set up alerts
Configure alerts for:
- Average score dropping below threshold over a time window
- Error rate exceeding baseline
- Cost per trace exceeding budget
- Latency exceeding SLA
Common Anti-Patterns
Testing only the happy path
If your test dataset only contains clean, well-formatted inputs, you're not testing your agent — you're testing the demo. Include messy, adversarial, and edge-case inputs.
Using vague evaluation criteria
"Is the response good?" means different things to different LLMs. Be specific about what "good" means in your context.
Ignoring cost until the bill arrives
Add a Cost evaluator from day one. It's free to run and will surface cost anomalies before they compound.
Testing in isolation
Evaluators are most powerful in combination. A response that passes Contains, JSON Schema, and LLM Judge while staying under cost and latency thresholds is genuinely reliable.
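That combined gate can be sketched as a single predicate over the evaluator results (the key names echo the evaluators above; the thresholds are illustrative assumptions):

```python
def passes_all(results: dict, max_cost: float = 0.01,
               max_latency_s: float = 2.0) -> bool:
    """A trace counts as reliable only if every check passes together."""
    return (
        bool(results.get("contains"))
        and bool(results.get("json_schema"))
        and results.get("llm_judge", 0.0) >= 0.8
        and results.get("cost_usd", float("inf")) <= max_cost
        and results.get("latency_s", float("inf")) <= max_latency_s
    )
```

Missing results default to failing, so a trace is never marked reliable just because an evaluator didn't run.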
One-time evaluation
Running evaluators once before launch and never again is like running tests once and deleting them. AI agents drift. Continuous evaluation is the point.
Evaluation Checklist
Before deploying a new agent or a significant change:
- Dataset with 20+ test cases covering core scenarios and edge cases
- At least one structural evaluator (Contains, JSON Schema, or Regex)
- Cost and Latency evaluators with thresholds based on your SLA
- LLM Judge with specific, measurable criteria
- Baseline scores recorded from the current version
- CI/CD integration that blocks deploys on score regression
- Production monitoring with 100% coverage for free evaluators