March 15, 2026 · 8 min read

Traditional software is largely deterministic: given the same input, you get the same output. You write unit tests, they pass or fail, and you ship with confidence.

AI agents are different. The same input can produce very different outputs depending on the model, the prompt, the temperature, and even when the request is made, because hosted models can change underneath you. Your agent might work perfectly in development and hallucinate in production. It might pass every test case you wrote and fail on the first real user query you didn't anticipate.

This is why AI agent testing requires a fundamentally different approach.

The Three Failure Modes

AI agents fail in three ways that traditional testing doesn't catch:

1. Regression Without Code Changes

Your agent's behavior can change without anyone touching the code. Model providers update weights, fine-tuning data shifts, and suddenly your carefully tuned prompts produce different results. Without continuous evaluation, you won't know until a user complains.
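One way to notice this kind of silent regression is to compare the evaluator pass rate of a recent window of traces against a known-good baseline. The sketch below is illustrative only: `pass_rate`, `has_regressed`, and the 5-point `max_drop` tolerance are hypothetical names and values, not part of any particular tool.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of evaluation results that passed."""
    if not results:
        raise ValueError("need at least one result")
    return sum(results) / len(results)


def has_regressed(
    baseline: list[bool], recent: list[bool], max_drop: float = 0.05
) -> bool:
    """Flag a regression when the pass rate falls by more than max_drop."""
    return pass_rate(baseline) - pass_rate(recent) > max_drop


# Hypothetical data: 95% of traces passed last week, 82% pass this week,
# with no code change in between.
baseline = [True] * 95 + [False] * 5
recent = [True] * 82 + [False] * 18

print(has_regressed(baseline, recent))  # True
```

A fixed tolerance is the simplest possible rule. With small windows, a statistical test would be less noisy.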

2. Semantic Correctness vs. Syntactic Correctness

Your agent might return valid JSON, respond in the right format, and hit every structural check — while being completely wrong about the content. Traditional assertions can't catch an agent that confidently returns incorrect information in a perfectly formatted response.

3. Cost and Latency Drift

An agent that works correctly but takes 30 seconds and costs $0.50 per request is still broken. Cost and latency aren't just operational concerns — they're quality metrics that need to be evaluated alongside correctness.

Building a Testing Strategy

A robust AI agent testing strategy combines three layers:

Layer 1: Deterministic Checks

Start with the things you can verify exactly. Does the output contain required fields? Does it match expected patterns? Is the JSON valid? These are fast, cheap, and catch obvious breakages.

# Evaluators: CONTAINS, REGEX_MATCH, JSON_SCHEMA
# Run on every trace, zero cost, instant results
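To make the three checks concrete, here is a minimal standard-library sketch of what they might look like. The function names are hypothetical, and the field-and-type check is a simplified stand-in for full JSON Schema validation rather than how any specific evaluator is implemented.

```python
import json
import re


def check_contains(output: str, required: list[str]) -> bool:
    """Pass only if every required substring appears in the output."""
    return all(item in output for item in required)


def check_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_json_fields(output: str, required_fields: dict[str, type]) -> bool:
    """Pass if the output is a JSON object with the required fields and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        name in data and isinstance(data[name], expected)
        for name, expected in required_fields.items()
    )


# Hypothetical agent reply used only for illustration.
reply = '{"answer": "Your order ships on 2026-03-18.", "confidence": 0.92}'

results = {
    "contains": check_contains(reply, ["order"]),
    "regex": check_regex(reply, r"\d{4}-\d{2}-\d{2}"),
    "json_fields": check_json_fields(reply, {"answer": str, "confidence": float}),
}
print(results)
```

Checks like these need no model call, so running them on every trace adds almost no overhead.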

Layer 2: Semantic Evaluation

Use LLM-as-judge to evaluate things humans care about: Is the response helpful? Is it accurate? Does it follow the guidelines? This catches the failures that structural checks miss.

# Evaluator: LLM_JUDGE
# Run on every trace (or on a sample at high volume)
# Cost: ~$0.001 per eval with gpt-4o-mini
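The LLM-as-judge pattern can be sketched roughly as follows: build a grading prompt, send it to a model, and parse a structured verdict. Everything here is an assumption for illustration, including the prompt wording, the `judge_reply` function, and the 0.7 pass threshold. The model call is injected as a plain callable so the example runs without network access. In practice it would wrap a real model client.

```python
import json
from typing import Callable

# Hypothetical grading prompt; real judge prompts are usually more detailed.
JUDGE_PROMPT = """You are grading an AI agent's reply.
Question: {question}
Reply: {reply}
Criteria: {criteria}
Respond with JSON only: {{"score": <number from 0 to 1>, "reason": "<one sentence>"}}"""


def judge_reply(
    question: str,
    reply: str,
    criteria: str,
    call_model: Callable[[str], str],
    threshold: float = 0.7,
) -> dict:
    """Ask a judge model to score a reply, then parse and threshold the verdict."""
    prompt = JUDGE_PROMPT.format(question=question, reply=reply, criteria=criteria)
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
        score = float(verdict["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Treat an unparseable verdict as a failure instead of guessing a score.
        return {"passed": False, "score": None, "reason": "unparseable judge output"}
    return {
        "passed": score >= threshold,
        "score": score,
        "reason": verdict.get("reason", ""),
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch is runnable offline."""
    return '{"score": 0.9, "reason": "Directly answers the question."}'


result = judge_reply(
    "When does my order ship?",
    "It ships on March 18.",
    "Is the reply helpful and specific?",
    fake_model,
)
print(result)
```

Failing closed on malformed judge output is a deliberate choice here: a judge that cannot be parsed should not silently count as a pass.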

Layer 3: Operational Thresholds

Set bounds on latency and cost. An agent that's correct but costs 10x what it should is still a problem. These evaluators catch drift before it hits your bill.

# Evaluators: LATENCY, COST
# Run on every trace, zero cost, instant results
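A threshold evaluator can be as simple as comparing measured values against fixed limits. The sketch below assumes latency and cost have already been recorded for a trace. The `Thresholds` class and the specific limits are hypothetical and should be set from your own budget and SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_latency_s: float
    max_cost_usd: float


def check_thresholds(
    latency_s: float, cost_usd: float, limits: Thresholds
) -> dict[str, bool]:
    """Return a pass/fail flag for each operational metric."""
    return {
        "latency": latency_s <= limits.max_latency_s,
        "cost": cost_usd <= limits.max_cost_usd,
    }


# Illustrative limits: 5 seconds and $0.05 per request.
limits = Thresholds(max_latency_s=5.0, max_cost_usd=0.05)

# The 30-second, $0.50 request from earlier fails both checks.
outcome = check_thresholds(latency_s=30.0, cost_usd=0.50, limits=limits)
print(outcome)  # {'latency': False, 'cost': False}
```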

Continuous vs. One-Shot Testing

The biggest mistake teams make is treating AI agent testing like traditional test suites — run once before deploy, then forget about it. AI agents need continuous evaluation because:

  • Model behavior changes over time (even without code changes)
  • Real user inputs are more diverse than any test dataset
  • Edge cases emerge gradually as usage scales
  • Cost and latency profiles shift with traffic patterns

The right approach is to evaluate every trace in production. Deterministic checks and threshold evaluators are essentially free. LLM-as-judge can be sampled if cost is a concern.
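If you do sample the LLM-as-judge step, one common approach is to hash the trace ID into a bucket, so the same trace always gets the same decision and the sampled share stays close to the target rate. This is a sketch under assumptions: `should_run_judge`, the trace ID format, and the 10% rate are made up for illustration.

```python
import hashlib


def should_run_judge(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of traces by hashing the ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical trace IDs; roughly 10% of them should be selected.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = sum(should_run_judge(t, 0.10) for t in trace_ids)
print(f"judged {sampled} of {len(trace_ids)} traces")
```

Hash-based sampling is reproducible, unlike `random.random()`, which makes it easier to explain later why a given trace was or was not judged.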

Getting Started with 2signal

2signal makes this straightforward. Instrument your agent with the Python SDK, configure evaluators in the dashboard, and every trace is automatically scored:

from twosignal import TwoSignal, observe
from twosignal.wrappers import wrap_openai
from openai import OpenAI

# Create the 2signal client
ts = TwoSignal()
# Wrap the OpenAI client so its calls are captured in traces
client = wrap_openai(OpenAI())

@observe
def support_agent(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    # message.content can be None; fall back to "" to honor the str return type
    return response.choices[0].message.content or ""

That's it. Every call is traced, costs are tracked, and your configured evaluators run automatically in the background. No test files to maintain, no CI pipeline to configure (though you can add that too via the CLI).

What to Evaluate First

If you're starting from zero, here's the priority order:

  1. JSON Schema — if your agent returns structured data, validate it
  2. Contains — check for required content or forbidden patterns
  3. Cost — set a per-request budget before you get a surprise bill
  4. Latency — set SLA thresholds based on your use case
  5. LLM Judge — evaluate helpfulness, accuracy, or tone

Start simple, measure everything, and add complexity as you learn where your agent actually fails.


Ready to start? Read the Getting Started guide or explore the evaluators.