Running a single evaluator on your traces is like having a single test for your entire application. It might catch the obvious failures, but the subtle ones — the ones that actually hurt users — slip right through.
The real power of evaluation comes from combining multiple evaluators into a pipeline. Here are seven patterns we see in production deployments.
## Pattern 1: The Safety Net
Use case: Any agent that generates user-facing text.
Combine CONTAINS (negated) with LLM_JUDGE. The Contains evaluator runs first as a fast, cheap check for forbidden content — competitor names, profanity, confidential terms. The LLM Judge runs second to evaluate tone, helpfulness, and accuracy.
```yaml
# Fast guardrail: block forbidden content
CONTAINS: { value: ["competitor_name", "confidential"], negate: true }

# Semantic quality check
LLM_JUDGE: { criteria: "Is the response helpful, accurate, and professional?" }
```
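The point of the ordering is short-circuiting: the cheap check runs first, and the expensive judge only runs if it passes. In application code that logic looks roughly like this — `safety_net` and the `judge` stub are a hypothetical sketch, not part of the 2signal SDK:

```python
# Sketch of the two-stage safety net. In production, `judge` would wrap an
# LLM call; a stub lambda keeps this example runnable.

def safety_net(response: str, forbidden: list[str], judge) -> dict:
    """Run the cheap guardrail first; only call the LLM judge if it passes."""
    # Stage 1: CONTAINS with negate=true -- fail fast if any forbidden term appears
    hits = [term for term in forbidden if term.lower() in response.lower()]
    if hits:
        return {"passed": False, "stage": "contains", "hits": hits}
    # Stage 2: semantic quality check (expensive, so it runs last)
    return {"passed": judge(response), "stage": "llm_judge", "hits": []}

result = safety_net("Our product beats competitor_name!", ["competitor_name"],
                    judge=lambda r: True)
# result["stage"] is "contains" -- the judge was never invoked
```

Because the guardrail vetoes before the judge runs, a forbidden-content failure costs nothing beyond a string scan.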
## Pattern 2: The Structured Output Validator
Use case: Agents that return JSON, tool calls, or structured data.
Chain JSON_SCHEMA with CONTAINS. First validate the structure is correct, then check that required values are present in the response.
```yaml
# Validate structure
JSON_SCHEMA: {
  schema: {
    type: "object",
    required: ["action", "reasoning"],
    properties: {
      action: { type: "string", enum: ["approve", "reject", "escalate"] },
      reasoning: { type: "string", minLength: 20 }
    }
  }
}

# Validate content
CONTAINS: { value: ["action"], mode: "all" }
```
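To see what the JSON_SCHEMA evaluator is checking, here is a minimal stdlib sketch that enforces the same three constraints from the schema above (required fields, the `action` enum, and `minLength` on `reasoning`). In production the evaluator does this for you; `validate_output` is illustrative only:

```python
import json

SCHEMA_ENUM = {"approve", "reject", "escalate"}

def validate_output(raw: str) -> list[str]:
    """Return a list of validation errors (empty means the output passed)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    errors = []
    for field in ("action", "reasoning"):       # required: ["action", "reasoning"]
        if field not in data:
            errors.append(f"missing required field: {field}")
    if data.get("action") not in SCHEMA_ENUM:   # enum constraint on action
        errors.append("action must be approve/reject/escalate")
    if len(data.get("reasoning", "")) < 20:     # minLength: 20 on reasoning
        errors.append("reasoning too short (minLength 20)")
    return errors

good = '{"action": "approve", "reasoning": "Meets all policy requirements."}'
bad = '{"action": "maybe"}'
```

Note that a response can be syntactically valid JSON and still fail every semantic constraint — which is exactly why structure and content are validated as separate steps.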
## Pattern 3: The Budget Enforcer
Use case: High-volume agents where cost matters.
Combine COST and LATENCY to catch performance regressions before they become expensive. Set tight thresholds based on your SLA.
```yaml
# Per-request cost ceiling
COST: { max_cost_usd: 0.05, target_cost_usd: 0.02 }

# Response time SLA
LATENCY: { max_ms: 5000, target_ms: 2000 }
```
When either evaluator starts producing low scores, you know your agent is drifting — whether from prompt bloat, unnecessary tool calls, or model changes.
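One plausible way to turn a target/max pair into a score — full marks at or below target, zero at or above max, linear in between — can be sketched as follows. The exact scoring curve 2signal uses isn't specified here, so treat this as an assumption:

```python
def threshold_score(value: float, target: float, maximum: float) -> float:
    """1.0 at or below target, 0.0 at or above maximum, linear in between."""
    if value <= target:
        return 1.0
    if value >= maximum:
        return 0.0
    return (maximum - value) / (maximum - target)

# COST config above: a $0.035 request lands halfway between target and max
cost_score = threshold_score(0.035, target=0.02, maximum=0.05)
# LATENCY config above: a 3500 ms response likewise scores 0.5
latency_score = threshold_score(3500, target=2000, maximum=5000)
```

A linear ramp like this gives you early warning: scores degrade gradually as costs creep toward the ceiling, rather than flipping from pass to fail at a single threshold.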
## Pattern 4: The Regression Detector
Use case: Agents with known expected outputs (golden datasets).
Use SIMILARITY against a reference dataset. When similarity drops below your threshold, something has changed in your agent's behavior.
```yaml
# Compare against golden outputs
SIMILARITY: { threshold: 0.8, tokenizer: "word" }

# Provide expected outputs via datasets
# Dashboard → Datasets → Create → Add test cases
```
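The exact similarity metric behind the SIMILARITY evaluator isn't documented here, but a word-tokenized ratio via Python's stdlib `difflib` illustrates the idea of comparing an output against its golden reference:

```python
import difflib

def word_similarity(candidate: str, reference: str) -> float:
    """Word-tokenized similarity ratio in [0, 1] (difflib gestalt matching)."""
    return difflib.SequenceMatcher(
        None, candidate.lower().split(), reference.lower().split()
    ).ratio()

reference = "Your refund has been processed and will arrive in 5 business days"
drifted = "Your refund was processed and should arrive within 7 to 10 days"

score = word_similarity(drifted, reference)
passed = score >= 0.8   # the threshold from the config above
```

A paraphrase that preserves meaning but reshuffles wording will still lower a token-level score, so tune the threshold against a few known-good variants before relying on it as a regression signal.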
## Pattern 5: The Format Police
Use case: Agents that need to follow specific output formats.
Use REGEX_MATCH to validate format patterns — dates, IDs, URLs, code blocks, or any structured text.
```yaml
# Must include a ticket ID
REGEX_MATCH: { pattern: "TICKET-\\d{4,}" }

# Must NOT include raw SQL
REGEX_MATCH: { pattern: "SELECT.*FROM", negate: true }
```
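The negate flag simply inverts the match result. Here is the equivalent check in plain Python `re`, with both patterns from the config applied to one response — `regex_match` is a hypothetical helper, not the evaluator's actual implementation:

```python
import re

def regex_match(text: str, pattern: str, negate: bool = False) -> bool:
    """Pass if the pattern is found (or NOT found, when negate is True)."""
    found = re.search(pattern, text) is not None
    return not found if negate else found

response = "Filed as TICKET-10482; our team will follow up shortly."

has_ticket = regex_match(response, r"TICKET-\d{4,}")              # must match
no_raw_sql = regex_match(response, r"SELECT.*FROM", negate=True)  # must not match
```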
## Pattern 6: The Customer Support Stack
Use case: Customer-facing support agents.
This is the most common production pattern. It combines all three layers:
```yaml
# Layer 1: Guardrails (instant, free)
CONTAINS: { value: ["I don't know", "as an AI"], negate: true }
REGEX_MATCH: { pattern: "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b", negate: true }

# Layer 2: Quality (fast, free)
LATENCY: { max_ms: 8000, target_ms: 3000 }
COST: { max_cost_usd: 0.10, target_cost_usd: 0.03 }

# Layer 3: Semantic ($0.001/eval)
LLM_JUDGE: {
  criteria: "Does the response directly address the customer's question? Is it empathetic and actionable?",
  scale: "1-5"
}
```
## Pattern 7: The CI/CD Gate
Use case: Teams that want to block deploys on quality regressions.
Run evaluators against a test dataset in your CI pipeline using the 2signal CLI. Fail the build if scores drop below thresholds.
```yaml
# .github/workflows/eval.yml
- name: Run evaluation suite
  run: |
    2signal eval run \
      --project my-agent \
      --dataset golden-tests \
      --evaluators contains,json_schema,llm_judge \
      --fail-below 0.85
```
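Conceptually, `--fail-below` aggregates per-case scores and returns a nonzero exit code when the aggregate misses the threshold, which is what fails the CI job. A sketch of that gate — the mean aggregation and score format here are assumptions, not the CLI's documented behavior:

```python
def gate(scores: list[float], fail_below: float = 0.85) -> int:
    """Return a process exit code: 0 if the mean score clears the bar, 1 if not."""
    mean = sum(scores) / len(scores)
    print(f"mean score {mean:.3f} (threshold {fail_below})")
    return 0 if mean >= fail_below else 1

# Four golden-test cases averaging 0.89 -> the build passes
exit_code = gate([0.92, 0.88, 0.95, 0.81])
```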
## Choosing the Right Pattern
| Scenario | Recommended Pattern |
|---|---|
| Customer-facing chatbot | Pattern 6 (Support Stack) |
| API that returns JSON | Pattern 2 (Structured Output) |
| High-volume, cost-sensitive | Pattern 3 (Budget Enforcer) |
| CI/CD pipeline | Pattern 7 (CI/CD Gate) |
| Content generation | Pattern 1 (Safety Net) |
| Known-answer testing | Pattern 4 (Regression Detector) |
| Strict format requirements | Pattern 5 (Format Police) |
Ready to set up your evaluation pipeline? Start with the evaluators overview or jump to a specific evaluator: LLM Judge, Contains, JSON Schema.