Running a single evaluator on your traces is like having a single test for your entire application. It might catch the obvious failures, but the subtle ones — the ones that actually hurt users — slip right through.
The real power of evaluation comes from combining multiple evaluators into a pipeline. Here are seven patterns we see in production deployments.
## Pattern 1: The Safety Net
Use case: Any agent that generates user-facing text.
Combine CONTAINS (negated) with LLM_JUDGE. The Contains evaluator runs first as a fast, cheap check for forbidden content — competitor names, profanity, confidential terms. The LLM Judge runs second to evaluate tone, helpfulness, and accuracy.
```yaml
# Fast guardrail: block forbidden content
CONTAINS: { value: ["competitor_name", "confidential"], negate: true }

# Semantic quality check
LLM_JUDGE: { criteria: "Is the response helpful, accurate, and professional?" }
```
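The point of the ordering is short-circuiting: the cheap check runs first, and the expensive judge only runs if it passes. In application code that logic looks roughly like this — `safety_net` and the `judge` stub are a hypothetical sketch, not part of the 2signal SDK:

```python
# Sketch of the two-stage safety net. In production, `judge` would wrap an
# LLM call; a stub lambda keeps this example runnable.

def safety_net(response: str, forbidden: list[str], judge) -> dict:
    """Run the cheap guardrail first; only call the LLM judge if it passes."""
    # Stage 1: CONTAINS with negate=true -- fail fast if any forbidden term appears
    hits = [term for term in forbidden if term.lower() in response.lower()]
    if hits:
        return {"passed": False, "stage": "contains", "hits": hits}
    # Stage 2: semantic quality check (expensive, so it runs last)
    return {"passed": judge(response), "stage": "llm_judge", "hits": []}

result = safety_net("Our product beats competitor_name!", ["competitor_name"],
                    judge=lambda r: True)
# result["stage"] is "contains" -- the judge was never invoked
```

Because the guardrail vetoes before the judge runs, a forbidden-content failure costs nothing beyond a string scan.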
## Pattern 2: The Structured Output Validator
Use case: Agents that return JSON, tool calls, or structured data.
Chain JSON_SCHEMA with CONTAINS. First validate the structure is correct, then check that required values are present in the response.
```yaml
# Validate structure
JSON_SCHEMA: {
  schema: {
    type: "object",
    required: ["action", "reasoning"],
    properties: {
      action: { type: "string", enum: ["approve", "reject", "escalate"] },
      reasoning: { type: "string", minLength: 20 }
    }
  }
}

# Validate content
CONTAINS: { value: ["action"], mode: "all" }
```
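To see what the JSON_SCHEMA evaluator is checking, here is a minimal stdlib sketch that enforces the same three constraints from the schema above (required fields, the `action` enum, and `minLength` on `reasoning`). In production the evaluator does this for you; `validate_output` is illustrative only:

```python
import json

SCHEMA_ENUM = {"approve", "reject", "escalate"}

def validate_output(raw: str) -> list[str]:
    """Return a list of validation errors (empty means the output passed)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    errors = []
    for field in ("action", "reasoning"):       # required: ["action", "reasoning"]
        if field not in data:
            errors.append(f"missing required field: {field}")
    if data.get("action") not in SCHEMA_ENUM:   # enum constraint on action
        errors.append("action must be approve/reject/escalate")
    if len(data.get("reasoning", "")) < 20:     # minLength: 20 on reasoning
        errors.append("reasoning too short (minLength 20)")
    return errors

good = '{"action": "approve", "reasoning": "Meets all policy requirements."}'
bad = '{"action": "maybe"}'
```

Note that a response can be syntactically valid JSON and still fail every semantic constraint — which is exactly why structure and content are validated as separate steps.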
## Pattern 3: The Budget Enforcer
Use case: High-volume agents where cost matters.
Combine COST and LATENCY to catch performance regressions before they become expensive. Set tight thresholds based on your SLA.
```yaml
# Per-request cost ceiling
COST: { max_cost_usd: 0.05, target_cost_usd: 0.02 }

# Response time SLA
LATENCY: { max_ms: 5000, target_ms: 2000 }
```
When either evaluator starts producing low scores, you know your agent is drifting — whether from prompt bloat, unnecessary tool calls, or model changes.
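One plausible way to turn a target/max pair into a score — full marks at or below target, zero at or above max, linear in between — can be sketched as follows. The exact scoring curve 2signal uses isn't specified here, so treat this as an assumption:

```python
def threshold_score(value: float, target: float, maximum: float) -> float:
    """1.0 at or below target, 0.0 at or above maximum, linear in between."""
    if value <= target:
        return 1.0
    if value >= maximum:
        return 0.0
    return (maximum - value) / (maximum - target)

# COST config above: a $0.035 request lands halfway between target and max
cost_score = threshold_score(0.035, target=0.02, maximum=0.05)
# LATENCY config above: a 3500 ms response likewise scores 0.5
latency_score = threshold_score(3500, target=2000, maximum=5000)
```

A linear ramp like this gives you early warning: scores degrade gradually as costs creep toward the ceiling, rather than flipping from pass to fail at a single threshold.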
## Pattern 4: The Regression Detector
Use case: Agents with known expected outputs (golden datasets).
Use SIMILARITY against a reference dataset. When similarity drops below your threshold, something has changed in your agent's behavior.
```yaml
# Compare against golden outputs
SIMILARITY: { threshold: 0.8, tokenizer: "word" }

# Provide expected outputs via datasets
# Dashboard → Datasets → Create → Add test cases
```
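The exact similarity metric behind the SIMILARITY evaluator isn't documented here, but a word-tokenized ratio via Python's stdlib `difflib` illustrates the idea of comparing an output against its golden reference:

```python
import difflib

def word_similarity(candidate: str, reference: str) -> float:
    """Word-tokenized similarity ratio in [0, 1] (difflib gestalt matching)."""
    return difflib.SequenceMatcher(
        None, candidate.lower().split(), reference.lower().split()
    ).ratio()

reference = "Your refund has been processed and will arrive in 5 business days"
drifted = "Your refund was processed and should arrive within 7 to 10 days"

score = word_similarity(drifted, reference)
passed = score >= 0.8   # the threshold from the config above
```

A paraphrase that preserves meaning but reshuffles wording will still lower a token-level score, so tune the threshold against a few known-good variants before relying on it as a regression signal.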
## Pattern 5: The Format Police
Use case: Agents that need to follow specific output formats.
Use REGEX_MATCH to validate format patterns — dates, IDs, URLs, code blocks, or any structured text.
```yaml
# Must include a ticket ID
REGEX_MATCH: { pattern: "TICKET-\\d{4,}" }

# Must NOT include raw SQL
REGEX_MATCH: { pattern: "SELECT.*FROM", negate: true }
```
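The negate flag simply inverts the match result. Here is the equivalent check in plain Python `re`, with both patterns from the config applied to one response — `regex_match` is a hypothetical helper, not the evaluator's actual implementation:

```python
import re

def regex_match(text: str, pattern: str, negate: bool = False) -> bool:
    """Pass if the pattern is found (or NOT found, when negate is True)."""
    found = re.search(pattern, text) is not None
    return not found if negate else found

response = "Filed as TICKET-10482; our team will follow up shortly."

has_ticket = regex_match(response, r"TICKET-\d{4,}")              # must match
no_raw_sql = regex_match(response, r"SELECT.*FROM", negate=True)  # must not match
```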
## Pattern 6: The Customer Support Stack
Use case: Customer-facing support agents.
This is the most common production pattern. It combines all three layers:
```yaml
# Layer 1: Guardrails (instant, free)
CONTAINS: { value: ["I don't know", "as an AI"], negate: true }
REGEX_MATCH: { pattern: "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b", negate: true }

# Layer 2: Quality (fast, free)
LATENCY: { max_ms: 8000, target_ms: 3000 }
COST: { max_cost_usd: 0.10, target_cost_usd: 0.03 }

# Layer 3: Semantic ($0.001/eval)
LLM_JUDGE: {
  criteria: "Does the response directly address the customer's question? Is it empathetic and actionable?",
  scale: "1-5"
}
```
## Pattern 7: The CI/CD Gate
Use case: Teams that want to block deploys on quality regressions.
Run evaluators against a test dataset in your CI pipeline using the 2signal CLI. Fail the build if scores drop below thresholds.
```yaml
# .github/workflows/eval.yml
- name: Run evaluation suite
  run: |
    2signal eval run \
      --project my-agent \
      --dataset golden-tests \
      --evaluators contains,json_schema,llm_judge \
      --fail-below 0.85
```
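Conceptually, `--fail-below` aggregates per-case scores and returns a nonzero exit code when the aggregate misses the threshold, which is what fails the CI job. A sketch of that gate — the mean aggregation and score format here are assumptions, not the CLI's documented behavior:

```python
def gate(scores: list[float], fail_below: float = 0.85) -> int:
    """Return a process exit code: 0 if the mean score clears the bar, 1 if not."""
    mean = sum(scores) / len(scores)
    print(f"mean score {mean:.3f} (threshold {fail_below})")
    return 0 if mean >= fail_below else 1

# Four golden-test cases averaging 0.89 -> the build passes
exit_code = gate([0.92, 0.88, 0.95, 0.81])
```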
## Choosing the Right Pattern
| Scenario | Recommended Pattern |
|---|---|
| Customer-facing chatbot | Pattern 6 (Support Stack) |
| API that returns JSON | Pattern 2 (Structured Output) |
| High-volume, cost-sensitive | Pattern 3 (Budget Enforcer) |
| CI/CD pipeline | Pattern 7 (CI/CD Gate) |
| Content generation | Pattern 1 (Safety Net) |
| Known-answer testing | Pattern 4 (Regression Detector) |
| Strict format requirements | Pattern 5 (Format Police) |
Ready to set up your evaluation pipeline? Start with the evaluators overview or jump to a specific evaluator: LLM Judge, Contains, JSON Schema.