Evaluators

Evaluators score your traces automatically after ingestion. Configure them in the dashboard per project — they run asynchronously via background workers and never slow down your agent.

Built-in Evaluators

Deterministic

| Type | Description | Score |
|------|-------------|-------|
| Contains | Check if output contains strings | 0 or 1 |
| Regex Match | Match output against regex | 0 or 1 |
| JSON Schema | Validate output structure | 0 or 1 |
| Exact Match | Exact string comparison | 0 or 1 |
| Starts With | Check output prefix | 0 or 1 |
| Ends With | Check output suffix | 0 or 1 |
| Length | Output length constraints | 0 or 1 |
| Levenshtein | Edit distance similarity | 0–1 |
| Similarity | TF-IDF cosine similarity | 0–1 |
| Latency | Response time scoring | 0–1 |
| Cost | Per-trace cost scoring | 0–1 |
| Response Time SLA | Latency SLA enforcement | 0 or 1 |
| Sentiment | Keyword-based sentiment | 0–1 |
| Toxicity | Blocklist-based toxicity check | 0 or 1 |
| Tool Call Validation | Validate tool/function calls | 0 or 1 |

LLM-based

| Type | Description | Score |
|------|-------------|-------|
| LLM Judge | LLM-scored output quality | 0–1 |
| Groundedness | Hallucination detection | 0–1 |
| Prompt Injection | Injection attack detection | 0 or 1 |
| Bias Detection | Bias across demographics | 0–1 |
| Compliance Check | Regulatory compliance | 0–1 |
| Factual Accuracy | Fact verification | 0–1 |
| PII Detection | PII leak detection | 0 or 1 |
| Workflow Adherence | Workflow step validation | 0–1 |

External

| Type | Description | Score |
|------|-------------|-------|
| Webhook | External HTTP evaluator endpoint | 0–1 |
| Custom | Your own logic via POST /api/v1/scores | 0–1 |

How Evaluators Work

  1. You configure evaluators in the dashboard (name, type, config)
  2. When a trace is ingested, all enabled evaluators for that project are triggered
  3. Evaluators run asynchronously in BullMQ workers
  4. Each evaluator produces a score (0–1) with an optional label and reasoning
  5. Scores appear on the trace in the dashboard
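
Step 4 can be sketched as a pure scoring function. The sketch below shows a deterministic Contains evaluator returning a score, an optional label, and optional reasoning; the function name and payload shapes are hypothetical, not the actual worker code:

```python
def run_contains_evaluator(output: str, config: dict) -> dict:
    """Hypothetical sketch of a deterministic Contains evaluator.

    Returns a score dict: a 0-1 score plus optional label and reasoning.
    """
    case_sensitive = config.get("case_sensitive", True)
    haystack = output if case_sensitive else output.lower()
    needle = config["value"] if case_sensitive else config["value"].lower()
    found = needle in haystack
    return {
        "score": 1.0 if found else 0.0,
        "label": "pass" if found else "fail",
        "reasoning": f'"{config["value"]}" was {"" if found else "not "}found in the output',
    }
```

The real evaluators run inside workers and read the trace from storage, but the scoring logic itself is a deterministic function of the output and the evaluator config.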

Score Format

Every evaluator returns:

| Field | Type | Description |
|-------|------|-------------|
| score | number | 0–1 (0 = fail, 1 = perfect) |
| label | string | Optional: "pass" or "fail" |
| reasoning | string | Optional: explanation |
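
For example, a passing score from a Contains evaluator might look like:

```json
{
  "score": 1,
  "label": "pass",
  "reasoning": "Output contains \"refund policy\""
}
```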

Creating an Evaluator

In the dashboard, go to your project → Evaluators → Create. Select the evaluator type and configure it. Example for a Contains evaluator:

{
  "name": "mentions-refund-policy",
  "type": "CONTAINS",
  "config": {
    "value": "refund policy",
    "case_sensitive": false
  }
}

Custom Evaluators via API

If none of the built-in evaluators fit your use case, you can run your own evaluation logic externally and submit scores via the REST API. This is useful for custom ML models, domain-specific heuristics, or any scoring pipeline you manage yourself.

Send a POST request to /api/v1/scores with your API key and the score payload. Each score is attached to a specific trace by its ID.

import httpx

client = httpx.Client(
    base_url="https://app.2signal.dev",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

# Submit a custom score for a trace
response = client.post("/api/v1/scores", json={
    "trace_id": "abc-123-def-456",
    "name": "domain-accuracy",
    "score": 0.92,
    "label": "pass",
    "reasoning": "Output correctly referenced 11 of 12 domain entities."
})

print(response.json())
# {"data": {"id": "score-789", "trace_id": "abc-123-def-456", ...}, "error": null}

You can call this endpoint from CI pipelines, post-processing scripts, or any service that has access to your traces. The score will appear alongside built-in evaluator scores in the dashboard.

The response follows the standard 2Signal format: { data: ..., error: null } on success, or { data: null, error: { message, code } } on failure. Common status codes are 201 (created), 400 (validation error), 401 (invalid API key), and 429 (rate limited).

Evaluator Lifecycle

Every evaluator follows a five-stage lifecycle from creation to results:

  1. Create — Define the evaluator in the dashboard (or via API). Give it a name, select a type, and provide configuration. The evaluator is saved but not yet active.
  2. Enable — Toggle the evaluator on for a project. Only enabled evaluators are triggered when traces arrive. You can disable an evaluator at any time without deleting it, which preserves historical scores.
  3. Trigger — When a new trace is ingested, the trace-writer worker checks for all enabled evaluators on that project and enqueues an eval job per evaluator into the BullMQ queue.
  4. Score — The eval-runner worker picks up the job, runs the evaluator logic against the trace data, and produces a score (0–1), an optional label, and optional reasoning. The score is persisted to PostgreSQL via the score-writer.
  5. View — Scores appear on the trace detail page in the dashboard. You can filter traces by evaluator scores, track score distributions over time, and set up alerts when scores drop below thresholds.

Disabling an evaluator stops it from running on new traces but does not delete existing scores. Re-enabling it will only score traces ingested after re-enablement — it does not backfill.
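
The Trigger stage is essentially a fan-out: one job per enabled evaluator on the project. This illustrative sketch uses hypothetical field names and builds the job payloads in memory rather than enqueuing them into BullMQ:

```python
def jobs_for_trace(trace_id: str, evaluators: list[dict]) -> list[dict]:
    """Fan out one eval job per enabled evaluator on the project.

    Disabled evaluators are skipped, which is why toggling an
    evaluator off stops it from running on new traces.
    """
    return [
        {"trace_id": trace_id, "evaluator_id": ev["id"], "type": ev["type"]}
        for ev in evaluators
        if ev.get("enabled")
    ]
```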

Choosing the Right Evaluator

Use this guide to pick the evaluator that best fits your needs. You can (and should) combine multiple evaluators for comprehensive coverage.

| If you need to... | Use | Why |
|-------------------|-----|-----|
| Assess overall output quality or tone | LLM Judge | Only an LLM can evaluate nuanced quality, helpfulness, or adherence to style guidelines |
| Verify specific keywords or phrases appear in output | Contains | Fast, deterministic check — ideal for compliance phrases, disclaimers, or required terms |
| Validate output matches a pattern (emails, IDs, formats) | Regex Match | Flexible pattern matching for structured substrings without requiring full schema validation |
| Ensure output is valid JSON with a specific structure | JSON Schema | Guarantees your agent returns properly structured data — critical for tool-use agents |
| Compare output against a reference answer | Similarity | TF-IDF cosine similarity gives a 0–1 score without needing an LLM call |
| Enforce response time SLAs | Latency | Score traces based on how fast they complete — flag slow responses automatically |
| Track and cap per-trace spending | Cost | Scores based on token cost — catch expensive traces before they blow your budget |
| Run custom domain-specific logic | Custom (via API) | Submit scores from your own pipeline using POST /api/v1/scores |

Performance

Evaluators are designed to add zero latency to your trace ingestion pipeline. When a trace is submitted via POST /api/v1/traces, the API returns immediately after persisting the raw data to S3. Evaluators run asynchronously in background workers — your agent never waits for scoring.

Execution times vary by evaluator type:

| Category | Evaluators | Typical Latency |
|----------|------------|-----------------|
| Deterministic | Contains, Regex Match, JSON Schema, Latency, Cost | <1ms per trace |
| Computational | Similarity (TF-IDF) | 5–50ms per trace (depends on output length) |
| LLM-based | LLM Judge | 1–5s per trace (depends on model and prompt length) |

Since all evaluators run in BullMQ workers, even LLM Judge latency has no impact on your ingestion throughput. Workers process eval jobs concurrently, so multiple evaluators on the same trace run in parallel.
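
To see why the computational tier sits between the other two, here is a bare-bones cosine similarity over raw term counts. This is a deliberate simplification: the actual Similarity evaluator applies TF-IDF weighting before computing the cosine, but the shape of the work is the same:

```python
import math
from collections import Counter


def cosine_score(output: str, reference: str) -> float:
    """Cosine similarity over raw term counts, in [0, 1].

    Simplified sketch — the real Similarity evaluator weights
    terms with TF-IDF before taking the cosine.
    """
    a = Counter(output.lower().split())
    b = Counter(reference.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

The cost scales with output length (tokenizing and counting both texts), which is why its latency, unlike the constant-time deterministic checks, varies from 5 to 50ms.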

Evaluator Combinations

The most effective setups use multiple evaluators together. Each evaluator covers a different dimension of quality, giving you comprehensive visibility into your agent's behavior.

Here is a recommended combination for a customer-support agent:

// Project evaluator setup — customer support agent

// 1. Structure: ensure the agent returns valid JSON for downstream systems
{
  "name": "valid-response-format",
  "type": "JSON_SCHEMA",
  "config": {
    "schema": {
      "type": "object",
      "required": ["answer", "confidence", "sources"],
      "properties": {
        "answer": { "type": "string" },
        "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
        "sources": { "type": "array", "items": { "type": "string" } }
      }
    }
  }
}

// 2. Compliance: check that required disclaimers are present
{
  "name": "includes-disclaimer",
  "type": "CONTAINS",
  "config": {
    "value": "This is not legal advice",
    "case_sensitive": false
  }
}

// 3. Cost: flag traces that cost more than $0.10
{
  "name": "cost-guard",
  "type": "COST",
  "config": {
    "max_cost": 0.10
  }
}

// 4. Latency: flag traces slower than 5 seconds
{
  "name": "latency-sla",
  "type": "LATENCY",
  "config": {
    "max_latency_ms": 5000
  }
}

// 5. Quality: LLM Judge for overall helpfulness
{
  "name": "helpfulness",
  "type": "LLM_JUDGE",
  "config": {
    "prompt": "Rate the helpfulness of this customer support response. Score 1 if the answer is accurate, complete, and actionable. Score 0 if it is vague, incorrect, or unhelpful.",
    "model": "gpt-4o"
  }
}

With this setup, every trace gets five scores. You can then filter in the dashboard for traces where any evaluator scored below a threshold — for example, all traces where helpfulness < 0.7 or cost-guard = 0 — to quickly find problem areas.
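
The same threshold filter can be expressed programmatically if you pull scores via the API. The score names below match the example setup above; the helper itself is hypothetical:

```python
def needs_review(scores: dict[str, float]) -> bool:
    """Flag a trace when any evaluator falls below its threshold.

    Missing scores default to passing values, so a trace is flagged
    only on evidence, never on the absence of a score.
    """
    return scores.get("helpfulness", 1.0) < 0.7 or scores.get("cost-guard", 1.0) == 0.0
```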

Managing Evaluators

You can manage evaluators through the dashboard, CLI, or TUI. All three surfaces support creating, listing, enabling, disabling, and deleting evaluators.

Dashboard

Navigate to your project, then click Evaluators in the sidebar. Use the Create button to add a new evaluator, or toggle the Enabled switch on existing ones. Click an evaluator name to edit its configuration.

CLI

# List all evaluators for a project
2signal evaluators list --project my-project

# Create a new evaluator
2signal evaluators create --project my-project \
  --name "output-check" \
  --type CONTAINS \
  --config '{"value": "thank you", "case_sensitive": false}'

# Enable/disable an evaluator
2signal evaluators enable --project my-project --name "output-check"
2signal evaluators disable --project my-project --name "output-check"

# Delete an evaluator (scores are preserved)
2signal evaluators delete --project my-project --name "output-check"

TUI

Launch the TUI with 2signal tui, navigate to the Evaluators panel, and use the keyboard shortcuts displayed at the bottom of the screen. The TUI provides a real-time view of evaluator status and recent scores.

Limits

The following limits apply to evaluators. These are per-project unless otherwise noted.

| Resource | Free Tier | Pro Tier | Enterprise |
|----------|-----------|----------|------------|
| Evaluators per project | 5 | 25 | Unlimited |
| Concurrent eval jobs (per project) | 2 | 10 | 50 |
| LLM Judge evaluators per project | 1 | 5 | Unlimited |
| Max eval config size | 10 KB | | |
| Score retention | 7 days | 90 days | Unlimited |
| Custom scores via API (per month) | 1,000 | 50,000 | Unlimited |

If you exceed the evaluator limit for your tier, new evaluator creation will return a 400 error with the message evaluator_limit_reached. Upgrade your plan or disable unused evaluators to free up slots.
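
If you want to pre-check the limit client-side before creating evaluators, a sketch based on the table above might look like this (the tier keys and helper name are hypothetical; Enterprise is modeled as unlimited):

```python
# Per-project evaluator limits from the table above; None = unlimited.
EVALUATOR_LIMITS = {"free": 5, "pro": 25, "enterprise": None}


def can_create_evaluator(current_count: int, tier: str) -> bool:
    """Return True if the project still has a free evaluator slot on this tier."""
    limit = EVALUATOR_LIMITS[tier]
    return limit is None or current_count < limit
```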
