Evaluators
Evaluators score your traces automatically after ingestion. Configure them in the dashboard per project — they run asynchronously via background workers and never slow down your agent.
Built-in Evaluators
Deterministic
| Type | Description | Score |
|---|---|---|
| Contains | Check if output contains strings | 0 or 1 |
| Regex Match | Match output against regex | 0 or 1 |
| JSON Schema | Validate output structure | 0 or 1 |
| Exact Match | Exact string comparison | 0 or 1 |
| Starts With | Check output prefix | 0 or 1 |
| Ends With | Check output suffix | 0 or 1 |
| Length | Output length constraints | 0 or 1 |
| Levenshtein | Edit distance similarity | 0–1 |
| Similarity | TF-IDF cosine similarity | 0–1 |
| Latency | Response time scoring | 0–1 |
| Cost | Per-trace cost scoring | 0–1 |
| Response Time SLA | Latency SLA enforcement | 0 or 1 |
| Sentiment | Keyword-based sentiment | 0–1 |
| Toxicity | Blocklist-based toxicity check | 0 or 1 |
| Tool Call Validation | Validate tool/function calls | 0 or 1 |
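To illustrate how a distance-based evaluator can map onto the 0–1 range, here is a minimal sketch of normalized Levenshtein similarity. The normalization formula is an assumption for illustration; 2Signal's actual implementation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def levenshtein_score(output: str, reference: str) -> float:
    """Map edit distance to a 0-1 similarity score (1 = identical)."""
    if not output and not reference:
        return 1.0
    dist = levenshtein(output, reference)
    return 1.0 - dist / max(len(output), len(reference))
```

A score of 1 means the output matches the reference exactly; scores fall toward 0 as the number of edits approaches the length of the longer string.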
LLM-based
| Type | Description | Score |
|---|---|---|
| LLM Judge | LLM-scored output quality | 0–1 |
| Groundedness | Hallucination detection | 0–1 |
| Prompt Injection | Injection attack detection | 0 or 1 |
| Bias Detection | Bias across demographics | 0–1 |
| Compliance Check | Regulatory compliance | 0–1 |
| Factual Accuracy | Fact verification | 0–1 |
| PII Detection | PII leak detection | 0 or 1 |
| Workflow Adherence | Workflow step validation | 0–1 |
External
| Type | Description | Score |
|---|---|---|
| Webhook | External HTTP evaluator endpoint | 0–1 |
| Custom | Your own logic via POST /api/v1/scores | 0–1 |
How Evaluators Work
- You configure evaluators in the dashboard (name, type, config)
- When a trace is ingested, all enabled evaluators for that project are triggered
- Evaluators run asynchronously in BullMQ workers
- Each evaluator produces a score (0–1) with an optional label and reasoning
- Scores appear on the trace in the dashboard
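The scoring step can be sketched in Python. This is an illustrative stand-in, not 2Signal's actual worker code; the config keys mirror the Contains example later on this page.

```python
def run_contains_evaluator(config: dict, output: str) -> dict:
    """Illustrative Contains evaluator: score 1 if the configured value appears."""
    needle = config["value"]
    haystack = output
    if not config.get("case_sensitive", True):
        needle, haystack = needle.lower(), haystack.lower()
    found = needle in haystack
    verdict = "found" if found else "not found"
    return {
        "score": 1.0 if found else 0.0,
        "label": "pass" if found else "fail",
        "reasoning": f'"{config["value"]}" {verdict} in output',
    }
```

The returned dictionary matches the score format described below: a numeric score plus an optional label and reasoning.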
Score Format
Every evaluator returns:
| Field | Type | Description |
|---|---|---|
| score | number | 0–1 (0 = fail, 1 = perfect) |
| label | string | Optional: "pass" or "fail" |
| reasoning | string | Optional: explanation |
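For example, a score object from a Contains-style evaluator might look like this (values are illustrative):

```json
{
  "score": 1.0,
  "label": "pass",
  "reasoning": "Output contained the required phrase \"refund policy\"."
}
```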
Creating an Evaluator
In the dashboard, go to your project → Evaluators → Create. Select the evaluator type and configure it. Example for a Contains evaluator:
```json
{
  "name": "mentions-refund-policy",
  "type": "CONTAINS",
  "config": {
    "value": "refund policy",
    "case_sensitive": false
  }
}
```
Custom Evaluators via API
If none of the built-in evaluators fit your use case, you can run your own evaluation logic externally and submit scores via the REST API. This is useful for custom ML models, domain-specific heuristics, or any scoring pipeline you manage yourself.
Send a POST request to /api/v1/scores with your API key and the score payload. Each score is attached to a specific trace by its ID.
```python
import httpx

client = httpx.Client(
    base_url="https://app.2signal.dev",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

# Submit a custom score for a trace
response = client.post("/api/v1/scores", json={
    "trace_id": "abc-123-def-456",
    "name": "domain-accuracy",
    "score": 0.92,
    "label": "pass",
    "reasoning": "Output correctly referenced 11 of 12 domain entities."
})
print(response.json())
# {"data": {"id": "score-789", "trace_id": "abc-123-def-456", ...}, "error": null}
```
You can call this endpoint from CI pipelines, post-processing scripts, or any service that has access to your traces. The score will appear alongside built-in evaluator scores in the dashboard.
The response follows the standard 2Signal format: { data: ..., error: null } on success, or { data: null, error: { message, code } } on failure. Common status codes are 201 (created), 400 (validation error), 401 (invalid API key), and 429 (rate limited).
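If you are scripting against the API, the envelope is easy to unwrap. A small helper, assuming only the { data, error } shape described above (the raise-on-error behavior is our own choice, not part of the API):

```python
def unwrap(payload: dict) -> dict:
    """Return `data` from a 2Signal response envelope, raising on `error`."""
    if payload.get("error") is not None:
        err = payload["error"]
        raise RuntimeError(f'{err.get("code")}: {err.get("message")}')
    return payload["data"]
```

Use it as `unwrap(response.json())` after any call, so every code path either gets valid data or a clear exception.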
Evaluator Lifecycle
Every evaluator follows a five-stage lifecycle from creation to results:
- Create — Define the evaluator in the dashboard (or via API). Give it a name, select a type, and provide configuration. The evaluator is saved but not yet active.
- Enable — Toggle the evaluator on for a project. Only enabled evaluators are triggered when traces arrive. You can disable an evaluator at any time without deleting it, which preserves historical scores.
- Trigger — When a new trace is ingested, the trace-writer worker checks for all enabled evaluators on that project and enqueues an eval job per evaluator into the BullMQ queue.
- Score — The eval-runner worker picks up the job, runs the evaluator logic against the trace data, and produces a score (0–1), an optional label, and optional reasoning. The score is persisted to PostgreSQL via the score-writer.
- View — Scores appear on the trace detail page in the dashboard. You can filter traces by evaluator scores, track score distributions over time, and set up alerts when scores drop below thresholds.
Disabling an evaluator stops it from running on new traces but does not delete existing scores. Re-enabling it will only score traces ingested after re-enablement — it does not backfill.
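The Enable and Trigger stages amount to a filter-and-enqueue step. A minimal sketch of that logic (the evaluator and job shapes here are illustrative, not 2Signal's internal schema):

```python
def jobs_for_trace(trace_id: str, evaluators: list[dict]) -> list[dict]:
    """Build one eval job per *enabled* evaluator, as in the Trigger stage."""
    return [
        {"trace_id": trace_id, "evaluator": ev["name"], "type": ev["type"]}
        for ev in evaluators
        if ev.get("enabled")
    ]
```

Disabled evaluators are simply skipped at enqueue time, which is why disabling one stops new scores without touching history.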
Choosing the Right Evaluator
Use this guide to pick the evaluator that best fits your needs. You can (and should) combine multiple evaluators for comprehensive coverage.
| If you need to... | Use | Why |
|---|---|---|
| Assess overall output quality or tone | LLM Judge | Only an LLM can evaluate nuanced quality, helpfulness, or adherence to style guidelines |
| Verify specific keywords or phrases appear in output | Contains | Fast, deterministic check — ideal for compliance phrases, disclaimers, or required terms |
| Validate output matches a pattern (emails, IDs, formats) | Regex Match | Flexible pattern matching for structured substrings without requiring full schema validation |
| Ensure output is valid JSON with a specific structure | JSON Schema | Guarantees your agent returns properly structured data — critical for tool-use agents |
| Compare output against a reference answer | Similarity | TF-IDF cosine similarity gives a 0–1 score without needing an LLM call |
| Enforce response time SLAs | Latency | Score traces based on how fast they complete — flag slow responses automatically |
| Track and cap per-trace spending | Cost | Scores based on token cost — catch expensive traces before they blow your budget |
| Run custom domain-specific logic | Custom (via API) | Submit scores from your own pipeline using POST /api/v1/scores |
Performance
Evaluators are designed to add zero latency to your trace ingestion pipeline. When a trace is submitted via POST /api/v1/traces, the API returns immediately after persisting the raw data to S3. Evaluators run asynchronously in background workers — your agent never waits for scoring.
Execution times vary by evaluator type:
| Category | Evaluators | Typical Latency |
|---|---|---|
| Deterministic | Contains, Regex Match, JSON Schema, Latency, Cost | <1ms per trace |
| Computational | Similarity (TF-IDF) | 5–50ms per trace (depends on output length) |
| LLM-based | LLM Judge | 1–5s per trace (depends on model and prompt length) |
Since all evaluators run in BullMQ workers, even LLM Judge latency has no impact on your ingestion throughput. Workers process eval jobs concurrently, so multiple evaluators on the same trace run in parallel.
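The same fan-out pattern is easy to reproduce in your own tooling. Here is a sketch that runs several independent scoring functions against one trace concurrently using a thread pool (illustrative only; 2Signal's workers use BullMQ, not Python threads):

```python
from concurrent.futures import ThreadPoolExecutor

def score_all(trace: dict, evaluators: dict) -> dict:
    """Run every evaluator function against the trace in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, trace) for name, fn in evaluators.items()}
        return {name: f.result() for name, f in futures.items()}

# Example: two cheap deterministic checks on one trace
trace = {"output": "Thank you! See our refund policy.", "latency_ms": 800}
scores = score_all(trace, {
    "contains-refund": lambda t: 1.0 if "refund policy" in t["output"] else 0.0,
    "latency-sla": lambda t: 1.0 if t["latency_ms"] <= 5000 else 0.0,
})
print(scores)  # {'contains-refund': 1.0, 'latency-sla': 1.0}
```

Because each evaluator only reads the trace, they can run in any order without coordination, which is what makes the parallel fan-out safe.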
Evaluator Combinations
The most effective setups use multiple evaluators together. Each evaluator covers a different dimension of quality, giving you comprehensive visibility into your agent's behavior.
Here is a recommended combination for a customer-support agent:
```jsonc
// Project evaluator setup — customer support agent

// 1. Structure: ensure the agent returns valid JSON for downstream systems
{
  "name": "valid-response-format",
  "type": "JSON_SCHEMA",
  "config": {
    "schema": {
      "type": "object",
      "required": ["answer", "confidence", "sources"],
      "properties": {
        "answer": { "type": "string" },
        "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
        "sources": { "type": "array", "items": { "type": "string" } }
      }
    }
  }
}

// 2. Compliance: check that required disclaimers are present
{
  "name": "includes-disclaimer",
  "type": "CONTAINS",
  "config": {
    "value": "This is not legal advice",
    "case_sensitive": false
  }
}

// 3. Cost: flag traces that cost more than $0.10
{
  "name": "cost-guard",
  "type": "COST",
  "config": {
    "max_cost": 0.10
  }
}

// 4. Latency: flag traces slower than 5 seconds
{
  "name": "latency-sla",
  "type": "LATENCY",
  "config": {
    "max_latency_ms": 5000
  }
}

// 5. Quality: LLM Judge for overall helpfulness
{
  "name": "helpfulness",
  "type": "LLM_JUDGE",
  "config": {
    "prompt": "Rate the helpfulness of this customer support response. Score 1 if the answer is accurate, complete, and actionable. Score 0 if it is vague, incorrect, or unhelpful.",
    "model": "gpt-4o"
  }
}
```
With this setup, every trace gets five scores. You can then filter in the dashboard for traces where any evaluator scored below a threshold — for example, all traces where helpfulness < 0.7 or cost-guard = 0 — to quickly find problem areas.
Managing Evaluators
You can manage evaluators through the dashboard, CLI, or TUI. All three surfaces support creating, listing, enabling, disabling, and deleting evaluators.
Dashboard
Navigate to your project, then click Evaluators in the sidebar. Use the Create button to add a new evaluator, or toggle the Enabled switch on existing ones. Click an evaluator name to edit its configuration.
CLI
```shell
# List all evaluators for a project
2signal evaluators list --project my-project

# Create a new evaluator
2signal evaluators create --project my-project \
  --name "output-check" \
  --type CONTAINS \
  --config '{"value": "thank you", "case_sensitive": false}'

# Enable/disable an evaluator
2signal evaluators enable --project my-project --name "output-check"
2signal evaluators disable --project my-project --name "output-check"

# Delete an evaluator (scores are preserved)
2signal evaluators delete --project my-project --name "output-check"
```
TUI
Launch the TUI with 2signal tui, navigate to the Evaluators panel, and use the keyboard shortcuts displayed at the bottom of the screen. The TUI provides a real-time view of evaluator status and recent scores.
Limits
The following limits apply to evaluators. These are per-project unless otherwise noted.
| Resource | Free Tier | Pro Tier | Enterprise |
|---|---|---|---|
| Evaluators per project | 5 | 25 | Unlimited |
| Concurrent eval jobs (per project) | 2 | 10 | 50 |
| LLM Judge evaluators per project | 1 | 5 | Unlimited |
| Max eval config size | 10 KB | 10 KB | 10 KB |
| Score retention | 7 days | 90 days | Unlimited |
| Custom scores via API (per month) | 1,000 | 50,000 | Unlimited |
If you exceed the evaluator limit for your tier, new evaluator creation will return a 400 error with the message evaluator_limit_reached. Upgrade your plan or disable unused evaluators to free up slots.
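If you create evaluators from automation (for example, a project-bootstrap script), you may want to detect this case explicitly. A sketch, assuming the { data, error } envelope and status codes described earlier on this page; the classification labels are our own:

```python
def classify_create_response(status: int, payload: dict) -> str:
    """Classify an evaluator-create response using the documented codes."""
    if status == 201:
        return "created"
    message = (payload.get("error") or {}).get("message", "")
    if status == 400 and message == "evaluator_limit_reached":
        return "limit-reached"  # upgrade the plan or disable unused evaluators
    if status == 401:
        return "bad-api-key"
    if status == 429:
        return "rate-limited"
    return "unknown-error"
```

Treating limit-reached as a distinct outcome lets a script fail with an actionable message instead of a generic 400 error.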