Evaluators
Evaluators score your traces automatically after ingestion. Configure them in the dashboard per project — they run asynchronously via background workers and never slow down your agent.
Built-in Evaluators
Deterministic
| Type | Description | Score |
|---|---|---|
| Contains | Check if output contains strings | 0 or 1 |
| Regex Match | Match output against regex | 0 or 1 |
| JSON Schema | Validate output structure | 0 or 1 |
| Exact Match | Exact string comparison | 0 or 1 |
| Starts With | Check output prefix | 0 or 1 |
| Ends With | Check output suffix | 0 or 1 |
| Length | Output length constraints | 0 or 1 |
| Levenshtein | Edit distance similarity | 0–1 |
| Similarity | TF-IDF cosine similarity | 0–1 |
| Latency | Response time scoring | 0–1 |
| Cost | Per-trace cost scoring | 0–1 |
| Response Time SLA | Latency SLA enforcement | 0 or 1 |
| Sentiment | Keyword-based sentiment | 0–1 |
| Toxicity | Blocklist-based toxicity check | 0 or 1 |
| Tool Call Validation | Validate tool/function calls | 0 or 1 |
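To illustrate how a distance-based evaluator can map onto the 0–1 range, here is a minimal sketch of normalized Levenshtein similarity. The normalization formula is an assumption for illustration; 2Signal's actual implementation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def levenshtein_score(output: str, reference: str) -> float:
    """Map edit distance to a 0-1 similarity score (1 = identical)."""
    if not output and not reference:
        return 1.0
    dist = levenshtein(output, reference)
    return 1.0 - dist / max(len(output), len(reference))
```

A score of 1 means the output matches the reference exactly; scores fall toward 0 as the number of edits approaches the length of the longer string.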
LLM-based
| Type | Description | Score |
|---|---|---|
| LLM Judge | LLM-scored output quality | 0–1 |
| Groundedness | Hallucination detection | 0–1 |
| Prompt Injection | Injection attack detection | 0 or 1 |
| Bias Detection | Bias across demographics | 0–1 |
| Compliance Check | Regulatory compliance | 0–1 |
| Factual Accuracy | Fact verification | 0–1 |
| PII Detection | PII leak detection | 0 or 1 |
| Workflow Adherence | Workflow step validation | 0–1 |
External
| Type | Description | Score |
|---|---|---|
| Webhook | External HTTP evaluator endpoint | 0–1 |
| Custom | Your own logic via POST /api/v1/scores | 0–1 |
How Evaluators Work
- You configure evaluators in the dashboard (name, type, config)
- When a trace is ingested, all enabled evaluators for that project are triggered
- Evaluators run asynchronously in BullMQ workers
- Each evaluator produces a score (0–1) with an optional label and reasoning
- Scores appear on the trace in the dashboard
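The scoring step can be sketched in Python. This is an illustrative stand-in, not 2Signal's actual worker code; the config keys mirror the Contains example later on this page.

```python
def run_contains_evaluator(config: dict, output: str) -> dict:
    """Illustrative Contains evaluator: score 1 if the configured value appears."""
    needle = config["value"]
    haystack = output
    if not config.get("case_sensitive", True):
        needle, haystack = needle.lower(), haystack.lower()
    found = needle in haystack
    verdict = "found" if found else "not found"
    return {
        "score": 1.0 if found else 0.0,
        "label": "pass" if found else "fail",
        "reasoning": f'"{config["value"]}" {verdict} in output',
    }
```

The returned dictionary matches the score format described below: a numeric score plus an optional label and reasoning.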
Score Format
Every evaluator returns:
| Field | Type | Description |
|---|---|---|
| score | number | 0–1 (0 = fail, 1 = perfect) |
| label | string | Optional: "pass" or "fail" |
| reasoning | string | Optional: explanation |
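For example, a score object from a Contains-style evaluator might look like this (values are illustrative):

```json
{
  "score": 1.0,
  "label": "pass",
  "reasoning": "Output contained the required phrase \"refund policy\"."
}
```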
Creating an Evaluator
In the dashboard, go to your project → Evaluators → Create. Select the evaluator type and configure it. Example for a Contains evaluator:
```json
{
  "name": "mentions-refund-policy",
  "type": "CONTAINS",
  "config": {
    "value": "refund policy",
    "case_sensitive": false
  }
}
```
Custom Evaluators via API
If none of the built-in evaluators fit your use case, you can run your own evaluation logic externally and submit scores via the REST API. This is useful for custom ML models, domain-specific heuristics, or any scoring pipeline you manage yourself.
Send a POST request to /api/v1/scores with your API key and the score payload. Each score is attached to a specific trace by its ID.
```python
import httpx

client = httpx.Client(
    base_url="https://app.2signal.dev",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

# Submit a custom score for a trace
response = client.post("/api/v1/scores", json={
    "trace_id": "abc-123-def-456",
    "name": "domain-accuracy",
    "score": 0.92,
    "label": "pass",
    "reasoning": "Output correctly referenced 11 of 12 domain entities."
})
print(response.json())
# {"data": {"id": "score-789", "trace_id": "abc-123-def-456", ...}, "error": null}
```
You can call this endpoint from CI pipelines, post-processing scripts, or any service that has access to your traces. The score will appear alongside built-in evaluator scores in the dashboard.
The response follows the standard 2Signal format: { data: ..., error: null } on success, or { data: null, error: { message, code } } on failure. Common status codes are 201 (created), 400 (validation error), 401 (invalid API key), and 429 (rate limited).
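If you are scripting against the API, the envelope is easy to unwrap. A small helper, assuming only the { data, error } shape described above (the raise-on-error behavior is our own choice, not part of the API):

```python
def unwrap(payload: dict) -> dict:
    """Return `data` from a 2Signal response envelope, raising on `error`."""
    if payload.get("error") is not None:
        err = payload["error"]
        raise RuntimeError(f'{err.get("code")}: {err.get("message")}')
    return payload["data"]
```

Use it as `unwrap(response.json())` after any call, so every code path either gets valid data or a clear exception.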
Evaluator Lifecycle
Every evaluator follows a five-stage lifecycle from creation to results:
- Create — Define the evaluator in the dashboard (or via API). Give it a name, select a type, and provide configuration. The evaluator is saved but not yet active.
- Enable — Toggle the evaluator on for a project. Only enabled evaluators are triggered when traces arrive. You can disable an evaluator at any time without deleting it, which preserves historical scores.
- Trigger — When a new trace is ingested, the trace-writer worker checks for all enabled evaluators on that project and enqueues an eval job per evaluator into the BullMQ queue.
- Score — The eval-runner worker picks up the job, runs the evaluator logic against the trace data, and produces a score (0–1), an optional label, and optional reasoning. The score is persisted to PostgreSQL via the score-writer.
- View — Scores appear on the trace detail page in the dashboard. You can filter traces by evaluator scores, track score distributions over time, and set up alerts when scores drop below thresholds.
Disabling an evaluator stops it from running on new traces but does not delete existing scores. Re-enabling it will only score traces ingested after re-enablement — it does not backfill.
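The Enable and Trigger stages amount to a filter-and-enqueue step. A minimal sketch of that logic (the evaluator and job shapes here are illustrative, not 2Signal's internal schema):

```python
def jobs_for_trace(trace_id: str, evaluators: list[dict]) -> list[dict]:
    """Build one eval job per *enabled* evaluator, as in the Trigger stage."""
    return [
        {"trace_id": trace_id, "evaluator": ev["name"], "type": ev["type"]}
        for ev in evaluators
        if ev.get("enabled")
    ]
```

Disabled evaluators are simply skipped at enqueue time, which is why disabling one stops new scores without touching history.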
Choosing the Right Evaluator
Use this guide to pick the evaluator that best fits your needs. You can (and should) combine multiple evaluators for comprehensive coverage.
| If you need to... | Use | Why |
|---|---|---|
| Assess overall output quality or tone | LLM Judge | Only an LLM can evaluate nuanced quality, helpfulness, or adherence to style guidelines |
| Verify specific keywords or phrases appear in output | Contains | Fast, deterministic check — ideal for compliance phrases, disclaimers, or required terms |
| Validate output matches a pattern (emails, IDs, formats) | Regex Match | Flexible pattern matching for structured substrings without requiring full schema validation |
| Ensure output is valid JSON with a specific structure | JSON Schema | Guarantees your agent returns properly structured data — critical for tool-use agents |
| Compare output against a reference answer | Similarity | TF-IDF cosine similarity gives a 0–1 score without needing an LLM call |
| Enforce response time SLAs | Latency | Score traces based on how fast they complete — flag slow responses automatically |
| Track and cap per-trace spending | Cost | Scores based on token cost — catch expensive traces before they blow your budget |
| Run custom domain-specific logic | Custom (via API) | Submit scores from your own pipeline using POST /api/v1/scores |
Performance
Evaluators are designed to add zero latency to your trace ingestion pipeline. When a trace is submitted via POST /api/v1/traces, the API returns immediately after persisting the raw data to S3. Evaluators run asynchronously in background workers — your agent never waits for scoring.
Execution times vary by evaluator type:
| Category | Evaluators | Typical Latency |
|---|---|---|
| Deterministic | Contains, Regex Match, JSON Schema, Latency, Cost | <1ms per trace |
| Computational | Similarity (TF-IDF) | 5–50ms per trace (depends on output length) |
| LLM-based | LLM Judge | 1–5s per trace (depends on model and prompt length) |
Since all evaluators run in BullMQ workers, even LLM Judge latency has no impact on your ingestion throughput. Workers process eval jobs concurrently, so multiple evaluators on the same trace run in parallel.
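The same fan-out pattern is easy to reproduce in your own tooling. Here is a sketch that runs several independent scoring functions against one trace concurrently using a thread pool (illustrative only; 2Signal's workers use BullMQ, not Python threads):

```python
from concurrent.futures import ThreadPoolExecutor

def score_all(trace: dict, evaluators: dict) -> dict:
    """Run every evaluator function against the trace in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, trace) for name, fn in evaluators.items()}
        return {name: f.result() for name, f in futures.items()}

# Example: two cheap deterministic checks on one trace
trace = {"output": "Thank you! See our refund policy.", "latency_ms": 800}
scores = score_all(trace, {
    "contains-refund": lambda t: 1.0 if "refund policy" in t["output"] else 0.0,
    "latency-sla": lambda t: 1.0 if t["latency_ms"] <= 5000 else 0.0,
})
print(scores)  # {'contains-refund': 1.0, 'latency-sla': 1.0}
```

Because each evaluator only reads the trace, they can run in any order without coordination, which is what makes the parallel fan-out safe.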
Evaluator Combinations
The most effective setups use multiple evaluators together. Each evaluator covers a different dimension of quality, giving you comprehensive visibility into your agent's behavior.
Here is a recommended combination for a customer-support agent:
```jsonc
// Project evaluator setup — customer support agent

// 1. Structure: ensure the agent returns valid JSON for downstream systems
{
  "name": "valid-response-format",
  "type": "JSON_SCHEMA",
  "config": {
    "schema": {
      "type": "object",
      "required": ["answer", "confidence", "sources"],
      "properties": {
        "answer": { "type": "string" },
        "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
        "sources": { "type": "array", "items": { "type": "string" } }
      }
    }
  }
}

// 2. Compliance: check that required disclaimers are present
{
  "name": "includes-disclaimer",
  "type": "CONTAINS",
  "config": {
    "value": "This is not legal advice",
    "case_sensitive": false
  }
}

// 3. Cost: flag traces that cost more than $0.10
{
  "name": "cost-guard",
  "type": "COST",
  "config": {
    "max_cost": 0.10
  }
}

// 4. Latency: flag traces slower than 5 seconds
{
  "name": "latency-sla",
  "type": "LATENCY",
  "config": {
    "max_latency_ms": 5000
  }
}

// 5. Quality: LLM Judge for overall helpfulness
{
  "name": "helpfulness",
  "type": "LLM_JUDGE",
  "config": {
    "prompt": "Rate the helpfulness of this customer support response. Score 1 if the answer is accurate, complete, and actionable. Score 0 if it is vague, incorrect, or unhelpful.",
    "model": "gpt-4o"
  }
}
```
With this setup, every trace gets five scores. You can then filter in the dashboard for traces where any evaluator scored below a threshold — for example, all traces where helpfulness < 0.7 or cost-guard = 0 — to quickly find problem areas.
Managing Evaluators
You can manage evaluators through the dashboard, CLI, or TUI. All three surfaces support creating, listing, enabling, disabling, and deleting evaluators.
Dashboard
Navigate to your project, then click Evaluators in the sidebar. Use the Create button to add a new evaluator, or toggle the Enabled switch on existing ones. Click an evaluator name to edit its configuration.
CLI
```shell
# List all evaluators for a project
2signal evaluators list --project my-project

# Create a new evaluator
2signal evaluators create --project my-project \
  --name "output-check" \
  --type CONTAINS \
  --config '{"value": "thank you", "case_sensitive": false}'

# Enable/disable an evaluator
2signal evaluators enable --project my-project --name "output-check"
2signal evaluators disable --project my-project --name "output-check"

# Delete an evaluator (scores are preserved)
2signal evaluators delete --project my-project --name "output-check"
```
TUI
Launch the TUI with 2signal tui, navigate to the Evaluators panel, and use the keyboard shortcuts displayed at the bottom of the screen. The TUI provides a real-time view of evaluator status and recent scores.
Limits
The following limits apply to evaluators. These are per-project unless otherwise noted.
| Resource | Free Tier | Pro Tier | Enterprise |
|---|---|---|---|
| Evaluators per project | 5 | 25 | Unlimited |
| Concurrent eval jobs (per project) | 2 | 10 | 50 |
| LLM Judge evaluators per project | 1 | 5 | Unlimited |
| Max eval config size | 10 KB | 10 KB | 10 KB |
| Score retention | 7 days | 90 days | Unlimited |
| Custom scores via API (per month) | 1,000 | 50,000 | Unlimited |
If you exceed the evaluator limit for your tier, new evaluator creation will return a 400 error with the message evaluator_limit_reached. Upgrade your plan or disable unused evaluators to free up slots.
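If you create evaluators from automation (for example, a project-bootstrap script), you may want to detect this case explicitly. A sketch, assuming the { data, error } envelope and status codes described earlier on this page; the classification labels are our own:

```python
def classify_create_response(status: int, payload: dict) -> str:
    """Classify an evaluator-create response using the documented codes."""
    if status == 201:
        return "created"
    message = (payload.get("error") or {}).get("message", "")
    if status == 400 and message == "evaluator_limit_reached":
        return "limit-reached"  # upgrade the plan or disable unused evaluators
    if status == 401:
        return "bad-api-key"
    if status == 429:
        return "rate-limited"
    return "unknown-error"
```

Treating limit-reached as a distinct outcome lets a script fail with an actionable message instead of a generic 400 error.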