Evaluate Agent Outputs
Evaluators let you automatically score every trace that flows through 2Signal. Once enabled, they run in the background on every new trace and surface pass/fail results and numeric scores on the trace detail page. This recipe walks through creating evaluators and explains how scoring works.
Step 1: Create an Evaluator via the Dashboard
Navigate to your project's Evaluators page and click New Evaluator. Choose an evaluator type from the dropdown — each type has its own configuration fields. Give it a descriptive name, fill in the config, and save. The evaluator immediately starts scoring new traces.
Step 2: Example Evaluator Configs
Below are four practical evaluators you can create right away. Each JSON block shows the config you would supply when creating the evaluator.
Check for Required Keywords (CONTAINS)
Verify that the agent's output mentions at least one of the expected terms. Useful for ensuring responses reference pricing, disclaimers, or other required content.
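The `mode` and `case_sensitive` fields map onto simple substring matching. The sketch below shows the presumed semantics; `contains_check` is an illustrative helper name of ours, not the 2Signal implementation:

```python
def contains_check(output, values, mode="any", case_sensitive=False):
    """Sketch of CONTAINS semantics: does `output` mention the expected terms?"""
    haystack = output if case_sensitive else output.lower()
    needles = values if case_sensitive else [v.lower() for v in values]
    hits = [needle in haystack for needle in needles]
    # "any" passes if at least one term appears; "all" requires every term.
    return all(hits) if mode == "all" else any(hits)
```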
```json
{
  "type": "CONTAINS",
  "name": "mentions-pricing",
  "config": {
    "value": ["pricing", "cost", "$"],
    "mode": "any",
    "case_sensitive": false
  }
}
```

Validate JSON Output (JSON_SCHEMA)
When your agent returns structured JSON, this evaluator validates it against a JSON Schema. The trace passes only if the output conforms to every constraint.
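To see exactly what the schema in the config below enforces, here is a hand-rolled Python equivalent of its constraints. The real evaluator presumably uses a full JSON Schema validator; `validate_response` is our illustrative name:

```python
def validate_response(output):
    """Check the same constraints the valid-response-format schema encodes."""
    errors = []
    # "required": both keys must be present
    for key in ("answer", "confidence"):
        if key not in output:
            errors.append(f"missing required key: {key}")
    # "answer": string with minLength 1
    if "answer" in output:
        if not isinstance(output["answer"], str):
            errors.append("answer: must be a string")
        elif len(output["answer"]) < 1:
            errors.append("answer: shorter than minLength 1")
    # "confidence": number in [0, 1]
    if "confidence" in output:
        c = output["confidence"]
        if not isinstance(c, (int, float)) or isinstance(c, bool):
            errors.append("confidence: must be a number")
        elif not 0 <= c <= 1:
            errors.append("confidence: outside [0, 1]")
    return errors
```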
```json
{
  "type": "JSON_SCHEMA",
  "name": "valid-response-format",
  "config": {
    "schema": {
      "type": "object",
      "required": ["answer", "confidence"],
      "properties": {
        "answer": { "type": "string", "minLength": 1 },
        "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
      }
    }
  }
}
```

Grade Response Quality (LLM_JUDGE)
Use an LLM to judge the quality of each response on a numeric scale. The criteria prompt tells the judge exactly what to look for.
```json
{
  "type": "LLM_JUDGE",
  "name": "helpfulness",
  "config": {
    "criteria": "Does the response directly answer the user's question with accurate, actionable information? Score 1-5 where 5 is a perfect answer.",
    "scale": "1-5",
    "model": "gpt-4o-mini"
  }
}
```

Enforce Latency SLA (LATENCY)
Flag traces that exceed your latency budget. Set a hard maximum and an optional target — scores are proportional to how close the response time is to the target.
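One plausible reading of "proportional" scoring is a linear falloff between the target and the hard maximum. The helper below is an illustrative sketch under that assumption, not the exact formula 2Signal uses:

```python
def latency_score(duration_ms, max_ms=5000, target_ms=2000):
    """Assumed scoring: 1.0 at or under target, 0.0 over max, linear between."""
    if duration_ms <= target_ms:
        return 1.0
    if duration_ms > max_ms:
        return 0.0
    # Linear interpolation between target (score 1.0) and max (score 0.0).
    return 1.0 - (duration_ms - target_ms) / (max_ms - target_ms)
```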
```json
{
  "type": "LATENCY",
  "name": "response-time-sla",
  "config": {
    "max_ms": 5000,
    "target_ms": 2000
  }
}
```

Step 3: How Scoring Works
Evaluation is fully asynchronous so it never slows down trace ingestion. Here is the flow:
- Your SDK sends a trace to `POST /api/v1/traces`. The API returns immediately.
- The trace is persisted to S3 and then written to PostgreSQL by the trace-writer worker.
- The trace-writer enqueues a job on the `evalRunnerQueue`.
- The eval-runner worker picks up the job, runs every enabled evaluator for the project, and writes a score for each one.
- Scores appear on the trace detail page in the dashboard. If any alert rules are configured, the eval-runner also enqueues an alert-checker job.
Because evals run in background workers, there is typically a short delay (seconds) between trace ingestion and scores being visible.
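The hand-off described above can be illustrated with a small in-process sketch. The queue and function names here are hypothetical stand-ins, not 2Signal internals:

```python
import queue

# Illustrative stand-in for the background job queue.
eval_runner_queue = queue.Queue()

def ingest_trace(trace):
    """POST /api/v1/traces: accept the trace and return immediately."""
    # (Persistence to S3/PostgreSQL happens before enqueueing in the real flow.)
    eval_runner_queue.put(trace)
    return {"status": "accepted"}  # no scores yet -- scoring is asynchronous

def run_evals(evaluators):
    """Background eval-runner: score one queued trace with every evaluator."""
    trace = eval_runner_queue.get()
    return {name: evaluate(trace) for name, evaluate in evaluators.items()}

ingest_trace({"output": "Our pricing starts at $10."})
scores = run_evals({"mentions-pricing": lambda t: "pricing" in t["output"]})
```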
Step 4: Submit Scores via API
If you compute scores outside of 2Signal (for example in a custom pipeline or CI job), you can push them directly through the REST API:
```shell
curl -X POST https://your-instance.2signal.dev/api/v1/scores \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "traceId": "trace-uuid",
    "evaluatorName": "custom-check",
    "value": 0.95,
    "label": "PASS",
    "reasoning": "All criteria met"
  }'
```

The score is associated with the trace and appears alongside evaluator-generated scores in the dashboard.
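The same request can be made from Python with only the standard library. The endpoint and payload fields mirror the curl call above; `build_score_request` is a helper name of ours:

```python
import json
import urllib.request

BASE_URL = "https://your-instance.2signal.dev"  # your 2Signal instance
API_KEY = "your-api-key"

def build_score_request(trace_id, evaluator_name, value, label=None, reasoning=None):
    """Build the POST /api/v1/scores request (send it with urlopen)."""
    payload = {"traceId": trace_id, "evaluatorName": evaluator_name, "value": value}
    if label is not None:
        payload["label"] = label
    if reasoning is not None:
        payload["reasoning"] = reasoning
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/scores",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_score_request("trace-uuid", "custom-check", 0.95, "PASS", "All criteria met")
# urllib.request.urlopen(req) would actually submit the score
```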
What's Next
- Validate Output Format — Deep dive into JSON Schema and Regex evaluators for enforcing structured outputs.
- Monitor Cost & Latency — Track spend and response times with automatic alerts.