# Cookbook: Safety Evaluators
This recipe sets up a comprehensive safety monitoring pipeline using 2Signal's built-in evaluators for bias, PII leaks, toxicity, and prompt injection — catching safety issues automatically on every trace.
## Prerequisites
- A 2Signal project with traces flowing in
- An OpenAI API key configured (for LLM-based evaluators)
## The Safety Stack
Create these four evaluators in your project. Together they cover the most common AI safety risks:
### 1. Toxicity (Deterministic)
```json
{
  "name": "toxicity-check",
  "type": "TOXICITY",
  "config": {}
}
```
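For intuition, a blocklist-based check of this kind can be sketched in a few lines. The blocklist and scoring below are invented for illustration and are not 2Signal's actual implementation:

```python
import re

# Hypothetical blocklist; a real evaluator ships a much larger curated list.
BLOCKLIST = {"idiot", "moron", "stupid"}

def toxicity_score(text: str) -> float:
    """Fraction of tokens that match the blocklist (0.0 = clean)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in BLOCKLIST)
    return hits / len(tokens)
```

Because the check is a pure string operation, it adds effectively zero latency and no per-trace cost.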
Uses a blocklist-based approach: fast, with no API calls. Catches profanity, slurs, and known toxic patterns.

### 2. PII Detection (LLM-based)
```json
{
  "name": "pii-leak-check",
  "type": "PII_DETECTION",
  "config": {
    "model": "gpt-4o-mini"
  }
}
```
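Although this evaluator is LLM-based, a regex pass illustrates the kinds of patterns involved. Everything below (the pattern set, the function name) is an illustrative sketch, not 2Signal's code:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Map each PII category to the matches found in `text`."""
    return {
        kind: pat.findall(text)
        for kind, pat in PII_PATTERNS.items()
        if pat.search(text)
    }
```

An LLM pass goes beyond this kind of matching, catching PII that has no fixed format, such as street addresses or full names in context.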
Detects emails, phone numbers, SSNs, addresses, and other PII in agent outputs that shouldn't contain it.

### 3. Bias Detection (Hybrid)
```json
{
  "name": "bias-check",
  "type": "BIAS_DETECTION",
  "config": {
    "mode": "both",
    "model": "gpt-4o-mini"
  }
}
```
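To make the hybrid design concrete, here is a hypothetical sketch of how a "both"-mode evaluator can combine a fast deterministic pass with an LLM pass. The keyword list and scoring scheme are invented for illustration:

```python
# Illustrative keyword list; a real deterministic pass is far more nuanced.
BIAS_KEYWORDS = {"all women", "all men", "those people"}

def keyword_bias_score(text: str) -> float:
    """Deterministic pass: 1.0 on a keyword hit, else 0.0."""
    t = text.lower()
    return 1.0 if any(k in t for k in BIAS_KEYWORDS) else 0.0

def hybrid_bias_score(text: str, llm_score_fn) -> float:
    """Combine both passes; the LLM scorer is injected as a callable."""
    deterministic = keyword_bias_score(text)
    if deterministic >= 1.0:
        return deterministic  # keyword hit: no need to spend an LLM call
    return max(deterministic, llm_score_fn(text))
```

Short-circuiting on a deterministic hit is one plausible way a hybrid evaluator keeps cost down: the LLM call only happens when the cheap pass finds nothing.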
// "both" mode runs deterministic keyword checks AND LLM analysis
// Categories: gender, race, age, disability, religion4. Prompt Injection (Deterministic + LLM)
```json
{
  "name": "prompt-injection-guard",
  "type": "PROMPT_INJECTION",
  "config": {
    "model": "gpt-4o-mini"
  }
}
```
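The deterministic half of this check can be sketched as a pattern match over known attack phrasings. The patterns below are illustrative, not the evaluator's actual list:

```python
import re

# Common injection phrasings; novel attacks are left to the LLM pass.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?(system|hidden) prompt", re.I),
    re.compile(r"you are now in developer mode", re.I),
]

def looks_like_injection(user_input: str) -> bool:
    """Deterministic pass: flag inputs matching known attack patterns."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

Pattern lists like this are cheap but brittle against rephrasing, which is why an LLM pass backs them up.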
Checks whether user input contains prompt injection attempts. Deterministic patterns catch common attacks; the LLM pass catches novel ones.

## Setting Up Alerts
With safety evaluators in place, configure alerts to get notified immediately when issues are detected:
An alert that fires when the PII leak rate exceeds 1%:

```json
{
  "metric": "PII_LEAK_RATE",
  "threshold": 0.01,
  "window_minutes": 60,
  "channel": "SLACK"
}
```
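Conceptually, a windowed threshold alert like this amounts to a trailing-window rate check. A minimal sketch under assumed semantics (not 2Signal's implementation):

```python
from datetime import datetime, timedelta

def should_alert(events, now, threshold=0.01, window_minutes=60):
    """events: list of (timestamp, leaked: bool) pairs.
    Fire when the leak rate inside the trailing window exceeds the threshold."""
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [leaked for ts, leaked in events if ts >= cutoff]
    if not recent:
        return False  # no traces in the window: nothing to alert on
    return sum(recent) / len(recent) > threshold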
An alert that fires when the average bias score rises above 0.1 (a higher score means more bias detected):

```json
{
  "metric": "BIAS_SCORE_AVG",
  "threshold": 0.1,
  "window_minutes": 60,
  "channel": "EMAIL"
}
```

## Monitoring in the Dashboard
After enabling these evaluators, you can:
- Filter traces by evaluator score to find flagged outputs
- Track safety metrics over time on the Overview page
- Set up drift detection to catch gradual safety degradation
- Export flagged traces to review queues for human verification
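Drift detection in this context means comparing recent evaluator scores against an earlier baseline. A minimal sketch, with window sizes and tolerance chosen purely for illustration:

```python
from statistics import mean

def drifted(scores, baseline_n=50, recent_n=50, tolerance=0.05):
    """Flag gradual degradation: compare the mean of the most recent
    scores against an earlier baseline window."""
    if len(scores) < baseline_n + recent_n:
        return False  # not enough data to compare windows
    baseline = mean(scores[:baseline_n])
    recent = mean(scores[-recent_n:])
    return recent - baseline > tolerance
```

The point of drift checks is that no single trace trips an alert; only the slow upward creep of the average does.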
## Performance Notes
| Evaluator | Type | Latency | Cost |
|---|---|---|---|
| Toxicity | Deterministic | <1ms | Free |
| PII Detection | LLM | 1–3s | ~$0.001/trace |
| Bias Detection (both) | Hybrid | 1–3s | ~$0.001/trace |
| Prompt Injection | Hybrid | 1–3s | ~$0.001/trace |
All evaluators run asynchronously in workers — they never slow down trace ingestion.