Cookbook

Safety Evaluators

This recipe sets up a comprehensive safety monitoring pipeline using 2Signal's built-in evaluators for bias, PII leaks, toxicity, and prompt injection — catching safety issues automatically on every trace.

Prerequisites

  • A 2Signal project with traces flowing in
  • An OpenAI API key configured (for LLM-based evaluators)

The Safety Stack

Create these four evaluators in your project. Together they cover the most common AI safety risks:

1. Toxicity (Deterministic)

{
  "name": "toxicity-check",
  "type": "TOXICITY",
  "config": {}
}
// Uses a blocklist-based approach — fast, no API calls
// Catches profanity, slurs, and known toxic patterns

2. PII Detection (LLM-based)

{
  "name": "pii-leak-check",
  "type": "PII_DETECTION",
  "config": {
    "model": "gpt-4o-mini"
  }
}
// Detects emails, phone numbers, SSNs, addresses, etc.
// in agent outputs that shouldn't contain PII

3. Bias Detection (Hybrid)

{
  "name": "bias-check",
  "type": "BIAS_DETECTION",
  "config": {
    "mode": "both",
    "model": "gpt-4o-mini"
  }
}
// "both" mode runs deterministic keyword checks AND LLM analysis
// Categories: gender, race, age, disability, religion

4. Prompt Injection (Deterministic + LLM)

{
  "name": "prompt-injection-guard",
  "type": "PROMPT_INJECTION",
  "config": {
    "model": "gpt-4o-mini"
  }
}
// Checks if user input contains prompt injection attempts
// Deterministic patterns catch common attacks, LLM catches novel ones

Setting Up Alerts

With safety evaluators in place, configure alerts to get notified immediately when issues are detected:

// Alert: PII leak rate exceeds 1%
{
  "metric": "PII_LEAK_RATE",
  "threshold": 0.01,
  "window_minutes": 60,
  "channel": "SLACK"
}

// Alert: Bias score average drops (higher = more bias detected)
{
  "metric": "BIAS_SCORE_AVG",
  "threshold": 0.1,
  "window_minutes": 60,
  "channel": "EMAIL"
}

Monitoring in the Dashboard

After enabling these evaluators, you can:

  • Filter traces by evaluator score to find flagged outputs
  • Track safety metrics over time on the Overview page
  • Set up drift detection to catch gradual safety degradation
  • Export flagged traces to review queues for human verification

Performance Notes

EvaluatorTypeLatencyCost
ToxicityDeterministic<1msFree
PII DetectionLLM1–3s~$0.001/trace
Bias Detection (both)Hybrid1–3s~$0.001/trace
Prompt InjectionHybrid1–3s~$0.001/trace

All evaluators run asynchronously in workers — they never slow down trace ingestion.

Have questions? Join our community!

Connect with other developers and the 2Signal team.

Join Discord