Cookbook

Safety Evaluators

This recipe sets up a comprehensive safety monitoring pipeline using 2Signal's built-in evaluators for bias, PII leaks, toxicity, and prompt injection — catching safety issues automatically on every trace.

Prerequisites

A 2Signal project with traces flowing in
An OpenAI API key configured (for LLM-based evaluators)

The Safety Stack

Create these four evaluators in your project. Together they cover the most common AI safety risks:

1. Toxicity (Deterministic)

{
  "name": "toxicity-check",
  "type": "TOXICITY",
  "config": {}
}
// Uses a blocklist-based approach — fast, no API calls
// Catches profanity, slurs, and known toxic patterns

2. PII Detection (LLM-based)

{
  "name": "pii-leak-check",
  "type": "PII_DETECTION",
  "config": {
    "model": "gpt-4o-mini"
  }
}
// Detects emails, phone numbers, SSNs, addresses, etc.
// in agent outputs that shouldn't contain PII

3. Bias Detection (Hybrid)

{
  "name": "bias-check",
  "type": "BIAS_DETECTION",
  "config": {
    "mode": "both",
    "model": "gpt-4o-mini"
  }
}
// "both" mode runs deterministic keyword checks AND LLM analysis
// Categories: gender, race, age, disability, religion

4. Prompt Injection (Deterministic + LLM)

{
  "name": "prompt-injection-guard",
  "type": "PROMPT_INJECTION",
  "config": {
    "model": "gpt-4o-mini"
  }
}
// Checks if user input contains prompt injection attempts
// Deterministic patterns catch common attacks, LLM catches novel ones

Setting Up Alerts

With safety evaluators in place, configure alerts to get notified immediately when issues are detected:

// Alert: PII leak rate exceeds 1%
{
  "metric": "PII_LEAK_RATE",
  "threshold": 0.01,
  "window_minutes": 60,
  "channel": "SLACK"
}

// Alert: Bias score average drops (higher = more bias detected)
{
  "metric": "BIAS_SCORE_AVG",
  "threshold": 0.1,
  "window_minutes": 60,
  "channel": "EMAIL"
}

Monitoring in the Dashboard

After enabling these evaluators, you can:

Filter traces by evaluator score to find flagged outputs
Track safety metrics over time on the Overview page
Set up drift detection to catch gradual safety degradation
Export flagged traces to review queues for human verification

Performance Notes

Evaluator	Type	Latency	Cost
Toxicity	Deterministic	<1ms	Free
PII Detection	LLM	1–3s	~$0.001/trace
Bias Detection (both)	Hybrid	1–3s	~$0.001/trace
Prompt Injection	Hybrid	1–3s	~$0.001/trace

All evaluators run asynchronously in workers — they never slow down trace ingestion.

Have questions? Join our community.

Connect with other developers and the 2Signal team.

Join Discord