# Prompt Injection
Detects prompt injection and jailbreak attempts in agent inputs and outputs. Supports deterministic pattern matching, LLM-based analysis, or both combined. The deterministic mode scans for 25+ known injection patterns including instruction override, role reassignment, encoding tricks, and prompt extraction.
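The deterministic layer is, in essence, a list of compiled regexes run over the text. A minimal sketch of that idea, where the pattern names and regexes are illustrative stand-ins rather than the check's actual 25+ patterns:

```python
import re

# Illustrative subset of injection patterns; the real check ships 25+.
INJECTION_PATTERNS = {
    "instruction_override": re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I),
    "role_reassignment": re.compile(r"you\s+are\s+now\s+DAN", re.I),
    "prompt_extraction": re.compile(r"(repeat|reveal|print)\s+your\s+system\s+prompt", re.I),
}

def scan(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pat in INJECTION_PATTERNS.items() if pat.search(text)]
```

Because every pattern is precompiled, a scan is a handful of regex searches with no network calls, which is what makes the deterministic mode cheap enough to run on every message.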
## Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `mode` | string | No | `deterministic` | `deterministic`, `llm`, or `both` |
| `model` | string | No | `gpt-4o-mini` | OpenAI model for LLM mode (must be in the allowed list) |
| `api_key` | string | No | env `OPENAI_API_KEY` | OpenAI API key for LLM mode |
| `check` | string | No | `input` | `input`, `output`, or `both` — what text to scan |
## Use Cases
- Input guardrails — Scan user messages before they reach your agent to block known jailbreak patterns like "ignore previous instructions" or "DAN mode."
- Output safety — Check agent responses for signs that an injection succeeded, such as system prompt leakage or role-play compliance.
- Defense in depth — Use `mode: "both"` to combine fast deterministic pattern matching with LLM reasoning, catching novel attack vectors that don't match known patterns.
- Security auditing — Run against historical traces to identify past injection attempts and measure your agent's resilience over time.
## Examples
### Deterministic only (fastest)

```json
// Scan input for known injection patterns
{
  "mode": "deterministic",
  "check": "input"
}
```

### LLM-based detection
```json
// Use GPT-4o for nuanced injection detection
{
  "mode": "llm",
  "model": "gpt-4o",
  "check": "both"
}
```

### Combined mode
```json
// Deterministic + LLM for maximum coverage
{
  "mode": "both",
  "model": "gpt-4o-mini",
  "check": "input"
}
// Fails if EITHER method detects an injection
```

## Scoring
Returns 1.0 (pass — no injection detected) or 0.0 (fail — injection detected). No intermediate scores. In `both` mode, the check fails if either the deterministic patterns or the LLM flags an injection. The `reasoning` field lists which specific patterns were matched and/or the LLM's explanation.
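The OR-combination of the two detectors can be sketched as follows; the field names and result shape here are assumed from this page's description, not taken from the check's source:

```python
def combine(det_matches: list[str], llm_flagged: bool, llm_explanation: str) -> dict:
    """Binary score: fail (0.0) if either method detects an injection."""
    failed = bool(det_matches) or llm_flagged
    reasons = [f"matched pattern: {m}" for m in det_matches]
    if llm_flagged:
        reasons.append(llm_explanation)
    return {
        "score": 0.0 if failed else 1.0,
        "reasoning": "; ".join(reasons) or "no injection detected",
    }
```

Note there is no partial credit: one matched pattern fails the check exactly as hard as ten.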
## Performance
Deterministic mode runs in under 1 ms with no external calls — ideal for high-volume pipelines. LLM mode adds an OpenAI API call (typically 1–3 seconds, depending on model). In `both` mode, the deterministic check runs first, followed by the LLM call. Empty text inputs short-circuit to a pass immediately.
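The control flow implied above (empty input passes immediately, deterministic patterns run first, the LLM call happens only when the mode requires it) might look like this; the function names are illustrative, not the check's actual internals, and the detectors are stubbed so the sketch runs:

```python
def evaluate(text: str, mode: str = "deterministic") -> float:
    """Illustrative dispatch order for the check; detectors stubbed below."""
    if not text.strip():
        return 1.0  # empty input short-circuits to a pass, no detector runs
    if mode in ("deterministic", "both"):
        if deterministic_detect(text):  # sub-millisecond, no network
            return 0.0
    if mode in ("llm", "both"):
        if llm_detect(text):  # one OpenAI API call, typically 1-3 s
            return 0.0
    return 1.0

# Stub detectors standing in for the real implementations.
def deterministic_detect(text: str) -> bool:
    return "ignore previous instructions" in text.lower()

def llm_detect(text: str) -> bool:
    return False  # placeholder for the real API call
```

Ordering the cheap check first means an obvious injection never pays for the LLM round trip, even in `both` mode.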