Prompt Injection

Detects prompt injection and jailbreak attempts in agent inputs and outputs. Supports deterministic pattern matching, LLM-based analysis, or both combined. The deterministic mode scans for 25+ known injection patterns including instruction override, role reassignment, encoding tricks, and prompt extraction.

Config

FieldTypeRequiredDefaultDescription
modestringNodeterministicdeterministic, llm, or both
modelstringNogpt-4o-miniOpenAI model for LLM mode (must be in allowed list)
api_keystringNoenv OPENAI_API_KEYOpenAI API key for LLM mode
checkstringNoinputinput, output, or both — what text to scan

Use Cases

  • Input guardrails — Scan user messages before they reach your agent to block known jailbreak patterns like "ignore previous instructions" or "DAN mode."
  • Output safety — Check agent responses for signs that an injection succeeded, such as system prompt leakage or role-play compliance.
  • Defense in depth — Use mode: "both" to combine fast deterministic pattern matching with LLM reasoning for catching novel attack vectors that don't match known patterns.
  • Security auditing — Run against historical traces to identify past injection attempts and measure your agent's resilience over time.

Examples

Deterministic only (fastest)

// Scan input for known injection patterns
{
  "mode": "deterministic",
  "check": "input"
}

LLM-based detection

// Use GPT-4o for nuanced injection detection
{
  "mode": "llm",
  "model": "gpt-4o",
  "check": "both"
}

Combined mode

// Deterministic + LLM for maximum coverage
{
  "mode": "both",
  "model": "gpt-4o-mini",
  "check": "input"
}
// Fails if EITHER method detects an injection

Scoring

Returns 1.0 (pass — no injection detected) or 0.0 (fail — injection detected). No intermediate scores. In both mode, the check fails if either the deterministic patterns or the LLM flags an injection. The reasoning field lists which specific patterns were matched and/or the LLM's explanation.

Performance

Deterministic mode runs in under 1ms with no external calls — ideal for high-volume pipelines. LLM mode adds an OpenAI API call (typically 1–3 seconds depending on model). In both mode, the deterministic check runs first, followed by the LLM call. Empty text inputs short-circuit to a pass immediately.

Have questions? Join our community!

Connect with other developers and the 2Signal team.

Join Discord