# Prompt Injection
Detects prompt injection and jailbreak attempts in agent inputs and outputs. Supports deterministic pattern matching, LLM-based analysis, or both combined. The deterministic mode scans for 25+ known injection patterns including instruction override, role reassignment, encoding tricks, and prompt extraction.
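The deterministic layer is, in essence, a list of compiled regexes run over the text. A minimal sketch of that idea, where the pattern names and regexes are illustrative stand-ins rather than the check's actual 25+ patterns:

```python
import re

# Illustrative subset of injection patterns; the real check ships 25+.
INJECTION_PATTERNS = {
    "instruction_override": re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I),
    "role_reassignment": re.compile(r"you\s+are\s+now\s+DAN", re.I),
    "prompt_extraction": re.compile(r"(repeat|reveal|print)\s+your\s+system\s+prompt", re.I),
}

def scan(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pat in INJECTION_PATTERNS.items() if pat.search(text)]
```

Because every pattern is precompiled, a scan is a handful of regex searches with no network calls, which is what makes the deterministic mode cheap enough to run on every message.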
## Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `mode` | string | No | `deterministic` | `deterministic`, `llm`, or `both` |
| `model` | string | No | `gpt-4o-mini` | OpenAI model for LLM mode (must be in the allowed list) |
| `api_key` | string | No | env `OPENAI_API_KEY` | OpenAI API key for LLM mode |
| `check` | string | No | `input` | `input`, `output`, or `both` — what text to scan |
## Use Cases
- Input guardrails — Scan user messages before they reach your agent to block known jailbreak patterns like "ignore previous instructions" or "DAN mode."
- Output safety — Check agent responses for signs that an injection succeeded, such as system prompt leakage or role-play compliance.
- Defense in depth — Use `mode: "both"` to combine fast deterministic pattern matching with LLM reasoning, catching novel attack vectors that don't match known patterns.
- Security auditing — Run against historical traces to identify past injection attempts and measure your agent's resilience over time.
## Examples
### Deterministic only (fastest)

```json
// Scan input for known injection patterns
{
  "mode": "deterministic",
  "check": "input"
}
```

### LLM-based detection
```json
// Use GPT-4o for nuanced injection detection
{
  "mode": "llm",
  "model": "gpt-4o",
  "check": "both"
}
```

### Combined mode
```json
// Deterministic + LLM for maximum coverage
{
  "mode": "both",
  "model": "gpt-4o-mini",
  "check": "input"
}
// Fails if EITHER method detects an injection
```

## Scoring
Returns 1.0 (pass — no injection detected) or 0.0 (fail — injection detected). No intermediate scores. In `both` mode, the check fails if either the deterministic patterns or the LLM flags an injection. The `reasoning` field lists which specific patterns were matched and/or the LLM's explanation.
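The OR-combination of the two detectors can be sketched as follows; the field names and result shape here are assumed from this page's description, not taken from the check's source:

```python
def combine(det_matches: list[str], llm_flagged: bool, llm_explanation: str) -> dict:
    """Binary score: fail (0.0) if either method detects an injection."""
    failed = bool(det_matches) or llm_flagged
    reasons = [f"matched pattern: {m}" for m in det_matches]
    if llm_flagged:
        reasons.append(llm_explanation)
    return {
        "score": 0.0 if failed else 1.0,
        "reasoning": "; ".join(reasons) or "no injection detected",
    }
```

Note there is no partial credit: one matched pattern fails the check exactly as hard as ten.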
## Performance
Deterministic mode runs in under 1 ms with no external calls — ideal for high-volume pipelines. LLM mode adds an OpenAI API call (typically 1–3 seconds, depending on model). In `both` mode, the deterministic check runs first, followed by the LLM call. Empty text inputs short-circuit to a pass immediately.
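The control flow implied above (empty input passes immediately, deterministic patterns run first, the LLM call happens only when the mode requires it) might look like this; the function names are illustrative, not the check's actual internals, and the detectors are stubbed so the sketch runs:

```python
def evaluate(text: str, mode: str = "deterministic") -> float:
    """Illustrative dispatch order for the check; detectors stubbed below."""
    if not text.strip():
        return 1.0  # empty input short-circuits to a pass, no detector runs
    if mode in ("deterministic", "both"):
        if deterministic_detect(text):  # sub-millisecond, no network
            return 0.0
    if mode in ("llm", "both"):
        if llm_detect(text):  # one OpenAI API call, typically 1-3 s
            return 0.0
    return 1.0

# Stub detectors standing in for the real implementations.
def deterministic_detect(text: str) -> bool:
    return "ignore previous instructions" in text.lower()

def llm_detect(text: str) -> bool:
    return False  # placeholder for the real API call
```

Ordering the cheap check first means an obvious injection never pays for the LLM round trip, even in `both` mode.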