Groundedness

LLM-powered hallucination detection that checks whether the agent output is faithfully grounded in the provided context. Uses OpenAI models to identify unsupported claims.

Config

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | No | gpt-4o-mini | OpenAI model to use. Allowed: gpt-4o-mini, gpt-4o, gpt-4.1-mini, gpt-4.1-nano, gpt-3.5-turbo |
| api_key | string | No | — | OpenAI API key. Falls back to the OPENAI_API_KEY environment variable |
| context_field | string | No | input | Which field provides the source of truth: input or expectedOutput |
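
None of the examples below set every field at once, so for illustration, here is a config that uses all three (the api_key value is a placeholder — in practice you would usually rely on the OPENAI_API_KEY environment variable instead):

// Full config with an explicit API key (placeholder value)
{
  "model": "gpt-4o",
  "api_key": "sk-YOUR-KEY-HERE",
  "context_field": "expectedOutput"
}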

Use Cases

  • RAG pipeline validation — Verify that retrieval-augmented generation outputs are faithful to the retrieved documents and don't introduce fabricated information.
  • Hallucination detection — Catch cases where the agent generates plausible-sounding but factually unsupported claims that aren't grounded in the provided context.
  • Document summarization QA — Ensure summaries accurately reflect the source material without adding or distorting information.
  • Knowledge base compliance — Validate that agents answering from a knowledge base stick to the provided information rather than relying on training data.

Examples

Basic groundedness check (context from input)

// Check if output is grounded in the input context
{
  "model": "gpt-4o-mini",
  "context_field": "input"
}
// Input: "Our return policy allows returns within 30 days."
// Output: "You can return items within 30 days." → score: 1.0 (fully grounded)
// Output: "You can return items within 90 days for a full refund." → low score (unsupported claim)

Context from expectedOutput

// Use expectedOutput as the ground truth context
{
  "model": "gpt-4o",
  "context_field": "expectedOutput"
}
// expectedOutput: "Paris is the capital of France. It has a population of 2.1 million."
// Output: "Paris, the capital of France, has about 2.1 million residents." → high score
// Output: "Paris, the capital of France, was founded in 250 BC." → lower score (founding date not in context)

Using a more capable model

// Use gpt-4o for more nuanced analysis
{
  "model": "gpt-4o"
}

Scoring

The LLM assigns a raw score from 1 (mostly unsupported) to 5 (fully grounded), which is normalized to a 0–1 range: (raw_score - 1) / 4. Normalized scores of 0.5 or above are labeled "pass"; lower scores are labeled "fail". The reasoning includes the raw score, an explanation, and a list of any unsupported claims. Both context and output are truncated to 4,000 characters and sanitized before being sent to the LLM.
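
The normalization and labeling step can be sketched as follows (a hypothetical helper for illustration, not the evaluator's actual source):

```python
def normalize_groundedness(raw_score: int) -> tuple[float, str]:
    """Map a raw LLM score (1 = mostly unsupported, 5 = fully grounded)
    to a 0-1 score and a pass/fail label at the 0.5 threshold."""
    if not 1 <= raw_score <= 5:
        raise ValueError("raw score must be between 1 and 5")
    # (raw_score - 1) / 4 maps 1 -> 0.0, 3 -> 0.5, 5 -> 1.0
    score = (raw_score - 1) / 4
    label = "pass" if score >= 0.5 else "fail"
    return score, label
```

Note that a raw score of 3 normalizes to exactly 0.5 and therefore passes.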

Performance

Groundedness makes one OpenAI API call per evaluation with a 30-second timeout. Latency depends on the chosen model (gpt-4o-mini is fastest, typically 1–3 seconds). Each evaluation incurs LLM token costs. For high-volume pipelines, consider using gpt-4o-mini or gpt-4.1-nano to balance cost and accuracy.
