# Groundedness

LLM-powered hallucination detection that checks whether the agent's output is faithfully grounded in the provided context. Uses OpenAI models to identify unsupported claims.
## Config

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | No | `gpt-4o-mini` | OpenAI model to use. Allowed: `gpt-4o-mini`, `gpt-4o`, `gpt-4.1-mini`, `gpt-4.1-nano`, `gpt-3.5-turbo` |
| `api_key` | string | No | — | OpenAI API key. Falls back to the `OPENAI_API_KEY` environment variable |
| `context_field` | string | No | `input` | Which field provides the source of truth: `input` or `expectedOutput` |
## Use Cases
- RAG pipeline validation — Verify that retrieval-augmented generation outputs are faithful to the retrieved documents and don't introduce fabricated information.
- Hallucination detection — Catch cases where the agent generates plausible-sounding but factually unsupported claims that aren't grounded in the provided context.
- Document summarization QA — Ensure summaries accurately reflect the source material without adding or distorting information.
- Knowledge base compliance — Validate that agents answering from a knowledge base stick to the provided information rather than relying on training data.
## Examples

### Basic groundedness check (context from input)
```
// Check if output is grounded in the input context
{
  "model": "gpt-4o-mini",
  "context_field": "input"
}

// Input: "Our return policy allows returns within 30 days."
// Output: "You can return items within 30 days." → score: 1.0 (fully grounded)
// Output: "You can return items within 90 days for a full refund." → low score (unsupported claim)
```

### Context from expectedOutput
```
// Use expectedOutput as the ground truth context
{
  "model": "gpt-4o",
  "context_field": "expectedOutput"
}

// expectedOutput: "Paris is the capital of France. It has a population of 2.1 million."
// Output: "Paris, the capital of France, has about 2.1 million residents." → high score
// Output: "Paris, the capital of France, was founded in 250 BC." → lower score (founding date not in context)
```

### Using a more capable model
```
// Use gpt-4o for more nuanced analysis
{
  "model": "gpt-4o"
}
```

## Scoring
The LLM assigns a raw score from 1 (mostly unsupported) to 5 (fully grounded), which is normalized to the 0–1 range as `(raw_score - 1) / 4`. A normalized score of 0.5 or above is labeled "pass"; below 0.5 is labeled "fail". The reasoning includes the raw score, an explanation, and a list of any unsupported claims. Both context and output are truncated to 4,000 characters and sanitized before being sent to the LLM.
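The normalization and labeling above can be sketched in a few lines of Python (a minimal illustration; the function names are hypothetical, not part of the evaluator's API):

```python
def normalize_groundedness(raw_score: int) -> float:
    """Map the LLM's 1-5 raw score onto the 0-1 range via (raw_score - 1) / 4."""
    if not 1 <= raw_score <= 5:
        raise ValueError("raw score must be an integer between 1 and 5")
    return (raw_score - 1) / 4


def score_label(normalized: float) -> str:
    """A normalized score of 0.5 or above passes; anything below fails."""
    return "pass" if normalized >= 0.5 else "fail"
```

For example, a raw score of 3 normalizes to 0.5 and therefore just passes, while a raw score of 2 normalizes to 0.25 and fails.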
## Performance
Groundedness makes one OpenAI API call per evaluation with a 30-second timeout. Latency depends on the chosen model (gpt-4o-mini is fastest, typically 1–3 seconds). Each evaluation incurs LLM token costs. For high-volume pipelines, consider using gpt-4o-mini or gpt-4.1-nano to balance cost and accuracy.
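A cost-conscious configuration for such a pipeline might look like the following (a sketch using the fields from the Config table; `gpt-4.1-nano` is the cheapest allowed model, at some cost in accuracy):

```
// High-volume pipeline: cheapest supported model, context taken from input
{
  "model": "gpt-4.1-nano",
  "context_field": "input"
}
```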