# Factual Accuracy
An LLM-based evaluator that extracts individual factual claims from the agent's output and verifies each one against a provided context (the source of truth). It returns the ratio of verified claims to total claims, making it effective at detecting hallucinations and unsupported assertions.
## Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | No | `gpt-4o-mini` | OpenAI model to use (must be in the allowed list) |
| `api_key` | string | No | env `OPENAI_API_KEY` | OpenAI API key |
| `context_field` | string | No | `input` | Field providing the context: `input` or `expectedOutput` |
| `strict` | boolean | No | `false` | If true, ambiguous or partially supported claims count as unverified |
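The defaulting rules in the table can be sketched as a small resolver. This is illustrative only: the function name, the allow-list contents, and the error messages are assumptions, not the evaluator's actual implementation.

```python
import os

# Illustrative allow-list; the real allowed models are defined by the evaluator.
ALLOWED_MODELS = {"gpt-4o-mini", "gpt-4o"}

def resolve_config(raw: dict) -> dict:
    """Apply the defaults from the config table (hypothetical sketch)."""
    model = raw.get("model", "gpt-4o-mini")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model {model!r} is not in the allowed list")
    context_field = raw.get("context_field", "input")
    if context_field not in ("input", "expectedOutput"):
        raise ValueError("context_field must be 'input' or 'expectedOutput'")
    return {
        "model": model,
        "api_key": raw.get("api_key") or os.environ.get("OPENAI_API_KEY"),
        "context_field": context_field,
        "strict": bool(raw.get("strict", False)),
    }
```

An empty config `{}` resolves to the defaults shown in the table.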
## Use Cases
- RAG pipelines — Verify that retrieval-augmented generation responses only contain facts supported by the retrieved documents, catching hallucinated details.
- Knowledge base agents — Ensure customer support agents don't fabricate product features, pricing, or policy details that aren't in the source material.
- Summarization quality — Check that summaries accurately reflect the original content without introducing new or contradictory information.
- Strict compliance — Enable `strict` mode for regulated domains where even ambiguous claims must be flagged for human review.
## Examples

### Basic factual check (context from input)

```json
{
  "model": "gpt-4o-mini",
  "context_field": "input"
}
```

The agent's input (e.g. retrieved documents) is used as the source of truth; the agent's output is checked for unsupported claims.

### Using expected output as context

```json
{
  "model": "gpt-4o",
  "context_field": "expectedOutput"
}
```

Useful for dataset evaluations where `expectedOutput` contains the ground truth.

### Strict mode

```json
{
  "model": "gpt-4o",
  "strict": true
}
```

Ambiguous claims are counted as unverified — better for medical, legal, or financial agents.

## Scoring
Returns verified_claims / total_claims, rounded to two decimal places. A score of 1.0 means every claim in the output is supported by the context. Scores at or above 0.5 are labeled "pass"; below 0.5 are labeled "fail." If no factual claims are found in the output, the score is 1.0. The reasoning field lists the claim counts and any specific unverified claims.
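The scoring rule above is simple enough to sketch directly. The function name and return shape here are assumptions for illustration; only the arithmetic (ratio, two-decimal rounding, 0.5 pass threshold, empty-claims case) comes from the description.

```python
def score_claims(verified: int, total: int) -> dict:
    """Score = verified / total, rounded to two decimals (sketch).

    If no factual claims were found (total == 0), the score is 1.0.
    Scores >= 0.5 are labeled "pass", below 0.5 "fail".
    """
    score = 1.0 if total == 0 else round(verified / total, 2)
    return {"score": score, "label": "pass" if score >= 0.5 else "fail"}
```

For example, 3 verified out of 4 claims scores 0.75 ("pass"), while 1 of 3 scores 0.33 ("fail").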
## Performance
Requires an OpenAI API call (typically 2–5 seconds depending on output length and model). Both context and output are sanitized and truncated to 4,000 characters before sending to the LLM. Missing context or output short-circuits to a fail immediately without making an API call.
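The pre-flight behavior described above — fail fast on missing data, truncate to 4,000 characters before the LLM call — can be sketched as follows. The function name and return shape are hypothetical; the 4,000-character limit and the short-circuit rule come from the text.

```python
MAX_CHARS = 4000  # truncation limit stated in the docs

def prepare_inputs(context, output):
    """Short-circuit to a fail when context or output is missing;
    otherwise truncate both sides before the API call (sketch)."""
    if not context or not output:
        # No API call is made in this case.
        return {"score": 0.0, "label": "fail", "reason": "missing context or output"}
    return {"context": context[:MAX_CHARS], "output": output[:MAX_CHARS]}
```

Truncation keeps latency and token cost bounded on long outputs, at the cost of ignoring claims past the cutoff.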