# Factual Accuracy
An LLM-based evaluator that extracts individual factual claims from the agent's output and verifies each one against a provided context (the source of truth). It returns the ratio of verified claims to total claims, making it effective at detecting hallucinations and unsupported assertions.
## Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | No | `gpt-4o-mini` | OpenAI model to use (must be in the allowed list) |
| `api_key` | string | No | env `OPENAI_API_KEY` | OpenAI API key |
| `context_field` | string | No | `input` | Field providing the context: `input` or `expectedOutput` |
| `strict` | boolean | No | `false` | If true, ambiguous or partially supported claims count as unverified |
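The defaulting rules in the table can be sketched as a small resolver. This is illustrative only: the function name, the allow-list contents, and the error messages are assumptions, not the evaluator's actual implementation.

```python
import os

# Illustrative allow-list; the real allowed models are defined by the evaluator.
ALLOWED_MODELS = {"gpt-4o-mini", "gpt-4o"}

def resolve_config(raw: dict) -> dict:
    """Apply the defaults from the config table (hypothetical sketch)."""
    model = raw.get("model", "gpt-4o-mini")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model {model!r} is not in the allowed list")
    context_field = raw.get("context_field", "input")
    if context_field not in ("input", "expectedOutput"):
        raise ValueError("context_field must be 'input' or 'expectedOutput'")
    return {
        "model": model,
        "api_key": raw.get("api_key") or os.environ.get("OPENAI_API_KEY"),
        "context_field": context_field,
        "strict": bool(raw.get("strict", False)),
    }
```

An empty config `{}` resolves to the defaults shown in the table.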
## Use Cases
- RAG pipelines — Verify that retrieval-augmented generation responses only contain facts supported by the retrieved documents, catching hallucinated details.
- Knowledge base agents — Ensure customer support agents don't fabricate product features, pricing, or policy details that aren't in the source material.
- Summarization quality — Check that summaries accurately reflect the original content without introducing new or contradictory information.
- Strict compliance — Enable `strict` mode for regulated domains where even ambiguous claims must be flagged for human review.
## Examples

### Basic factual check (context from input)

```json
{
  "model": "gpt-4o-mini",
  "context_field": "input"
}
```

The agent's input (e.g. retrieved documents) is used as the source of truth; the agent's output is checked for unsupported claims.

### Using expected output as context

```json
{
  "model": "gpt-4o",
  "context_field": "expectedOutput"
}
```

Useful for dataset evaluations where `expectedOutput` contains the ground truth.

### Strict mode

```json
{
  "model": "gpt-4o",
  "strict": true
}
```

Ambiguous claims are counted as unverified — better for medical, legal, or financial agents.

## Scoring
Returns verified_claims / total_claims, rounded to two decimal places. A score of 1.0 means every claim in the output is supported by the context. Scores at or above 0.5 are labeled "pass"; below 0.5 are labeled "fail." If no factual claims are found in the output, the score is 1.0. The reasoning field lists the claim counts and any specific unverified claims.
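The scoring rule above is simple enough to sketch directly. The function name and return shape here are assumptions for illustration; only the arithmetic (ratio, two-decimal rounding, 0.5 pass threshold, empty-claims case) comes from the description.

```python
def score_claims(verified: int, total: int) -> dict:
    """Score = verified / total, rounded to two decimals (sketch).

    If no factual claims were found (total == 0), the score is 1.0.
    Scores >= 0.5 are labeled "pass", below 0.5 "fail".
    """
    score = 1.0 if total == 0 else round(verified / total, 2)
    return {"score": score, "label": "pass" if score >= 0.5 else "fail"}
```

For example, 3 verified out of 4 claims scores 0.75 ("pass"), while 1 of 3 scores 0.33 ("fail").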
## Performance
Requires an OpenAI API call (typically 2–5 seconds depending on output length and model). Both context and output are sanitized and truncated to 4,000 characters before sending to the LLM. Missing context or output short-circuits to a fail immediately without making an API call.
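The pre-flight behavior described above — fail fast on missing data, truncate to 4,000 characters before the LLM call — can be sketched as follows. The function name and return shape are hypothetical; the 4,000-character limit and the short-circuit rule come from the text.

```python
MAX_CHARS = 4000  # truncation limit stated in the docs

def prepare_inputs(context, output):
    """Short-circuit to a fail when context or output is missing;
    otherwise truncate both sides before the API call (sketch)."""
    if not context or not output:
        # No API call is made in this case.
        return {"score": 0.0, "label": "fail", "reason": "missing context or output"}
    return {"context": context[:MAX_CHARS], "output": output[:MAX_CHARS]}
```

Truncation keeps latency and token cost bounded on long outputs, at the cost of ignoring claims past the cutoff.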