Similarity

Compares agent output against an expected output using TF-IDF cosine similarity. No LLM calls — runs locally.

Config

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| threshold | number | No | 0.7 | Minimum score to pass |
| tokenizer | string | No | word | word or char_ngram |
| ngram_size | number | No | 3 | Character n-gram size (when using char_ngram) |

How TF-IDF Similarity Works

The evaluator computes similarity in three steps:

  1. Tokenize — The agent output and expected output are split into tokens. With the word tokenizer, tokens are individual words. With char_ngram, tokens are overlapping character sequences of length ngram_size.
  2. Compute TF-IDF vectors — Each token is weighted by its term frequency (how often it appears in the text) multiplied by its inverse document frequency (how rare it is across both texts). Common words like "the" get low weight; distinctive words get high weight.
  3. Cosine similarity — The two TF-IDF vectors are compared using cosine similarity, which measures the angle between them. A score of 1.0 means the texts have identical term distributions; 0.0 means they share no terms.
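The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluator's actual implementation; it uses smoothed IDF (as scikit-learn does) so that terms appearing in both texts keep nonzero weight rather than being zeroed out by log(N/df):

```python
import math
from collections import Counter

def tfidf_cosine(text_a: str, text_b: str) -> float:
    """Tokenize, build TF-IDF vectors, and return their cosine similarity."""
    # Step 1: tokenize (word tokenizer: lowercase, split on whitespace)
    docs = [text_a.lower().split(), text_b.lower().split()]

    # Step 2: TF-IDF vectors. Smoothed IDF keeps terms shared by both
    # texts from dropping to zero weight.
    def vector(doc):
        tf = Counter(doc)
        return {
            term: (count / len(doc))
            * (math.log((1 + len(docs)) / (1 + sum(term in d for d in docs))) + 1)
            for term, count in tf.items()
        }

    u, v = vector(docs[0]), vector(docs[1])

    # Step 3: cosine similarity = dot product / product of vector norms
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.hypot(*u.values()) * math.hypot(*v.values())
    return dot / norm if norm else 0.0
```

Identical texts score 1.0, texts with no shared words score 0.0, and paraphrases land in between.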

Word vs Character N-gram Tokenizer

| Tokenizer | Best For | Example |
| --- | --- | --- |
| word | Semantic similarity — comparing meaning and topic coverage | "The refund was processed" vs "Your refund has been completed" (high score, shared key words) |
| char_ngram | Fuzzy matching — tolerating typos, abbreviations, and minor formatting differences | "John Smith" vs "Jon Smth" (still scores well due to overlapping character sequences) |

Use word (the default) in most cases. Switch to char_ngram when comparing short strings, proper nouns, or outputs where small character-level differences should not cause failure.
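A quick sketch of char_ngram tokenization (illustrative only, not the evaluator's internals) shows why "John Smith" and "Jon Smth" still overlap even though no whole word matches:

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    # Overlapping character windows of length n
    return {text[i : i + n] for i in range(len(text) - n + 1)}

# Trigrams such as "n S" and " Sm" survive the typos, so the two
# strings still share tokens; a word tokenizer would find zero overlap.
shared = char_ngrams("John Smith") & char_ngrams("Jon Smth")
```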

Use Cases

  • Regression testing — Compare agent output against golden answers from a dataset. Catch regressions when a model update or prompt change causes answers to drift from known-good responses.
  • Consistency checking — Run the same prompt multiple times and compare outputs against each other to measure how consistent your agent's responses are.
  • Paraphrase detection — Verify that the agent conveys the same information as the expected answer, even if the wording is different.
  • Translation and summarization QA — Compare translated or summarized output against reference texts to ensure key information is preserved.

Choosing a Threshold

| Threshold | Use Case | When to Use |
| --- | --- | --- |
| 0.9+ | Near-exact match | Factual answers, data extraction, structured responses where wording should closely match |
| 0.7 – 0.9 | Paraphrase detection | The agent should convey the same information but phrasing can vary |
| 0.5 – 0.7 | Loose similarity | Topic alignment — the response should be about the same subject but details may differ |
| < 0.5 | Rarely useful | At this level, the texts share very few terms; consider a different evaluator |

Start with the default threshold of 0.7 and adjust based on your evaluation results. Check the raw similarity scores in your dashboard to find the natural cutoff for your use case.

Examples

Basic word similarity

{
  "threshold": 0.8,
  "tokenizer": "word"
}

Fuzzy matching with character n-grams

// Tolerant of typos and minor differences
{
  "threshold": 0.6,
  "tokenizer": "char_ngram",
  "ngram_size": 3
}

Strict regression test

// Answers should closely match golden examples
{
  "threshold": 0.9,
  "tokenizer": "word"
}

Providing Expected Output

This evaluator requires an expectedOutput to compare against. There are two ways to provide it:

  • Datasets — Create a dataset with input/expected-output pairs. When you run evaluations against a dataset, 2Signal automatically supplies the expected output for each test case. This is the recommended approach for regression testing.
  • Trace metadata — Include expectedOutput in the metadata when sending traces via the SDK. Useful for real-time monitoring where you have a reference answer available at runtime.
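For the trace-metadata route, the payload might be shaped roughly like the sketch below. Only the expectedOutput key is taken from this page; the surrounding field names are hypothetical, so check the SDK reference for the exact call and structure:

```python
# Hypothetical trace payload sketch -- "input", "output", and "metadata"
# are illustrative names, not the SDK's confirmed API. Only the
# expectedOutput key inside metadata is documented on this page.
trace = {
    "input": "Where is my refund?",
    "output": "Your refund was processed yesterday.",
    "metadata": {
        "expectedOutput": "The refund has been processed.",
    },
}
```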

If no expected output is available, the evaluator returns a score of 0.0 with reasoning explaining that the comparison could not be performed.

Scoring

Returns the cosine similarity score as a continuous value between 0 and 1. Label is "pass" if score >= threshold, "fail" otherwise. Unlike binary evaluators, Similarity gives you a gradient — you can see exactly how close the output was to the expected answer.
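The labeling rule reduces to a one-line comparison (a sketch of the behavior described above, not the evaluator's source):

```python
def label(score: float, threshold: float = 0.7) -> str:
    # Continuous similarity score mapped to a binary pass/fail label;
    # a score exactly at the threshold passes.
    return "pass" if score >= threshold else "fail"
```

The continuous score is preserved alongside the label, which is what lets you inspect how close a near-miss actually was.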

Performance

Similarity runs entirely locally with no external API calls. TF-IDF computation and cosine similarity typically complete in under 5ms, even for long texts. This makes it safe to run on every trace without impacting throughput or incurring additional costs.
