Similarity

Compares agent output against an expected output using TF-IDF cosine similarity. No LLM calls — runs locally.

Config

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| threshold | number | No | 0.7 | Minimum score to pass |
| tokenizer | string | No | word | word or char_ngram |
| ngram_size | number | No | 3 | Character n-gram size (when using char_ngram) |

How TF-IDF Similarity Works

The evaluator computes similarity in three steps:

  1. Tokenize — The agent output and expected output are split into tokens. With the word tokenizer, tokens are individual words. With char_ngram, tokens are overlapping character sequences of length ngram_size.
  2. Compute TF-IDF vectors — Each token is weighted by its term frequency (how often it appears in the text) multiplied by its inverse document frequency (how rare it is across both texts). Common words like "the" get low weight; distinctive words get high weight.
  3. Cosine similarity — The two TF-IDF vectors are compared using cosine similarity, which measures the angle between them. A score of 1.0 means the texts have identical term distributions; 0.0 means they share no terms.
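The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluator's actual implementation; it uses smoothed IDF (as scikit-learn does) so that terms appearing in both texts keep nonzero weight rather than being zeroed out by log(N/df):

```python
import math
from collections import Counter

def tfidf_cosine(text_a: str, text_b: str) -> float:
    """Tokenize, build TF-IDF vectors, and return their cosine similarity."""
    # Step 1: tokenize (word tokenizer: lowercase, split on whitespace)
    docs = [text_a.lower().split(), text_b.lower().split()]

    # Step 2: TF-IDF vectors. Smoothed IDF keeps terms shared by both
    # texts from dropping to zero weight.
    def vector(doc):
        tf = Counter(doc)
        return {
            term: (count / len(doc))
            * (math.log((1 + len(docs)) / (1 + sum(term in d for d in docs))) + 1)
            for term, count in tf.items()
        }

    u, v = vector(docs[0]), vector(docs[1])

    # Step 3: cosine similarity = dot product / product of vector norms
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.hypot(*u.values()) * math.hypot(*v.values())
    return dot / norm if norm else 0.0
```

Identical texts score 1.0, texts with no shared words score 0.0, and paraphrases land in between.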

Word vs Character N-gram Tokenizer

| Tokenizer | Best For | Example |
| --- | --- | --- |
| word | Semantic similarity — comparing meaning and topic coverage | "The refund was processed" vs "Your refund has been completed" (high score, shared key words) |
| char_ngram | Fuzzy matching — tolerating typos, abbreviations, and minor formatting differences | "John Smith" vs "Jon Smth" (still scores well due to overlapping character sequences) |

Use word (the default) in most cases. Switch to char_ngram when comparing short strings, proper nouns, or outputs where small character-level differences should not cause failure.
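A quick sketch of char_ngram tokenization (illustrative only, not the evaluator's internals) shows why "John Smith" and "Jon Smth" still overlap even though no whole word matches:

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    # Overlapping character windows of length n
    return {text[i : i + n] for i in range(len(text) - n + 1)}

# Trigrams such as "n S" and " Sm" survive the typos, so the two
# strings still share tokens; a word tokenizer would find zero overlap.
shared = char_ngrams("John Smith") & char_ngrams("Jon Smth")
```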

Use Cases

  • Regression testing — Compare agent output against golden answers from a dataset. Catch regressions when a model update or prompt change causes answers to drift from known-good responses.
  • Consistency checking — Run the same prompt multiple times and compare outputs against each other to measure how consistent your agent's responses are.
  • Paraphrase detection — Verify that the agent conveys the same information as the expected answer, even if the wording is different.
  • Translation and summarization QA — Compare translated or summarized output against reference texts to ensure key information is preserved.

Choosing a Threshold

| Threshold | Use Case | When to Use |
| --- | --- | --- |
| 0.9+ | Near-exact match | Factual answers, data extraction, structured responses where wording should closely match |
| 0.7 – 0.9 | Paraphrase detection | The agent should convey the same information but phrasing can vary |
| 0.5 – 0.7 | Loose similarity | Topic alignment — the response should be about the same subject but details may differ |
| < 0.5 | Rarely useful | At this level, the texts share very few terms; consider a different evaluator |

Start with the default threshold of 0.7 and adjust based on your evaluation results. Check the raw similarity scores in your dashboard to find the natural cutoff for your use case.

Examples

Basic word similarity

{
  "threshold": 0.8,
  "tokenizer": "word"
}

Fuzzy matching with character n-grams

// Tolerant of typos and minor differences
{
  "threshold": 0.6,
  "tokenizer": "char_ngram",
  "ngram_size": 3
}

Strict regression test

// Answers should closely match golden examples
{
  "threshold": 0.9,
  "tokenizer": "word"
}

Providing Expected Output

This evaluator requires an expectedOutput to compare against. There are two ways to provide it:

  • Datasets — Create a dataset with input/expected-output pairs. When you run evaluations against a dataset, 2Signal automatically supplies the expected output for each test case. This is the recommended approach for regression testing.
  • Trace metadata — Include expectedOutput in the metadata when sending traces via the SDK. Useful for real-time monitoring where you have a reference answer available at runtime.
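For the trace-metadata route, the payload might be shaped roughly like the sketch below. Only the expectedOutput key is taken from this page; the surrounding field names are hypothetical, so check the SDK reference for the exact call and structure:

```python
# Hypothetical trace payload sketch -- "input", "output", and "metadata"
# are illustrative names, not the SDK's confirmed API. Only the
# expectedOutput key inside metadata is documented on this page.
trace = {
    "input": "Where is my refund?",
    "output": "Your refund was processed yesterday.",
    "metadata": {
        "expectedOutput": "The refund has been processed.",
    },
}
```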

If no expected output is available, the evaluator returns a score of 0.0 with reasoning explaining that the comparison could not be performed.

Scoring

Returns the cosine similarity score as a continuous value between 0 and 1. Label is "pass" if score >= threshold, "fail" otherwise. Unlike binary evaluators, Similarity gives you a gradient — you can see exactly how close the output was to the expected answer.
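The labeling rule reduces to a one-line comparison (a sketch of the behavior described above, not the evaluator's source):

```python
def label(score: float, threshold: float = 0.7) -> str:
    # Continuous similarity score mapped to a binary pass/fail label;
    # a score exactly at the threshold passes.
    return "pass" if score >= threshold else "fail"
```

The continuous score is preserved alongside the label, which is what lets you inspect how close a near-miss actually was.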

Performance

Similarity runs entirely locally with no external API calls. TF-IDF computation and cosine similarity typically complete in under 5ms, even for long texts. This makes it safe to run on every trace without impacting throughput or incurring additional costs.
