Latency
Scores traces based on response time. Set a target and max threshold — the score interpolates linearly between them.
What Gets Measured
The latency evaluator measures the total trace duration: the difference between the end_time and start_time of the root span. This captures the full end-to-end time your agent took to produce a response, including all LLM calls, tool invocations, and any intermediate processing.
If your trace has multiple LLM spans (e.g., a chain of reasoning steps), the latency score reflects the cumulative wall-clock time, not individual span durations.
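As a minimal sketch of the measurement (not the evaluator's actual implementation), assuming the root span is a record with ISO-8601 `start_time` and `end_time` fields:

```python
from datetime import datetime

def trace_duration_ms(root_span: dict) -> float:
    """Total trace duration: root span end_time minus start_time, in ms."""
    start = datetime.fromisoformat(root_span["start_time"])
    end = datetime.fromisoformat(root_span["end_time"])
    return (end - start).total_seconds() * 1000

# A trace spanning 2.5 seconds end to end
root = {
    "start_time": "2024-01-01T12:00:00+00:00",
    "end_time": "2024-01-01T12:00:02.500000+00:00",
}
print(trace_duration_ms(root))  # 2500.0
```

Child spans (LLM calls, tool invocations) never enter the calculation directly; only the root span's wall-clock window counts.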
Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
max_ms | number | Yes | — | Max acceptable latency in ms (score = 0 above this) |
target_ms | number | No | max_ms / 2 | Ideal latency in ms (score = 1 below this) |
Example
{
"max_ms": 5000,
"target_ms": 1000
}

Scoring
- Below target_ms: score = 1.0
- Between target_ms and max_ms: linear interpolation (1.0 → 0.0)
- Above max_ms: score = 0.0
How Scoring Works
The score is computed using a piecewise linear function:
if duration <= target_ms:
    score = 1.0
elif duration >= max_ms:
    score = 0.0
else:
    score = max(0, 1 - (duration - target_ms) / (max_ms - target_ms))

For example, with target_ms = 1000 and max_ms = 5000:
| Duration | Score |
|---|---|
| 500ms | 1.0 |
| 1000ms | 1.0 |
| 2000ms | 0.75 |
| 3000ms | 0.50 |
| 4000ms | 0.25 |
| 5000ms | 0.0 |
| 8000ms | 0.0 |
Use Cases
- SLA monitoring — enforce response-time guarantees for production agents. Set max_ms to your SLA limit and get alerted when traces breach it.
- User experience thresholds — keep chatbot responses fast enough that users don't abandon the conversation. Studies show users expect sub-2-second responses for conversational AI.
- Comparing model latencies — run the same evaluator across traces from different models (e.g., GPT-4o vs Claude Sonnet) to quantify speed differences.
- Regression detection — track latency scores over time to catch regressions from prompt changes, new tool integrations, or provider-side slowdowns.
Choosing Thresholds
Thresholds depend on your use case. Here are recommended starting points:
| Use Case | target_ms | max_ms | Rationale |
|---|---|---|---|
| Chatbot / conversational | 1000 | 5000 | Users expect near-instant replies; 5s feels unresponsive |
| Real-time / autocomplete | 200 | 1000 | Must feel instantaneous; any perceptible delay breaks UX |
| Batch processing / pipelines | 10000 | 60000 | Throughput matters more than individual latency |
| Agentic workflows | 5000 | 30000 | Multi-step reasoning takes time; set generous but bounded limits |
| API endpoints | 500 | 3000 | Downstream services often have their own timeouts |
Start with generous thresholds and tighten them as you understand your baseline. Use the dashboard's latency distribution chart to see where most traces land.
Combining with Cost
Latency and cost often trade off against each other. Faster models tend to cost more, and techniques like caching reduce latency but increase infrastructure cost. Use both evaluators together to find the sweet spot:
- Attach both a Latency and Cost evaluator to the same project.
- Filter traces in the dashboard by those that score well on both — these represent your optimal configurations.
- If latency scores are high but cost scores are low, consider a smaller or cheaper model.
- If cost scores are high but latency scores are low, consider caching, streaming, or parallelizing LLM calls.
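The filtering step can be sketched as follows (the trace records, score field names, and threshold are hypothetical, standing in for whatever your dashboard or export exposes):

```python
# Hypothetical trace records carrying both evaluator scores.
traces = [
    {"id": "a", "latency_score": 0.90, "cost_score": 0.80},
    {"id": "b", "latency_score": 0.95, "cost_score": 0.30},  # fast but expensive
    {"id": "c", "latency_score": 0.40, "cost_score": 0.90},  # cheap but slow
]

def sweet_spot(traces: list[dict], threshold: float = 0.7) -> list[dict]:
    """Traces that score well on BOTH latency and cost."""
    return [
        t for t in traces
        if t["latency_score"] >= threshold and t["cost_score"] >= threshold
    ]

print([t["id"] for t in sweet_spot(traces)])  # ['a']
```

Traces like "b" and "c" above are the ones worth investigating: each points to a different lever (cheaper model vs. caching/parallelization).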