Latency

Scores traces based on response time. Set a target and max threshold — the score interpolates linearly between them.

What Gets Measured

The latency evaluator measures the total trace duration: the difference between the end_time and start_time of the root span. This captures the full end-to-end time your agent took to produce a response, including all LLM calls, tool invocations, and any intermediate processing.

If your trace has multiple LLM spans (e.g., a chain of reasoning steps), the latency score reflects the cumulative wall-clock time, not individual span durations.
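As a sketch, the duration computation is just the difference of the root span's timestamps. The variable names below are illustrative, not part of any SDK:

```python
from datetime import datetime, timedelta

# Hypothetical root-span timestamps; real traces expose these via your tracing SDK.
start_time = datetime(2024, 1, 1, 12, 0, 0)
end_time = start_time + timedelta(milliseconds=2350)

# Total trace duration in milliseconds: end_time - start_time of the root span.
duration_ms = (end_time - start_time).total_seconds() * 1000
print(duration_ms)  # 2350.0
```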

Config

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `max_ms` | number | Yes | (none) | Max acceptable latency in ms (score = 0 above this) |
| `target_ms` | number | No | `max_ms / 2` | Ideal latency in ms (score = 1 below this) |

Example

{
  "max_ms": 5000,
  "target_ms": 1000
}

Scoring

  • Below target_ms: score = 1.0
  • Between target_ms and max_ms: linear interpolation (1.0 → 0.0)
  • Above max_ms: score = 0.0

How Scoring Works

The score is computed using a piecewise linear function:

if duration <= target_ms:
    score = 1.0
elif duration >= max_ms:
    score = 0.0
else:
    score = 1 - (duration - target_ms) / (max_ms - target_ms)

For example, with target_ms = 1000 and max_ms = 5000:

| Duration | Score |
| --- | --- |
| 500ms | 1.0 |
| 1000ms | 1.0 |
| 2000ms | 0.75 |
| 3000ms | 0.50 |
| 4000ms | 0.25 |
| 5000ms | 0.0 |
| 8000ms | 0.0 |

Use Cases

  • SLA monitoring — enforce response-time guarantees for production agents. Set max_ms to your SLA limit and get alerted when traces breach it.
  • User experience thresholds — keep chatbot responses fast enough that users don't abandon the conversation; users typically expect sub-2-second responses from conversational AI.
  • Comparing model latencies — run the same evaluator across traces from different models (e.g., GPT-4o vs Claude Sonnet) to quantify speed differences.
  • Regression detection — track latency scores over time to catch regressions from prompt changes, new tool integrations, or provider-side slowdowns.

Choosing Thresholds

Thresholds depend on your use case. Here are recommended starting points:

| Use Case | target_ms | max_ms | Rationale |
| --- | --- | --- | --- |
| Chatbot / conversational | 1000 | 5000 | Users expect near-instant replies; 5s feels unresponsive |
| Real-time / autocomplete | 200 | 1000 | Must feel instantaneous; any perceptible delay breaks UX |
| Batch processing / pipelines | 10000 | 60000 | Throughput matters more than individual latency |
| Agentic workflows | 5000 | 30000 | Multi-step reasoning takes time; set generous but bounded limits |
| API endpoints | 500 | 3000 | Downstream services often have their own timeouts |

Start with generous thresholds and tighten them as you understand your baseline. Use the dashboard's latency distribution chart to see where most traces land.
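One way to seed thresholds from a baseline, assuming you have exported a sample of trace durations (the percentile choices here are a suggestion, not a product feature): set `target_ms` near the median and `max_ms` near the 95th percentile.

```python
import statistics

# Hypothetical baseline: trace durations (ms) sampled from production.
durations_ms = [800, 950, 1100, 1200, 1500, 1800, 2400, 3100, 4200, 9000]

# statistics.quantiles with n=20 yields 19 cut points at 5% steps.
cuts = statistics.quantiles(durations_ms, n=20)
target_ms = cuts[9]   # ~p50: most traces should score 1.0
max_ms = cuts[18]     # ~p95: only outliers should score 0.0
print(target_ms, max_ms)
```

Tighten both values as your baseline improves.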

Combining with Cost

Latency and cost often trade off against each other. Faster models tend to cost more, and techniques like caching reduce latency but increase infrastructure cost. Use both evaluators together to find the sweet spot:

  • Attach both a Latency and Cost evaluator to the same project.
  • Filter traces in the dashboard by those that score well on both — these represent your optimal configurations.
  • If latency scores are high but cost scores are low, consider a smaller or cheaper model.
  • If cost scores are high but latency scores are low, consider caching, streaming, or parallelizing LLM calls.
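The filtering step can be sketched as a simple query over trace records. The record shape and the 0.7 cutoff are assumptions for illustration, not the dashboard's data model:

```python
# Hypothetical trace records carrying scores from both evaluators (0.0 to 1.0).
traces = [
    {"id": "t1", "latency_score": 0.9, "cost_score": 0.8},
    {"id": "t2", "latency_score": 0.95, "cost_score": 0.3},
    {"id": "t3", "latency_score": 0.4, "cost_score": 0.9},
]

# Keep configurations that score well on both dimensions.
sweet_spot = [t for t in traces if t["latency_score"] >= 0.7 and t["cost_score"] >= 0.7]
print([t["id"] for t in sweet_spot])  # ['t1']
```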
