Latency
Scores traces based on response time. Set a target and max threshold — the score interpolates linearly between them.
What Gets Measured
The latency evaluator measures the total trace duration: the difference between the end_time and start_time of the root span. This captures the full end-to-end time your agent took to produce a response, including all LLM calls, tool invocations, and any intermediate processing.
If your trace has multiple LLM spans (e.g., a chain of reasoning steps), the latency score reflects the cumulative wall-clock time, not individual span durations.
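As a minimal sketch of the measurement (not the evaluator's actual implementation), assuming the root span is a record with ISO-8601 `start_time` and `end_time` fields:

```python
from datetime import datetime

def trace_duration_ms(root_span: dict) -> float:
    """Total trace duration: root span end_time minus start_time, in ms."""
    start = datetime.fromisoformat(root_span["start_time"])
    end = datetime.fromisoformat(root_span["end_time"])
    return (end - start).total_seconds() * 1000

# A trace spanning 2.5 seconds end to end
root = {
    "start_time": "2024-01-01T12:00:00+00:00",
    "end_time": "2024-01-01T12:00:02.500000+00:00",
}
print(trace_duration_ms(root))  # 2500.0
```

Child spans (LLM calls, tool invocations) never enter the calculation directly; only the root span's wall-clock window counts.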
Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
max_ms | number | Yes | — | Max acceptable latency in ms (score = 0 above this) |
target_ms | number | No | max_ms / 2 | Ideal latency in ms (score = 1 below this) |
Example
{
"max_ms": 5000,
"target_ms": 1000
}

Scoring
- Below target_ms: score = 1.0
- Between target_ms and max_ms: linear interpolation (1.0 → 0.0)
- Above max_ms: score = 0.0
How Scoring Works
The score is computed using a piecewise linear function:
if duration <= target_ms:
    score = 1.0
elif duration >= max_ms:
    score = 0.0
else:
    score = max(0, 1 - (duration - target_ms) / (max_ms - target_ms))

For example, with target_ms = 1000 and max_ms = 5000:
| Duration | Score |
|---|---|
| 500ms | 1.0 |
| 1000ms | 1.0 |
| 2000ms | 0.75 |
| 3000ms | 0.50 |
| 4000ms | 0.25 |
| 5000ms | 0.0 |
| 8000ms | 0.0 |
Use Cases
- SLA monitoring — enforce response-time guarantees for production agents. Set max_ms to your SLA limit and get alerted when traces breach it.
- User experience thresholds — keep chatbot responses fast enough that users don't abandon the conversation. Studies show users expect sub-2-second responses for conversational AI.
- Comparing model latencies — run the same evaluator across traces from different models (e.g., GPT-4o vs Claude Sonnet) to quantify speed differences.
- Regression detection — track latency scores over time to catch regressions from prompt changes, new tool integrations, or provider-side slowdowns.
Choosing Thresholds
Thresholds depend on your use case. Here are recommended starting points:
| Use Case | target_ms | max_ms | Rationale |
|---|---|---|---|
| Chatbot / conversational | 1000 | 5000 | Users expect near-instant replies; 5s feels unresponsive |
| Real-time / autocomplete | 200 | 1000 | Must feel instantaneous; any perceptible delay breaks UX |
| Batch processing / pipelines | 10000 | 60000 | Throughput matters more than individual latency |
| Agentic workflows | 5000 | 30000 | Multi-step reasoning takes time; set generous but bounded limits |
| API endpoints | 500 | 3000 | Downstream services often have their own timeouts |
Start with generous thresholds and tighten them as you understand your baseline. Use the dashboard's latency distribution chart to see where most traces land.
Combining with Cost
Latency and cost often trade off against each other. Faster models tend to cost more, and techniques like caching reduce latency but increase infrastructure cost. Use both evaluators together to find the sweet spot:
- Attach both a Latency and Cost evaluator to the same project.
- Filter traces in the dashboard by those that score well on both — these represent your optimal configurations.
- If latency scores are high but cost scores are low, consider a smaller or cheaper model.
- If cost scores are high but latency scores are low, consider caching, streaming, or parallelizing LLM calls.
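The filtering step can be sketched as follows (the trace records, score field names, and threshold are hypothetical, standing in for whatever your dashboard or export exposes):

```python
# Hypothetical trace records carrying both evaluator scores.
traces = [
    {"id": "a", "latency_score": 0.90, "cost_score": 0.80},
    {"id": "b", "latency_score": 0.95, "cost_score": 0.30},  # fast but expensive
    {"id": "c", "latency_score": 0.40, "cost_score": 0.90},  # cheap but slow
]

def sweet_spot(traces: list[dict], threshold: float = 0.7) -> list[dict]:
    """Traces that score well on BOTH latency and cost."""
    return [
        t for t in traces
        if t["latency_score"] >= threshold and t["cost_score"] >= threshold
    ]

print([t["id"] for t in sweet_spot(traces)])  # ['a']
```

Traces like "b" and "c" above are the ones worth investigating: each points to a different lever (cheaper model vs. caching/parallelization).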