Trace Replay
Trace replay lets you re-execute a trace with different models, prompt templates, or model parameters. The original LLM spans are extracted, re-run with your overrides, and a new trace is created. The new trace goes through the standard evaluation pipeline, so you can directly compare scores between the original and replayed versions.
Use Cases
- Model comparison — replay a trace with gpt-4o-mini instead of gpt-4o to see if you can use a cheaper model without quality loss
- Prompt optimization — replay with a new prompt template version to measure the impact of prompt changes
- Parameter tuning — replay with different temperature, max_tokens, or other model parameters
- Regression testing — after a model update, replay production traces to check for regressions
- Cost optimization — compare the cost of different models on real production inputs
How It Works
- Select a trace — choose a trace from the dashboard or via the tRPC API
- Configure overrides — specify model, prompt template version, and/or model parameter overrides
- Trigger replay — the trace-replay worker extracts all LLM spans from the original trace
- Re-execute — each LLM span is re-run with the original input but your overrides applied
- New trace created — a new trace is created with the replayed results
- Auto-evaluate — the new trace goes through the standard eval pipeline, so all enabled evaluators score it
- Compare — use the trace comparison view to see original vs replayed side-by-side
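Step 4 (re-execute with the original input but your overrides applied) boils down to a merge: override fields win, everything else comes from the original span. A minimal sketch of that merge — the `LlmSpan` and `ReplayOverrides` shapes here are illustrative assumptions, not the actual internal types:

```typescript
// Hypothetical shapes — the real internal span types may differ.
interface LlmSpan {
  model: string;
  input: string;
  parameters: Record<string, number>;
}

interface ReplayOverrides {
  model?: string;
  modelParameters?: Record<string, number>;
}

// Build the request for re-execution: original input is always kept,
// overrides are applied on top of the original model and parameters.
function buildReplayRequest(span: LlmSpan, overrides: ReplayOverrides): LlmSpan {
  return {
    model: overrides.model ?? span.model,
    input: span.input, // never overridden — replay uses the original input
    parameters: { ...span.parameters, ...(overrides.modelParameters ?? {}) },
  };
}
```

Parameters not mentioned in the override are left untouched, so a replay that only changes `temperature` still uses the original `max_tokens`.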
Replay via Dashboard
Navigate to any trace detail page. Click the Replay button and configure your overrides:
- Model override — select a different LLM model
- Prompt template version — select a specific version of a prompt template
- Model parameters — override temperature, max_tokens, top_p, etc.
After triggering, the replay status shows as PENDING → RUNNING → COMPLETED (or FAILED). Once complete, click through to the replayed trace to see results.
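The status lifecycle above can be modeled as a small state machine. The transition table below is an assumption derived from the statuses listed; the actual worker may allow additional transitions (e.g. retries):

```typescript
type ReplayStatus = "PENDING" | "RUNNING" | "COMPLETED" | "FAILED";

// Assumed legal transitions: PENDING → RUNNING → COMPLETED | FAILED.
// COMPLETED and FAILED are terminal.
const transitions: Record<ReplayStatus, ReplayStatus[]> = {
  PENDING: ["RUNNING"],
  RUNNING: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: [],
};

function canTransition(from: ReplayStatus, to: ReplayStatus): boolean {
  return transitions[from].includes(to);
}
```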
Replay History
Every trace keeps a history of all replays. View the replay history from the trace detail page to see all previous replay attempts, their configurations, and results.
Override Types
| Override | Description | Example |
|---|---|---|
| model | Replace the LLM model for all LLM spans | Replay a gpt-4o trace with gpt-4o-mini |
| promptTemplateVersionId | Use a specific prompt template version | Test a new prompt version against real production inputs |
| modelParameters | Override model parameters (temperature, max_tokens, etc.) | Test with temperature: 0 for more deterministic outputs |
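The three override types can be combined in a single replay. A sketch of what a combined override payload might look like — the field names follow the table, but the id value and the exact payload shape are hypothetical:

```typescript
// Hypothetical payload combining all three override types from the table.
const replayOverrides = {
  model: "gpt-4o-mini",               // swap every LLM span to the cheaper model
  promptTemplateVersionId: "ptv_abc", // hypothetical version id — use your own
  modelParameters: {
    temperature: 0,                   // more deterministic outputs
    max_tokens: 512,
  },
};
```

Any override left out falls back to whatever the original trace used.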
Comparing Results
Use the trace comparison feature to view original and replayed traces side-by-side. The comparison shows:
- Duration, token count, and cost deltas with trend indicators
- Input/output differences
- Span tree comparison
- Score comparison across all evaluators
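The delta-with-trend-indicator computation for duration, token count, and cost can be sketched as below; the `TraceStats` shape is an assumption for illustration:

```typescript
// Hypothetical per-trace stats used by the comparison view.
interface TraceStats {
  durationMs: number;
  tokens: number;
  costUsd: number;
}

type Trend = "up" | "down" | "flat";

// Delta of replayed vs original, plus a trend indicator.
function delta(original: number, replayed: number): { diff: number; trend: Trend } {
  const diff = replayed - original;
  return { diff, trend: diff > 0 ? "up" : diff < 0 ? "down" : "flat" };
}

function compareTraces(original: TraceStats, replayed: TraceStats) {
  return {
    duration: delta(original.durationMs, replayed.durationMs),
    tokens: delta(original.tokens, replayed.tokens),
    cost: delta(original.costUsd, replayed.costUsd),
  };
}
```

A "down" trend on cost with a "flat" score comparison is the signal you are looking for in the cheaper-model use case.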
Concurrency
The trace-replay worker processes replay jobs with a concurrency of 3. This means up to 3 replays can run simultaneously per worker instance. Each replay involves live LLM API calls, so execution time depends on the model and prompt complexity.
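A concurrency cap like the worker's can be implemented with a small promise-based limiter. This is a generic sketch of the pattern, not the worker's actual queue implementation:

```typescript
// Minimal concurrency limiter: at most `limit` jobs run at once,
// mirroring the worker's concurrency of 3. Excess jobs wait in a FIFO queue.
function createLimiter(limit: number) {
  let active = 0;
  const queue: (() => void)[] = [];

  return async function run<T>(job: () => Promise<T>): Promise<T> {
    if (active >= limit) {
      // Wait until a running job finishes and wakes us up.
      await new Promise<void>((resolve) => queue.push(resolve));
    }
    active++;
    try {
      return await job();
    } finally {
      active--;
      queue.shift()?.(); // wake the next waiting job, if any
    }
  };
}
```

In practice most job-queue libraries (e.g. BullMQ's worker `concurrency` option) provide this cap for you; the sketch just shows the mechanism.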
Limitations
- Only LLM spans are re-executed. Tool calls, retrieval steps, and custom spans use the original results.
- Replay requires the original trace to still be within your data retention window.
- LLM API keys must be configured on the server for the models you want to replay with.