Trace Replay
Trace replay lets you re-execute a trace with different models, prompt templates, or model parameters. The original LLM spans are extracted, re-run with your overrides, and a new trace is created. The new trace goes through the standard evaluation pipeline, so you can directly compare scores between the original and replayed versions.
Use Cases
- Model comparison — replay a trace with gpt-4o-mini instead of gpt-4o to see if you can use a cheaper model without quality loss
- Prompt optimization — replay with a new prompt template version to measure the impact of prompt changes
- Parameter tuning — replay with different temperature, max_tokens, or other model parameters
- Regression testing — after a model update, replay production traces to check for regressions
- Cost optimization — compare the cost of different models on real production inputs
How It Works
- Select a trace — choose a trace from the dashboard or via the tRPC API
- Configure overrides — specify model, prompt template version, and/or model parameter overrides
- Trigger replay — the trace-replay worker extracts all LLM spans from the original trace
- Re-execute — each LLM span is re-run with the original input but your overrides applied
- New trace created — a new trace is created with the replayed results
- Auto-evaluate — the new trace goes through the standard eval pipeline, so all enabled evaluators score it
- Compare — use the trace comparison view to see original vs replayed side-by-side
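Step 4 (re-execute with the original input but your overrides applied) boils down to a merge: override fields win, everything else comes from the original span. A minimal sketch of that merge — the `LlmSpan` and `ReplayOverrides` shapes here are illustrative assumptions, not the actual internal types:

```typescript
// Hypothetical shapes — the real internal span types may differ.
interface LlmSpan {
  model: string;
  input: string;
  parameters: Record<string, number>;
}

interface ReplayOverrides {
  model?: string;
  modelParameters?: Record<string, number>;
}

// Build the request for re-execution: original input is always kept,
// overrides are applied on top of the original model and parameters.
function buildReplayRequest(span: LlmSpan, overrides: ReplayOverrides): LlmSpan {
  return {
    model: overrides.model ?? span.model,
    input: span.input, // never overridden — replay uses the original input
    parameters: { ...span.parameters, ...(overrides.modelParameters ?? {}) },
  };
}
```

Parameters not mentioned in the override are left untouched, so a replay that only changes `temperature` still uses the original `max_tokens`.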
Replay via Dashboard
Navigate to any trace detail page. Click the Replay button and configure your overrides:
- Model override — select a different LLM model
- Prompt template version — select a specific version of a prompt template
- Model parameters — override temperature, max_tokens, top_p, etc.
After triggering, the replay status shows as PENDING → RUNNING → COMPLETED (or FAILED). Once complete, click through to the replayed trace to see results.
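The status lifecycle above can be modeled as a small state machine. The transition table below is an assumption derived from the statuses listed; the actual worker may allow additional transitions (e.g. retries):

```typescript
type ReplayStatus = "PENDING" | "RUNNING" | "COMPLETED" | "FAILED";

// Assumed legal transitions: PENDING → RUNNING → COMPLETED | FAILED.
// COMPLETED and FAILED are terminal.
const transitions: Record<ReplayStatus, ReplayStatus[]> = {
  PENDING: ["RUNNING"],
  RUNNING: ["COMPLETED", "FAILED"],
  COMPLETED: [],
  FAILED: [],
};

function canTransition(from: ReplayStatus, to: ReplayStatus): boolean {
  return transitions[from].includes(to);
}
```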
Replay History
Every trace keeps a history of all replays. View the replay history from the trace detail page to see all previous replay attempts, their configurations, and results.
Override Types
| Override | Description | Example |
|---|---|---|
| model | Replace the LLM model for all LLM spans | Replay a gpt-4o trace with gpt-4o-mini |
| promptTemplateVersionId | Use a specific prompt template version | Test a new prompt version against real production inputs |
| modelParameters | Override model parameters (temperature, max_tokens, etc.) | Test with temperature: 0 for more deterministic outputs |
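The three override types can be combined in a single replay. A sketch of what a combined override payload might look like — the field names follow the table, but the id value and the exact payload shape are hypothetical:

```typescript
// Hypothetical payload combining all three override types from the table.
const replayOverrides = {
  model: "gpt-4o-mini",               // swap every LLM span to the cheaper model
  promptTemplateVersionId: "ptv_abc", // hypothetical version id — use your own
  modelParameters: {
    temperature: 0,                   // more deterministic outputs
    max_tokens: 512,
  },
};
```

Any override left out falls back to whatever the original trace used.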
Comparing Results
Use the trace comparison feature to view original and replayed traces side-by-side. The comparison shows:
- Duration, token count, and cost deltas with trend indicators
- Input/output differences
- Span tree comparison
- Score comparison across all evaluators
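The delta-with-trend-indicator computation for duration, token count, and cost can be sketched as below; the `TraceStats` shape is an assumption for illustration:

```typescript
// Hypothetical per-trace stats used by the comparison view.
interface TraceStats {
  durationMs: number;
  tokens: number;
  costUsd: number;
}

type Trend = "up" | "down" | "flat";

// Delta of replayed vs original, plus a trend indicator.
function delta(original: number, replayed: number): { diff: number; trend: Trend } {
  const diff = replayed - original;
  return { diff, trend: diff > 0 ? "up" : diff < 0 ? "down" : "flat" };
}

function compareTraces(original: TraceStats, replayed: TraceStats) {
  return {
    duration: delta(original.durationMs, replayed.durationMs),
    tokens: delta(original.tokens, replayed.tokens),
    cost: delta(original.costUsd, replayed.costUsd),
  };
}
```

A "down" trend on cost with a "flat" score comparison is the signal you are looking for in the cheaper-model use case.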
Concurrency
The trace-replay worker processes replay jobs with a concurrency of 3. This means up to 3 replays can run simultaneously per worker instance. Each replay involves live LLM API calls, so execution time depends on the model and prompt complexity.
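A concurrency cap like the worker's can be implemented with a small promise-based limiter. This is a generic sketch of the pattern, not the worker's actual queue implementation:

```typescript
// Minimal concurrency limiter: at most `limit` jobs run at once,
// mirroring the worker's concurrency of 3. Excess jobs wait in a FIFO queue.
function createLimiter(limit: number) {
  let active = 0;
  const queue: (() => void)[] = [];

  return async function run<T>(job: () => Promise<T>): Promise<T> {
    if (active >= limit) {
      // Wait until a running job finishes and wakes us up.
      await new Promise<void>((resolve) => queue.push(resolve));
    }
    active++;
    try {
      return await job();
    } finally {
      active--;
      queue.shift()?.(); // wake the next waiting job, if any
    }
  };
}
```

In practice most job-queue libraries (e.g. BullMQ's worker `concurrency` option) provide this cap for you; the sketch just shows the mechanism.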
Limitations
- Only LLM spans are re-executed. Tool calls, retrieval steps, and custom spans use the original results.
- Replay requires the original trace to still be within your data retention window.
- LLM API keys must be configured on the server for the models you want to replay with.