
Trace Replay

Trace replay lets you re-execute a trace with different models, prompt templates, or model parameters. The LLM spans from the original trace are extracted, re-run with your overrides, and assembled into a new trace. The new trace then goes through the standard evaluation pipeline, so you can directly compare scores between the original and replayed versions.

Use Cases

  • Model comparison — replay a trace with gpt-4o-mini instead of gpt-4o to see if you can use a cheaper model without quality loss
  • Prompt optimization — replay with a new prompt template version to measure the impact of prompt changes
  • Parameter tuning — replay with different temperature, max_tokens, or other model parameters
  • Regression testing — after a model update, replay production traces to check for regressions
  • Cost optimization — compare the cost of different models on real production inputs

How It Works

  1. Select a trace — choose a trace from the dashboard or via the tRPC API
  2. Configure overrides — specify model, prompt template version, and/or model parameter overrides
  3. Trigger replay — the trace-replay worker extracts all LLM spans from the original trace
  4. Re-execute — each LLM span is re-run with the original input but your overrides applied
  5. New trace created — a new trace is created with the replayed results
  6. Auto-evaluate — the new trace goes through the standard eval pipeline, so all enabled evaluators score it
  7. Compare — use the trace comparison view to see original vs replayed side-by-side
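The core of steps 3–4 is merging your overrides onto each extracted LLM span. A minimal sketch of that merge, assuming hypothetical `LlmSpanConfig` and `ReplayOverrides` shapes (the real types are internal to 2Signal):

```typescript
// Hypothetical shapes -- the actual span and override types are internal.
interface LlmSpanConfig {
  model: string;
  promptTemplateVersionId?: string;
  parameters: Record<string, number>;
}

interface ReplayOverrides {
  model?: string;
  promptTemplateVersionId?: string;
  modelParameters?: Record<string, number>;
}

// Keep the original span's input and config; apply only the overridden fields.
// Parameter overrides merge key-by-key, so unspecified parameters are kept.
function applyOverrides(
  original: LlmSpanConfig,
  overrides: ReplayOverrides,
): LlmSpanConfig {
  return {
    model: overrides.model ?? original.model,
    promptTemplateVersionId:
      overrides.promptTemplateVersionId ?? original.promptTemplateVersionId,
    parameters: { ...original.parameters, ...overrides.modelParameters },
  };
}
```

For example, overriding only the model and temperature leaves every other parameter from the original span untouched.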

Replay via Dashboard

Navigate to any trace detail page. Click the Replay button and configure your overrides:

  • Model override — select a different LLM model
  • Prompt template version — select a specific version of a prompt template
  • Model parameters — override temperature, max_tokens, top_p, etc.

After triggering, the replay status moves through PENDING → RUNNING → COMPLETED (or FAILED). Once complete, click through to the replayed trace to see results.

Replay History

Every trace keeps a history of all replays. View the replay history from the trace detail page to see all previous replay attempts, their configurations, and results.

Override Types

| Override | Description | Example |
| --- | --- | --- |
| model | Replace the LLM model for all LLM spans | Replay a gpt-4o trace with gpt-4o-mini |
| promptTemplateVersionId | Use a specific prompt template version | Test a new prompt version against real production inputs |
| modelParameters | Override model parameters (temperature, max_tokens, etc.) | Test with temperature: 0 for more deterministic outputs |
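The three override types can be combined in a single replay. An illustrative configuration (the exact request shape and the `ptv_...` ID format are assumptions here, not the documented API):

```typescript
// Illustrative replay configuration combining all three override types.
// The field names follow the table above; the ID value is hypothetical.
const replayConfig = {
  model: "gpt-4o-mini",
  promptTemplateVersionId: "ptv_123", // hypothetical prompt template version ID
  modelParameters: { temperature: 0, max_tokens: 1024 },
};
```

Any field you omit falls back to the value from the original trace.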

Comparing Results

Use the trace comparison feature to view original and replayed traces side-by-side. The comparison shows:

  • Duration, token count, and cost deltas with trend indicators
  • Input/output differences
  • Span tree comparison
  • Score comparison across all evaluators
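The delta math behind the trend indicators is simple: replayed minus original per metric. A sketch, assuming a hypothetical `TraceSummary` shape with duration, token, and cost fields:

```typescript
// Hypothetical per-trace summary; the real comparison view's fields may differ.
interface TraceSummary {
  durationMs: number;
  tokens: number;
  costUsd: number;
}

type Trend = "up" | "down" | "flat";

// A delta is the signed difference plus a trend indicator for the UI.
function delta(original: number, replayed: number): { diff: number; trend: Trend } {
  const diff = replayed - original;
  return { diff, trend: diff > 0 ? "up" : diff < 0 ? "down" : "flat" };
}

function compare(original: TraceSummary, replayed: TraceSummary) {
  return {
    duration: delta(original.durationMs, replayed.durationMs),
    tokens: delta(original.tokens, replayed.tokens),
    cost: delta(original.costUsd, replayed.costUsd),
  };
}
```

A "down" trend on cost with a "flat" trend on scores is the signal you are looking for when testing a cheaper model.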

Concurrency

The trace-replay worker processes replay jobs with a concurrency of 3. This means up to 3 replays can run simultaneously per worker instance. Each replay involves live LLM API calls, so execution time depends on the model and prompt complexity.
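A concurrency limit of 3 means the worker keeps at most three replays in flight and starts the next one as soon as a slot frees up. An illustrative sketch of that pattern (not the actual worker code):

```typescript
// Run jobs with at most `limit` in flight; results keep the input order.
// JavaScript's single-threaded event loop makes the shared `next` index safe.
async function runWithConcurrency<T>(
  jobs: (() => Promise<T>)[],
  limit = 3,
): Promise<T[]> {
  const results: T[] = new Array(jobs.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < jobs.length) {
      const i = next++; // claim the next job slot
      results[i] = await jobs[i]();
    }
  }

  // Spawn up to `limit` workers that pull jobs until the queue is drained.
  await Promise.all(
    Array.from({ length: Math.min(limit, jobs.length) }, worker),
  );
  return results;
}
```

Because each job here would be a live LLM call, total replay time is roughly the sum of call latencies divided by the concurrency, bounded by the slowest calls.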

Limitations

  • Only LLM spans are re-executed. Tool calls, retrieval steps, and custom spans use the original results.
  • Replay requires the original trace to still be within your data retention window.
  • LLM API keys must be configured on the server for the models you want to replay with.
