Cookbook

Trace Replay & A/B Testing

Re-run production traces with different models, prompts, or parameters — then compare the results automatically. Trace replay lets you A/B test changes against real-world inputs without redeploying your agent.

How It Works

  1. Pick a trace (or set of traces) from the dashboard.
  2. Configure overrides: model, prompt template version, or model parameters (temperature, max_tokens, etc.).
  3. 2Signal extracts the LLM spans from the original trace, re-executes them with your overrides, and creates a new trace.
  4. The new trace goes through the full eval pipeline — you get scores automatically.
  5. Compare original vs. replayed trace side-by-side.
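The override configuration in step 2 can be sketched as a small payload builder. Note: the field names (`trace_id`, `overrides`, `prompt_version`, `parameters`) are illustrative assumptions, not the documented 2Signal request shape.

```python
# Build a replay request for a single trace, overriding model, prompt
# version, and/or model parameters.
# NOTE: field names here are illustrative guesses, not the actual 2Signal API.

def build_replay_request(trace_id, model=None, prompt_version=None, params=None):
    """Assemble the override payload for replaying one trace."""
    overrides = {}
    if model is not None:
        overrides["model"] = model
    if prompt_version is not None:
        overrides["prompt_version"] = prompt_version
    if params is not None:
        overrides["parameters"] = params  # e.g. {"temperature": 0.3}
    return {"trace_id": trace_id, "overrides": overrides}

request = build_replay_request(
    "tr_abc123",
    model="gpt-4o-mini",
    params={"temperature": 0.3, "max_tokens": 512},
)
```

Only the fields you set are overridden; anything omitted is taken from the original trace.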

Use Case 1: Model Comparison

You have production traces using gpt-4o. You want to test whether gpt-4o-mini gives similar quality at lower cost. Replay your traces with the model override set to gpt-4o-mini and compare evaluator scores across both runs. You get a direct, apples-to-apples comparison on the same inputs.
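A minimal sketch of that comparison, assuming each run exposes per-evaluator scores and a cost figure (the structure below is illustrative; real replay results come from the eval pipeline):

```python
# Compare evaluator scores and cost between an original run and its replay.
# The score/cost dict shape is illustrative, not a 2Signal response format.

def compare_runs(original, replay):
    """Return per-evaluator score deltas and the relative cost change."""
    deltas = {
        name: round(replay["scores"][name] - score, 3)
        for name, score in original["scores"].items()
    }
    cost_change = (replay["cost_usd"] - original["cost_usd"]) / original["cost_usd"]
    return deltas, cost_change

original = {"scores": {"accuracy": 0.92, "tone": 0.88}, "cost_usd": 0.040}
replay = {"scores": {"accuracy": 0.90, "tone": 0.89}, "cost_usd": 0.004}

deltas, cost_change = compare_runs(original, replay)
```

In this example accuracy drops by 0.02 while cost falls by 90%, which makes the trade-off explicit before you commit to the cheaper model.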

Use Case 2: Prompt Iteration

Push a new prompt template version (v2) via the REST API or dashboard. Then replay a set of traces with the new prompt version and compare scores against v1. This tells you exactly how the prompt change affects quality before you roll it out to production.
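One way to turn that v1-vs-v2 comparison into a rollout decision is a simple regression gate; the threshold and evaluator names below are illustrative assumptions:

```python
# Gate a prompt rollout on replayed evaluator scores: v2 passes only if no
# evaluator regresses by more than max_regression versus v1.
# Evaluator names and the 0.05 tolerance are illustrative choices.

def safe_to_roll_out(v1_scores, v2_scores, max_regression=0.05):
    """True if every evaluator in v2 is within tolerance of its v1 score."""
    return all(
        v2_scores[name] >= v1_scores[name] - max_regression
        for name in v1_scores
    )

v1 = {"helpfulness": 0.86, "format_compliance": 1.00}
v2 = {"helpfulness": 0.91, "format_compliance": 0.97}

verdict = safe_to_roll_out(v1, v2)  # format dipped 0.03, within tolerance
```

A per-evaluator gate like this catches a prompt that improves one score while quietly breaking another, which an averaged score would hide.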

Use Case 3: Parameter Tuning

Replay traces with different temperature or max_tokens settings to find the optimal configuration. For example, test whether lowering temperature from 0.7 to 0.3 improves consistency on your evaluators without hurting creativity scores.
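A parameter sweep like this reduces to picking the configuration with the best weighted score across evaluators. The numbers below are stand-ins; in practice each entry would come from replaying the same traces at that temperature:

```python
# Sweep temperature values and pick the one maximizing a weighted sum of
# evaluator scores. Scores here are example stand-ins for replay results.

replay_results = {
    0.3: {"consistency": 0.94, "creativity": 0.78},
    0.5: {"consistency": 0.90, "creativity": 0.82},
    0.7: {"consistency": 0.84, "creativity": 0.85},
}

def best_temperature(results, weights):
    """Return the temperature whose weighted evaluator scores are highest."""
    def combined(temp):
        return sum(weights[name] * score for name, score in results[temp].items())
    return max(results, key=combined)

# Weight consistency more heavily than creativity for this agent.
choice = best_temperature(replay_results, {"consistency": 0.7, "creativity": 0.3})
```

The weights encode the quality trade-off you actually care about, so the same sweep data can answer different tuning questions.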

Replaying from the Dashboard

  1. Go to a trace detail page.
  2. Click Replay.
  3. Select overrides (model, prompt version, parameters).
  4. Review results in the replay history tab.

Comparing Results

  • Use the trace comparison view to see original vs. replay side-by-side.
  • Score deltas are shown for each evaluator.
  • Cost and latency differences are highlighted.
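The highlighting logic can be sketched as a diff that only surfaces deltas above a threshold; the thresholds and data shape are illustrative, not the comparison view's actual rules:

```python
# Flag notable differences between an original trace and its replay,
# mirroring the kind of highlighting the comparison view does.
# Thresholds and the dict shape are illustrative assumptions.

def highlight_diffs(original, replay, score_threshold=0.05, cost_threshold=0.10):
    """Return human-readable flags for score and cost changes that matter."""
    flags = []
    for name, score in original["scores"].items():
        delta = replay["scores"][name] - score
        if abs(delta) >= score_threshold:
            flags.append(f"{name}: {delta:+.2f}")
    rel_cost = (replay["cost_usd"] - original["cost_usd"]) / original["cost_usd"]
    if abs(rel_cost) >= cost_threshold:
        flags.append(f"cost: {rel_cost:+.0%}")
    return flags

flags = highlight_diffs(
    {"scores": {"accuracy": 0.90, "tone": 0.80}, "cost_usd": 0.02},
    {"scores": {"accuracy": 0.82, "tone": 0.81}, "cost_usd": 0.01},
)
```

Small, noisy deltas stay quiet; only the changes worth a human look get surfaced.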

Tips

  • Start with traces from your highest-traffic workflows for the most representative A/B test.
  • Run replays against a dataset for batch A/B testing.
  • Use structural evaluators (CONTAINS, REGEX_MATCH, JSON_SCHEMA) to catch format regressions before checking semantic quality with LLM_JUDGE.
  • Track replay history per trace to see how different configurations perform over time.
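To illustrate the cheap-checks-first tip, here are simplified local stand-ins for the structural checks named above; they are not the built-in CONTAINS / REGEX_MATCH / JSON_SCHEMA evaluators, just sketches of what each one verifies:

```python
import json
import re

# Simplified stand-ins for structural checks: run these on replayed outputs
# to catch format regressions before spending on LLM_JUDGE evals.

def contains(output, needle):
    """CONTAINS-style check: substring presence."""
    return needle in output

def regex_match(output, pattern):
    """REGEX_MATCH-style check: pattern occurs somewhere in the output."""
    return re.search(pattern, output) is not None

def has_required_keys(output, required):
    """Crude JSON-shape check: output parses to a dict with required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required)

replayed = '{"answer": "42", "sources": []}'
structurally_ok = (
    contains(replayed, "answer")
    and regex_match(replayed, r'"sources":\s*\[')
    and has_required_keys(replayed, ["answer", "sources"])
)
```

If a replay fails these, there is no point scoring it semantically; the format regression alone disqualifies the configuration.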

What's Next

  • Trace Replay Guide — Full reference on replay configuration, status tracking, and batch replays.

Have questions? Join our community!

Connect with other developers and the 2Signal team.

Join Discord