# Trace Replay Testing
Use real production traces to A/B test models, prompts, and configurations — without affecting users.
## What trace replay does
- Takes an existing trace with LLM spans
- Re-executes the LLM calls with your overrides (model, prompt version, parameters)
- Creates a brand new trace from the replayed execution
- Runs your full eval pipeline on the new trace
- Lets you compare original and replayed results side by side
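The flow above can be sketched as a single replay request. Everything here is illustrative: `ReplayOverrides`, `build_replay_request`, and the payload field names are assumptions, not the platform's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical request shape -- field names are illustrative,
# not the platform's real API.
@dataclass
class ReplayOverrides:
    model: Optional[str] = None
    prompt_version: Optional[str] = None
    params: dict = field(default_factory=dict)

def build_replay_request(trace_id: str, overrides: ReplayOverrides) -> dict:
    """Assemble a payload that re-executes a trace's LLM spans with overrides."""
    body: dict = {"trace_id": trace_id, "overrides": dict(overrides.params)}
    if overrides.model:
        body["overrides"]["model"] = overrides.model
    if overrides.prompt_version:
        body["overrides"]["prompt_version"] = overrides.prompt_version
    return body

req = build_replay_request("tr_abc123", ReplayOverrides(model="gpt-4o-mini"))
print(req)
```

The replayed execution produces a brand-new trace, so the original stays untouched for comparison.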
## A/B testing models
Scenario: You're using gpt-4o in production but want to test if a cheaper model gives acceptable quality.
### Steps
- Select 20-50 representative production traces
- Replay each with a model override: gpt-4o-mini
- Compare evaluator scores between original (gpt-4o) and replay (gpt-4o-mini)
- If quality holds, route simple queries to the cheaper model
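"If quality holds" can be made concrete with a simple acceptance gate; the 0.05 maximum drop below is an assumed threshold you would tune for your own use case, not a recommended value.

```python
# Hypothetical acceptance gate: the max_drop threshold is an
# assumption to tune per use case, not a platform default.
def quality_holds(original_score: float, replay_score: float,
                  max_drop: float = 0.05) -> bool:
    """True when the cheaper model's score drop stays within budget."""
    return (original_score - replay_score) <= max_drop

print(quality_holds(0.88, 0.85))  # drop of 0.03, within budget
print(quality_holds(0.88, 0.80))  # drop of 0.08, too large
```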
### What to compare
| Metric | Original (gpt-4o) | Replay (gpt-4o-mini) | Delta |
|---|---|---|---|
| Helpfulness (LLM Judge) | 0.88 | 0.82 | -0.06 |
| Format check (JSON Schema) | 1.00 | 0.98 | -0.02 |
| Avg cost | $0.045 | $0.008 | -82% |
| Avg latency | 2.1s | 0.9s | -57% |
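The delta columns follow directly from the raw numbers: score rows use the absolute difference, cost and latency rows use the percentage change. A quick check against the table above:

```python
def delta(original: float, replay: float) -> float:
    """Absolute change from original to replay (score rows)."""
    return round(replay - original, 3)

def pct_delta(original: float, replay: float) -> int:
    """Percentage change (cost and latency rows)."""
    return round((replay - original) / original * 100)

print(delta(0.88, 0.82))        # helpfulness: -0.06
print(pct_delta(0.045, 0.008))  # avg cost: -82
print(pct_delta(2.1, 0.9))      # avg latency: -57
```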
## A/B testing prompts
Scenario: You've written a new prompt (v2) and want to validate it before deploying.
### Steps
- Push the new prompt version via the API
- Select production traces that used the current prompt
- Replay with prompt template version override
- Compare scores — especially LLM Judge and Similarity
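The steps above amount to pinning each selected trace to the new prompt version. A minimal sketch, assuming a per-trace config shape that is illustrative rather than a documented format:

```python
# Hypothetical batch of prompt-version overrides -- the config shape
# is an assumption, not a documented format.
def prompt_replay_configs(trace_ids: list, new_version: str) -> list:
    """One replay config per production trace, pinned to the new prompt."""
    return [{"trace_id": t, "overrides": {"prompt_version": new_version}}
            for t in trace_ids]

configs = prompt_replay_configs(["tr_1", "tr_2", "tr_3"], "v2")
print(len(configs))  # 3
```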
## A/B testing parameters
Test different temperature, max_tokens, or top_p settings:
- Lower temperature: more consistent but less creative
- Higher max_tokens: more complete but more expensive
- Find the sweet spot for your use case
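Finding the sweet spot reduces to sweeping a parameter and picking the best-scoring setting. The average judge scores below are made-up numbers; the point is the selection step:

```python
# Made-up average judge scores per temperature from a replay sweep;
# the numbers are invented to illustrate the selection step.
avg_scores = {0.0: 0.84, 0.3: 0.88, 0.7: 0.86}

best_temperature = max(avg_scores, key=avg_scores.get)
print(best_temperature)  # 0.3
```

In practice you would weigh quality against cost and latency at each setting, not score alone.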
## Batch replay workflow
- Go to the Traces page, filter to your target population
- Select multiple traces
- Click “Batch Replay”
- Configure overrides
- Review aggregate results when complete
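Reviewing aggregate results means rolling per-trace evaluator scores up into per-evaluator means. A sketch, assuming an export shape that is illustrative rather than the platform's real format:

```python
from collections import defaultdict

# Illustrative per-trace evaluator scores from a batch replay; the
# row shape is an assumption, not the platform's real export format.
results = [
    {"trace": "tr_1", "helpfulness": 0.90, "format": 1.0},
    {"trace": "tr_2", "helpfulness": 0.80, "format": 1.0},
    {"trace": "tr_3", "helpfulness": 0.85, "format": 0.9},
]

by_evaluator = defaultdict(list)
for row in results:
    for key, score in row.items():
        if key != "trace":
            by_evaluator[key].append(score)

means = {k: sum(v) / len(v) for k, v in by_evaluator.items()}
print(means)
```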
## Statistical significance
- Replay at least 30 traces for meaningful comparison
- Look for consistent patterns, not individual outliers
- Pay attention to the score distribution, not just averages
- If results are mixed, increase your sample size
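One way to check for a consistent pattern rather than outliers is a rough paired t-statistic over per-trace deltas (replay minus original). The deltas below are invented to illustrate the calculation:

```python
import statistics

# Paired per-trace score deltas (replay minus original); the numbers
# are invented to illustrate the calculation.
deltas = [-0.05, -0.08, 0.01, -0.06, -0.04, 0.02, -0.07, -0.05,
          -0.03, -0.06]

mean_delta = statistics.mean(deltas)
stdev_delta = statistics.stdev(deltas)
# Rough paired t-statistic: with 30+ traces, |t| well above ~2
# suggests a consistent pattern rather than individual outliers.
t_stat = mean_delta / (stdev_delta / len(deltas) ** 0.5)
print(round(mean_delta, 3), round(t_stat, 2))
```

Plotting the full delta distribution catches bimodal results (some traces improve, others regress) that a single average would hide.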
## Tips
- Focus replays on diverse traces — don't just replay the easy ones
- Check for format regressions first (free evaluators) before analyzing semantic quality
- Keep a record of your replay experiments and decisions
- Replay periodically after model provider updates — their models change too
See the Trace Replay Cookbook and Trace Replay Guide for more details.