Trace Replay Testing

Use real production traces to A/B test models, prompts, and configurations — without affecting users.

What trace replay does

  1. Takes an existing trace with LLM spans
  2. Re-executes the LLM calls with your overrides (model, prompt version, parameters)
  3. Creates a brand new trace from the replayed execution
  4. Runs your full eval pipeline on the new trace
  5. Lets you compare original vs. replayed results side by side
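
The flow above starts from a replay request built out of your overrides. Here is a minimal sketch of that request payload; the field names (`trace_id`, `overrides`, `prompt_version`) are illustrative assumptions, not the actual 2Signal API schema:

```python
# Sketch of a replay request payload. Field names are illustrative
# assumptions, not the actual 2Signal API schema.

def build_replay_payload(trace_id, model=None, prompt_version=None, params=None):
    """Collect only the overrides you actually set; everything else
    falls back to the values recorded in the original trace."""
    overrides = {}
    if model is not None:
        overrides["model"] = model
    if prompt_version is not None:
        overrides["prompt_version"] = prompt_version
    if params:
        overrides["parameters"] = params
    return {"trace_id": trace_id, "overrides": overrides}

# Replay a trace with a cheaper model and a lower temperature
payload = build_replay_payload("tr_123", model="gpt-4o-mini",
                               params={"temperature": 0.2})
```

Anything not present in `overrides` keeps the value captured in the original trace, which is what makes the comparison fair: only the variable you are testing changes.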

A/B testing models

Scenario: You're using gpt-4o in production but want to test if a cheaper model gives acceptable quality.

Steps

  1. Select 20-50 representative production traces
  2. Replay each with model override: gpt-4o-mini
  3. Compare evaluator scores between original (gpt-4o) and replay (gpt-4o-mini)
  4. If quality holds, route simple queries to the cheaper model

What to compare

Metric                       Original (gpt-4o)   Replay (gpt-4o-mini)   Delta
Helpfulness (LLM Judge)      0.88                0.82                   -0.06
Format check (JSON Schema)   1.00                0.98                   -0.02
Avg cost                     $0.045              $0.008                 -82%
Avg latency                  2.1s                0.9s                   -57%
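
The deltas in the table mix two conventions: score metrics use absolute differences, while cost and latency use percent change. A small helper makes the arithmetic explicit:

```python
# Score metrics (0-1 range): absolute delta is most readable.
def score_delta(original, replay):
    return round(replay - original, 2)

# Cost and latency: percent change relative to the original.
def pct_change(original, replay):
    return round((replay - original) / original * 100)

helpfulness = score_delta(0.88, 0.82)   # -0.06
cost = pct_change(0.045, 0.008)         # -82 (%)
latency = pct_change(2.1, 0.9)          # -57 (%)
```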

A/B testing prompts

Scenario: You've written a new prompt (v2) and want to validate it before deploying.

Steps

  1. Push the new prompt version via the API
  2. Select production traces that used the current prompt
  3. Replay with prompt template version override
  4. Compare scores — especially LLM Judge and Similarity
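
Step 2 matters because mixing prompt versions contaminates the comparison. A sketch of the filtering logic, assuming a simplified trace record shape (not the actual 2Signal trace schema):

```python
# Illustrative: keep only traces that used the current prompt version,
# so every replay compares v1 output against v2 output on the same input.
# The record shape below is an assumption, not the 2Signal trace schema.

def traces_using_prompt(traces, prompt_name, version):
    return [
        t for t in traces
        if t.get("prompt_name") == prompt_name
        and t.get("prompt_version") == version
    ]

traces = [
    {"id": "tr_1", "prompt_name": "support-agent", "prompt_version": 1},
    {"id": "tr_2", "prompt_name": "support-agent", "prompt_version": 2},
    {"id": "tr_3", "prompt_name": "summarizer",    "prompt_version": 1},
]
candidates = traces_using_prompt(traces, "support-agent", 1)
```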

A/B testing parameters

Test different temperature, max_tokens, or top_p settings:

  • Lower temperature: more consistent but less creative
  • Higher max_tokens: more complete but more expensive
  • Find the sweet spot for your use case
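
Finding the sweet spot is easiest as a small grid sweep: replay the same traces once per combination of settings and compare evaluator scores across runs. A sketch:

```python
# Build a small parameter grid to replay against. Each dict is one
# replay configuration; the parameter names mirror common LLM settings.
from itertools import product

temperatures = [0.0, 0.3, 0.7]
max_tokens_options = [256, 512]

grid = [
    {"temperature": t, "max_tokens": m}
    for t, m in product(temperatures, max_tokens_options)
]
# 3 temperatures x 2 max_tokens values = 6 replay configurations
```

Keep the grid small: every configuration multiplies your replay cost by the number of traces in the sample.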

Batch replay workflow

  1. Go to the Traces page, filter to your target population
  2. Select multiple traces
  3. Click “Batch Replay”
  4. Configure overrides
  5. Review aggregate results when complete

Statistical significance

  • Replay at least 30 traces for meaningful comparison
  • Look for consistent patterns, not individual outliers
  • Pay attention to the score distribution, not just averages
  • If results are mixed, increase your sample size
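
To see why distributions matter more than averages, compare summary statistics rather than a single mean. A self-contained sketch using made-up scores:

```python
# A similar mean can hide a wider spread or a tail of bad outputs
# in the replayed model. Summarize the distribution, not just the mean.
from statistics import mean, stdev

def summarize(scores):
    return {
        "n": len(scores),
        "mean": mean(scores),
        "stdev": stdev(scores),
        "min": min(scores),
    }

# Illustrative evaluator scores, not real data.
original = [0.90, 0.85, 0.88, 0.92, 0.87, 0.90, 0.86, 0.89]
replay   = [0.90, 0.84, 0.60, 0.91, 0.85, 0.88, 0.55, 0.90]

# Means are close, but the replay's min and stdev expose two outliers
# that an average alone would have smoothed over.
```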

Tips

  • Focus replays on diverse traces — don't just replay the easy ones
  • Check for format regressions first (using the free, non-LLM evaluators such as JSON Schema checks) before analyzing semantic quality
  • Keep a record of your replay experiments and decisions
  • Replay periodically after model provider updates — their models change too

See the Trace Replay Cookbook and Trace Replay Guide for more details.
