# Trace Replay Testing
Use real production traces to A/B test models, prompts, and configurations — without affecting users.
## What trace replay does
- Takes an existing trace with LLM spans
- Re-executes the LLM calls with your overrides (model, prompt version, parameters)
- Creates a brand new trace from the replayed execution
- Runs your full eval pipeline on the new trace
- Lets you compare original and replayed results side by side
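The flow above can be sketched as a single replay request. Everything here is illustrative: `ReplayOverrides`, `build_replay_request`, and the payload field names are assumptions, not the platform's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical request shape -- field names are illustrative,
# not the platform's real API.
@dataclass
class ReplayOverrides:
    model: Optional[str] = None
    prompt_version: Optional[str] = None
    params: dict = field(default_factory=dict)

def build_replay_request(trace_id: str, overrides: ReplayOverrides) -> dict:
    """Assemble a payload that re-executes a trace's LLM spans with overrides."""
    body: dict = {"trace_id": trace_id, "overrides": dict(overrides.params)}
    if overrides.model:
        body["overrides"]["model"] = overrides.model
    if overrides.prompt_version:
        body["overrides"]["prompt_version"] = overrides.prompt_version
    return body

req = build_replay_request("tr_abc123", ReplayOverrides(model="gpt-4o-mini"))
print(req)
```

The replayed execution produces a brand-new trace, so the original stays untouched for comparison.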
## A/B testing models
Scenario: You're using gpt-4o in production but want to test if a cheaper model gives acceptable quality.
### Steps
- Select 20-50 representative production traces
- Replay each with a model override: gpt-4o-mini
- Compare evaluator scores between original (gpt-4o) and replay (gpt-4o-mini)
- If quality holds, route simple queries to the cheaper model
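"If quality holds" can be made concrete with a simple acceptance gate; the 0.05 maximum drop below is an assumed threshold you would tune for your own use case, not a recommended value.

```python
# Hypothetical acceptance gate: the max_drop threshold is an
# assumption to tune per use case, not a platform default.
def quality_holds(original_score: float, replay_score: float,
                  max_drop: float = 0.05) -> bool:
    """True when the cheaper model's score drop stays within budget."""
    return (original_score - replay_score) <= max_drop

print(quality_holds(0.88, 0.85))  # drop of 0.03, within budget
print(quality_holds(0.88, 0.80))  # drop of 0.08, too large
```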
### What to compare
| Metric | Original (gpt-4o) | Replay (gpt-4o-mini) | Delta |
|---|---|---|---|
| Helpfulness (LLM Judge) | 0.88 | 0.82 | -0.06 |
| Format check (JSON Schema) | 1.00 | 0.98 | -0.02 |
| Avg cost | $0.045 | $0.008 | -82% |
| Avg latency | 2.1s | 0.9s | -57% |
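The delta columns follow directly from the raw numbers: score rows use the absolute difference, cost and latency rows use the percentage change. A quick check against the table above:

```python
def delta(original: float, replay: float) -> float:
    """Absolute change from original to replay (score rows)."""
    return round(replay - original, 3)

def pct_delta(original: float, replay: float) -> int:
    """Percentage change (cost and latency rows)."""
    return round((replay - original) / original * 100)

print(delta(0.88, 0.82))        # helpfulness: -0.06
print(pct_delta(0.045, 0.008))  # avg cost: -82
print(pct_delta(2.1, 0.9))      # avg latency: -57
```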
## A/B testing prompts
Scenario: You've written a new prompt (v2) and want to validate it before deploying.
### Steps
- Push the new prompt version via the API
- Select production traces that used the current prompt
- Replay with prompt template version override
- Compare scores — especially LLM Judge and Similarity
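The steps above amount to pinning each selected trace to the new prompt version. A minimal sketch, assuming a per-trace config shape that is illustrative rather than a documented format:

```python
# Hypothetical batch of prompt-version overrides -- the config shape
# is an assumption, not a documented format.
def prompt_replay_configs(trace_ids: list, new_version: str) -> list:
    """One replay config per production trace, pinned to the new prompt."""
    return [{"trace_id": t, "overrides": {"prompt_version": new_version}}
            for t in trace_ids]

configs = prompt_replay_configs(["tr_1", "tr_2", "tr_3"], "v2")
print(len(configs))  # 3
```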
## A/B testing parameters
Test different temperature, max_tokens, or top_p settings:
- Lower temperature: more consistent but less creative
- Higher max_tokens: more complete but more expensive
- Find the sweet spot for your use case
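Finding the sweet spot reduces to sweeping a parameter and picking the best-scoring setting. The average judge scores below are made-up numbers; the point is the selection step:

```python
# Made-up average judge scores per temperature from a replay sweep;
# the numbers are invented to illustrate the selection step.
avg_scores = {0.0: 0.84, 0.3: 0.88, 0.7: 0.86}

best_temperature = max(avg_scores, key=avg_scores.get)
print(best_temperature)  # 0.3
```

In practice you would weigh quality against cost and latency at each setting, not score alone.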
## Batch replay workflow
- Go to the Traces page, filter to your target population
- Select multiple traces
- Click “Batch Replay”
- Configure overrides
- Review aggregate results when complete
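Reviewing aggregate results means rolling per-trace evaluator scores up into per-evaluator means. A sketch, assuming an export shape that is illustrative rather than the platform's real format:

```python
from collections import defaultdict

# Illustrative per-trace evaluator scores from a batch replay; the
# row shape is an assumption, not the platform's real export format.
results = [
    {"trace": "tr_1", "helpfulness": 0.90, "format": 1.0},
    {"trace": "tr_2", "helpfulness": 0.80, "format": 1.0},
    {"trace": "tr_3", "helpfulness": 0.85, "format": 0.9},
]

by_evaluator = defaultdict(list)
for row in results:
    for key, score in row.items():
        if key != "trace":
            by_evaluator[key].append(score)

means = {k: sum(v) / len(v) for k, v in by_evaluator.items()}
print(means)
```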
## Statistical significance
- Replay at least 30 traces for meaningful comparison
- Look for consistent patterns, not individual outliers
- Pay attention to the score distribution, not just averages
- If results are mixed, increase your sample size
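One way to check for a consistent pattern rather than outliers is a rough paired t-statistic over per-trace deltas (replay minus original). The deltas below are invented to illustrate the calculation:

```python
import statistics

# Paired per-trace score deltas (replay minus original); the numbers
# are invented to illustrate the calculation.
deltas = [-0.05, -0.08, 0.01, -0.06, -0.04, 0.02, -0.07, -0.05,
          -0.03, -0.06]

mean_delta = statistics.mean(deltas)
stdev_delta = statistics.stdev(deltas)
# Rough paired t-statistic: with 30+ traces, |t| well above ~2
# suggests a consistent pattern rather than individual outliers.
t_stat = mean_delta / (stdev_delta / len(deltas) ** 0.5)
print(round(mean_delta, 3), round(t_stat, 2))
```

Plotting the full delta distribution catches bimodal results (some traces improve, others regress) that a single average would hide.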
## Tips
- Focus replays on diverse traces — don't just replay the easy ones
- Check for format regressions first (free evaluators) before analyzing semantic quality
- Keep a record of your replay experiments and decisions
- Replay periodically after model provider updates — their models change too
See the Trace Replay Cookbook and Trace Replay Guide for more details.