# Prompt A/B Testing
A/B testing lets you compare prompt template versions by splitting traffic and measuring per-variant scores with statistical rigor.
## How It Works
- Create a test linking two or more prompt template versions, each with a traffic weight (weights must sum to 100).
- Start the test. SDK calls to `GET /api/v1/ab-test?name=...` return a randomly selected variant based on weights.
- As your agent runs, record scores for each variant via `POST /api/v1/ab-test` with the variant ID and score.
- 2Signal computes running statistics (mean, variance) and performs Welch's t-test for statistical significance.
- When all variants have 30+ scores and significance is reached, the test auto-completes with a winner recommendation.
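The weighted variant selection in the steps above can be sketched with the standard library. This is an illustration, not 2Signal's server-side code: the variant dicts mirror the `variant_id` and `content` fields used by the SDK, but the weights and prompt contents are invented.

```python
import random

# Illustrative variants; weights must sum to 100, as required when creating a test.
variants = [
    {"variant_id": "v1", "content": "You are a concise assistant...", "weight": 70},
    {"variant_id": "v2", "content": "You are a detailed assistant...", "weight": 30},
]

assert sum(v["weight"] for v in variants) == 100


def pick_variant(variants):
    """Randomly select one variant, proportionally to its traffic weight."""
    weights = [v["weight"] for v in variants]
    return random.choices(variants, weights=weights, k=1)[0]


chosen = pick_variant(variants)
print(chosen["variant_id"])
```

Over many requests, roughly 70% of traffic would see `v1` and 30% `v2`.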
## Test Lifecycle
| Status | Description |
|---|---|
| DRAFT | Created but not yet running — configure variants and weights |
| RUNNING | Actively splitting traffic and collecting scores |
| STOPPED | Paused — can be resumed |
| COMPLETED | Finished — winner determined or manually completed |
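The table above implies a small state machine. A minimal sketch of the transitions it describes follows; the exact rules (for example, whether a stopped test can be completed manually) are my reading of the table, not a documented API.

```python
# Hypothetical sketch of the test lifecycle as allowed status transitions.
ALLOWED_TRANSITIONS = {
    "DRAFT": {"RUNNING"},                 # start the test
    "RUNNING": {"STOPPED", "COMPLETED"},  # pause, or finish with a winner
    "STOPPED": {"RUNNING", "COMPLETED"},  # resume, or complete manually
    "COMPLETED": set(),                   # terminal state
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


print(can_transition("DRAFT", "RUNNING"))      # True
print(can_transition("COMPLETED", "RUNNING"))  # False
```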
## Statistical Significance
Results include Welch's t-test statistics:
- p-value — Probability of observing a difference at least this large if the variants were actually equal (significant at p < 0.05)
- 95% confidence interval — Range of the true difference between variants
- Absolute and relative difference — How much better the winning variant is
- Winner recommendation — Which variant to keep
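As a rough illustration of the statistics above, Welch's t-test can be computed with the standard library alone. The sample scores are invented, and the normal approximation for the two-sided p-value is my simplification (defensible at the 30+ scores a test collects before completing; 2Signal's own computation may use the exact t-distribution).

```python
import math
import statistics


def welch_t_test(a, b):
    """Welch's t-test for two independent samples with unequal variances."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.fmean(a), statistics.fmean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(v1 / n1 + v2 / n2)  # standard error of the difference
    t = (m1 - m2) / se
    # Welch–Satterthwaite degrees of freedom
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    # Two-sided p-value via a normal approximation (my simplification)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p


control = [0.72, 0.68, 0.75, 0.70, 0.66, 0.74, 0.71, 0.69]
candidate = [0.80, 0.83, 0.78, 0.85, 0.79, 0.82, 0.81, 0.84]
t, df, p = welch_t_test(candidate, control)
print(f"t={t:.2f}, df={df:.1f}, p={p:.2g}, significant={p < 0.05}")
```

With these samples the candidate variant's mean is clearly higher, so the test reports a significant positive difference.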
## SDK Integration
```python
import twosignal

client = twosignal.TwoSignal(api_key="your-key")

# Get the variant to use for this request
variant = client.get_ab_test_variant("my-prompt-test")
prompt = variant["content"]

# ... run your agent with this prompt ...

# Record a score for this variant
client.record_ab_test_score(
    variant_id=variant["variant_id"],
    score=0.85,
)
```

## Dashboard
The A/B test detail page shows KPIs per variant (impressions, mean score, score count), a variant performance comparison table, and a statistical significance panel with the t-test results.
You can also manually record scores or simulate test data from the dashboard for testing purposes.