A/B Test Prompts
This recipe walks through running a prompt A/B test end-to-end: creating template versions, splitting traffic, collecting scores, and determining a statistically significant winner.
Prerequisites
- A 2Signal project with the Python or TypeScript SDK installed
- A prompt template with at least two versions
Step 1: Create Prompt Versions
In Prompt Templates, create or select a template. Push two versions:
```
# Version 1 — concise style
You are a helpful customer support agent. Answer the user's question in 1-2 sentences.
```

```
# Version 2 — detailed style
You are a helpful customer support agent. Answer the user's question thoroughly.
Include relevant details, links to documentation, and next steps.
```

Step 2: Create an A/B Test
Go to A/B Tests → Create. Link each prompt version as a variant and set traffic weights:
```
Test: "Support Prompt Style Test"
Variant A: Version 1 (concise) — Weight: 50
Variant B: Version 2 (detailed) — Weight: 50
```

Weights must sum to 100. Click Start to begin the test.
Step 3: Integrate with Your Agent
```python
import twosignal

client = twosignal.TwoSignal(api_key="your-key")

# Get the variant for this request
variant = client.get_ab_test_variant("Support Prompt Style Test")

# Use the variant's prompt content
prompt = variant["content"]

# ... call your LLM with this prompt ...

# Record a quality score (e.g., from an evaluator or user feedback)
client.record_ab_test_score(
    variant_id=variant["variant_id"],
    score=0.9,  # 0-1 scale
)
```

The SDK calls GET /api/v1/ab-test?name=... to get the variant and POST /api/v1/ab-test to record scores. Variant selection is weighted random — 50/50 in this example.
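If you are not using an SDK, you can call those two endpoints directly. A sketch of how the requests are shaped (the base URL is a placeholder for your deployment, and the exact response schema is not shown here; check the API reference):

```python
from urllib.parse import urlencode

BASE = "https://your-2signal-host"  # placeholder: your deployment's base URL

def variant_url(test_name: str) -> str:
    """Build the GET URL used to fetch the variant for a named test."""
    return f"{BASE}/api/v1/ab-test?{urlencode({'name': test_name})}"

def score_payload(variant_id: str, score: float) -> dict:
    """Build the JSON body for recording a score via POST /api/v1/ab-test."""
    assert 0.0 <= score <= 1.0, "scores are on a 0-1 scale"
    return {"variant_id": variant_id, "score": score}

url = variant_url("Support Prompt Style Test")
body = score_payload("variant-a-id", 0.9)  # "variant-a-id" is a hypothetical ID
```

You would then pass `url` to an HTTP GET and `body` as the JSON payload of an HTTP POST, with your API key in the request's auth header.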
Step 4: Monitor Results
Open the test detail page in the dashboard. You'll see:
- KPI cards — Total impressions, mean score, score count per variant
- Performance table — Side-by-side comparison of variants
- Statistical significance panel — Welch's t-test results with p-value, 95% CI, and winner recommendation
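The significance panel's core computation is a standard Welch's t-test, which does not assume the two variants have equal score variance. A sketch of the statistic and degrees of freedom it reports (the p-value then comes from the t-distribution with those degrees of freedom):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two independent samples."""
    va, vb = variance(a), variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation for degrees of freedom
    df = se2**2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical per-variant score samples (0-1 scale)
t, df = welch_t([0.9, 0.8, 0.85, 0.95, 0.9], [0.7, 0.75, 0.65, 0.8, 0.7])
# t ≈ 4.4 with df = 8 for these samples
```

To cross-check, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test and also returns the p-value.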
Step 5: Auto-Complete
The test auto-completes when:
- All variants have at least 30 scores
- Statistical significance is reached (p < 0.05)
You can also manually stop or complete the test at any time. Once complete, update your prompt template to use the winning version.
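The auto-complete rule is simple enough to mirror in your own monitoring. A sketch of the decision logic (thresholds are the ones stated above; the function name is mine):

```python
MIN_SCORES = 30  # per-variant score count required before auto-complete
ALPHA = 0.05     # p-value threshold for statistical significance

def should_auto_complete(score_counts: list[int], p_value: float) -> bool:
    """True when every variant has enough scores AND the result is significant."""
    return all(n >= MIN_SCORES for n in score_counts) and p_value < ALPHA

# Enough data on both variants and a significant result: completes
ok = should_auto_complete([45, 52], p_value=0.03)
# One variant is still under 30 scores, so the test keeps running
not_yet = should_auto_complete([45, 12], p_value=0.01)
```

Note that both conditions must hold: a tiny p-value on too few scores does not end the test, which guards against early noise looking like a winner.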
Tips
- Use consistent scoring — if you use LLM Judge scores, make sure the same evaluator prompt and model are used throughout the test.
- Run tests for at least 100 scores per variant for reliable results. The 30-score minimum is for auto-complete, not for confidence.
- Test one variable at a time. If you change both the system prompt and the temperature, you won't know which caused the difference.