A/B Test Prompts
This recipe walks through running a prompt A/B test end-to-end: creating template versions, splitting traffic, collecting scores, and determining a statistically significant winner.
Prerequisites
- A 2Signal project with the Python or TypeScript SDK installed
- A prompt template with at least two versions
Step 1: Create Prompt Versions
In Prompt Templates, create or select a template. Push two versions:
```
# Version 1 — concise style
You are a helpful customer support agent. Answer the user's question in 1-2 sentences.
```

```
# Version 2 — detailed style
You are a helpful customer support agent. Answer the user's question thoroughly.
Include relevant details, links to documentation, and next steps.
```

Step 2: Create an A/B Test
Go to A/B Tests → Create. Link each prompt version as a variant and set traffic weights:
```
Test: "Support Prompt Style Test"
Variant A: Version 1 (concise) — Weight: 50
Variant B: Version 2 (detailed) — Weight: 50
```

Weights must sum to 100. Click Start to begin the test.
Step 3: Integrate with Your Agent
```python
import twosignal

client = twosignal.TwoSignal(api_key="your-key")

# Get the variant for this request
variant = client.get_ab_test_variant("Support Prompt Style Test")

# Use the variant's prompt content
prompt = variant["content"]

# ... call your LLM with this prompt ...

# Record a quality score (e.g., from an evaluator or user feedback)
client.record_ab_test_score(
    variant_id=variant["variant_id"],
    score=0.9,  # 0-1 scale
)
```

The SDK calls GET /api/v1/ab-test?name=... to get the variant and POST /api/v1/ab-test to record scores. Variant selection is weighted random — 50/50 in this example.
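If you are not using an SDK, you can call those two endpoints directly. A sketch of how the requests are shaped (the base URL is a placeholder for your deployment, and the exact response schema is not shown here; check the API reference):

```python
from urllib.parse import urlencode

BASE = "https://your-2signal-host"  # placeholder: your deployment's base URL

def variant_url(test_name: str) -> str:
    """Build the GET URL used to fetch the variant for a named test."""
    return f"{BASE}/api/v1/ab-test?{urlencode({'name': test_name})}"

def score_payload(variant_id: str, score: float) -> dict:
    """Build the JSON body for recording a score via POST /api/v1/ab-test."""
    assert 0.0 <= score <= 1.0, "scores are on a 0-1 scale"
    return {"variant_id": variant_id, "score": score}

url = variant_url("Support Prompt Style Test")
body = score_payload("variant-a-id", 0.9)  # "variant-a-id" is a hypothetical ID
```

You would then pass `url` to an HTTP GET and `body` as the JSON payload of an HTTP POST, with your API key in the request's auth header.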
Step 4: Monitor Results
Open the test detail page in the dashboard. You'll see:
- KPI cards — Total impressions, mean score, score count per variant
- Performance table — Side-by-side comparison of variants
- Statistical significance panel — Welch's t-test results with p-value, 95% CI, and winner recommendation
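The significance panel's core computation is a standard Welch's t-test, which does not assume the two variants have equal score variance. A sketch of the statistic and degrees of freedom it reports (the p-value then comes from the t-distribution with those degrees of freedom):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two independent samples."""
    va, vb = variance(a), variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation for degrees of freedom
    df = se2**2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical per-variant score samples (0-1 scale)
t, df = welch_t([0.9, 0.8, 0.85, 0.95, 0.9], [0.7, 0.75, 0.65, 0.8, 0.7])
# t ≈ 4.4 with df = 8 for these samples
```

To cross-check, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test and also returns the p-value.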
Step 5: Auto-Complete
The test auto-completes when:
- All variants have at least 30 scores
- Statistical significance is reached (p < 0.05)
You can also manually stop or complete the test at any time. Once complete, update your prompt template to use the winning version.
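The auto-complete rule is simple enough to mirror in your own monitoring. A sketch of the decision logic (thresholds are the ones stated above; the function name is mine):

```python
MIN_SCORES = 30  # per-variant score count required before auto-complete
ALPHA = 0.05     # p-value threshold for statistical significance

def should_auto_complete(score_counts: list[int], p_value: float) -> bool:
    """True when every variant has enough scores AND the result is significant."""
    return all(n >= MIN_SCORES for n in score_counts) and p_value < ALPHA

# Enough data on both variants and a significant result: completes
ok = should_auto_complete([45, 52], p_value=0.03)
# One variant is still under 30 scores, so the test keeps running
not_yet = should_auto_complete([45, 12], p_value=0.01)
```

Note that both conditions must hold: a tiny p-value on too few scores does not end the test, which guards against early noise looking like a winner.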
Tips
- Use consistent scoring — if you use LLM Judge scores, make sure the same evaluator prompt and model are used throughout the test.
- Run tests for at least 100 scores per variant for reliable results. The 30-score minimum is for auto-complete, not for confidence.
- Test one variable at a time. If you change both the system prompt and the temperature, you won't know which caused the difference.