Features
The full AI reliability platform
From tracing to evaluations to drift detection — everything you need to test, monitor, and ship reliable AI agents.
Tracing
Instrument in 3 lines of code
Add the @observe decorator and 2Signal captures every LLM call, tool invocation, and agent step automatically. Full span trees with latency, cost, and token breakdowns.
- Hierarchical span trees with parent-child relationships
- Automatic token counting and cost tracking per span
- Zero latency overhead — async background export
- Works with OpenAI, Anthropic, Cohere, Google, Mistral, Groq
```python
import twosignal as ts
from twosignal import wrap_openai
from openai import OpenAI

ts.init(api_key="ts_...", project="my-agent")
client = wrap_openai(OpenAI())  # every call through this client is traced

@ts.observe()
def run_agent(query: str):
    intent = classify_intent(query)          # your own pipeline steps
    context = search_knowledge_base(intent)  # (placeholders here)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

run_agent("How do I reset my password?")
```
SDKs
SDKs for every stack
First-class SDKs for Python, TypeScript, Go, Java, and Ruby. Same pattern everywhere: singleton client, background exporter, context propagation, never throws.
- Python: @observe decorator, auto-patched LLM wrappers
- TypeScript: @observe() decorator, AsyncLocalStorage context
- Go: context.Context propagation, Span.Run() scoping
- Java: ThreadLocal context, SpanHandle.run() pattern
- Ruby: observe :method class-level decorator
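In Python, context propagation means nested `@ts.observe()` functions are recorded as child spans of their caller. A minimal sketch (the helper body is a placeholder, and the parent-child behavior is inferred from the hierarchical span trees described under Tracing):

```python
import twosignal as ts

ts.init(api_key="ts_...", project="my-agent")

@ts.observe()
def classify_intent(query: str) -> str:
    # Placeholder classifier -- in practice this might call an LLM.
    return "password_reset"

@ts.observe()
def run_agent(query: str) -> str:
    # classify_intent runs inside run_agent's span, so the SDK records
    # it as a child span via context propagation.
    intent = classify_intent(query)
    return f"Handling intent: {intent}"

run_agent("How do I reset my password?")
```

The TypeScript, Go, Java, and Ruby SDKs follow the same shape with their language's native context mechanism.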
Evaluators
25+ evaluators, zero config
Run deterministic checks, LLM-as-judge assessments, and safety scans on every trace or in batch. Mix and match to build your quality gate.
- 8 deterministic evaluators (regex, JSON schema, cost, latency SLA, etc.)
- 5 LLM-based evaluators (judge, groundedness, factual accuracy, etc.)
- 5 safety evaluators (prompt injection, PII, toxicity, bias, tool call)
- Webhook evaluator for custom external evaluation logic
| Evaluator | Type | Score | Threshold | Status |
|---|---|---|---|---|
| Groundedness | LLM-based | 0.94 | >=0.8 | PASS |
| JSON Schema | Deterministic | 1.00 | valid | PASS |
| Toxicity | Safety | 0.02 | <=0.1 | PASS |
| Regex Match | Deterministic | 0.00 | match | FAIL |
| Prompt Injection | Safety | 0.00 | <=0.05 | PASS |
| Latency SLA | Deterministic | 245ms | <=500ms | PASS |
| Bias Detection | LLM-based | 0.08 | <=0.15 | PASS |
| Compliance | LLM-based | 0.67 | >=0.7 | WARN |
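Programmatic access might look like the sketch below; note that `ts.evaluate`, the evaluator slugs, and the result fields are all illustrative assumptions rather than confirmed 2Signal API:

```python
import twosignal as ts

ts.init(api_key="ts_...", project="my-agent")

# HYPOTHETICAL API: `ts.evaluate`, the evaluator slugs, and the result
# shape are illustrative assumptions, not confirmed SDK surface.
result = ts.evaluate(
    trace_id="tr_abc123",
    evaluators=["json_schema", "groundedness", "prompt_injection"],
)
for check in result.checks:
    print(f"{check.name}: {check.score} ({check.status})")
```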
Safety
Safety guardrails built in
Detect prompt injection, PII leaks, toxicity, and bias before they reach your users. Automated alerts when thresholds are breached.
- Hybrid deterministic + LLM prompt injection detection
- PII detection across names, emails, phone numbers, SSNs
- Multi-category bias detection (gender, race, age, disability, religion)
- Regulatory compliance checking with configurable rule sets
Prompt Injection Detection
Hybrid deterministic + LLM detection catches both known patterns and novel injection attempts.
PII Detection
Automatically flag personally identifiable information leaking through your agent's responses.
Toxicity & Bias
Multi-category bias detection across gender, race, age, disability, and religion with confidence scores.
Compliance Check
Verify agent outputs meet regulatory requirements. Configurable rule sets for your industry.
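To make the deterministic half of that hybrid concrete, here is a toy pattern check in plain Python. The patterns are illustrative only; the LLM layer exists precisely to catch novel phrasings that regexes like these would miss.

```python
import re

# Illustrative patterns only -- a real deterministic layer would be far
# more extensive; the LLM layer handles novel injection attempts.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now in developer mode", re.I),
    re.compile(r"reveal (your )?(system prompt|instructions)", re.I),
]

def deterministic_injection_score(text: str) -> float:
    """Return 1.0 if any known injection pattern matches, else 0.0."""
    return 1.0 if any(p.search(text) for p in INJECTION_PATTERNS) else 0.0

print(deterministic_injection_score("Please ignore previous instructions."))  # 1.0
print(deterministic_injection_score("How do I reset my password?"))           # 0.0
```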
Drift Detection
Catch drift before your users do
Statistical process control charts monitor your eval scores over time. Shewhart bands, CUSUM analysis, and Welch's t-tests detect when model quality shifts.
- Shewhart control charts with 2-sigma and 3-sigma bands
- CUSUM for detecting gradual score drift
- Welch's t-test comparing baseline and recent score windows
- Automatic alerts when drift magnitude exceeds thresholds
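For intuition, here is what the Shewhart and CUSUM checks compute over a series of daily mean eval scores (a standalone sketch, not the platform's implementation):

```python
import statistics

scores = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.85, 0.84, 0.83, 0.82]

# Shewhart bands: flag points beyond baseline mean +/- 3 * sigma.
baseline = scores[:6]
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
for i, s in enumerate(scores):
    if abs(s - mu) > 3 * sigma:
        print(f"point {i}: {s:.2f} outside 3-sigma band")

# One-sided CUSUM: accumulates small downward deviations that a
# per-point Shewhart check would miss.
k, h, cusum = 0.5 * sigma, 4 * sigma, 0.0
for i, s in enumerate(scores):
    cusum = max(0.0, cusum + (mu - s) - k)
    if cusum > h:
        print(f"point {i}: CUSUM {cusum:.3f} exceeds threshold -> drift")
```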
A/B Testing
A/B test prompts with statistical rigor
Split traffic between prompt variants, collect scores, and let Welch's t-test determine the winner. Auto-complete when statistical significance is reached.
- Weighted traffic splitting across prompt variants
- Real-time score tracking with running statistics
- Welch's t-test with p-values and 95% confidence intervals
- Auto-complete when all variants reach 30+ scores
| Variant | Mean Score | Traces | Scores |
|---|---|---|---|
| A | 0.87 | 1,247 | 342 |
| B | 0.79 | 831 | 228 |
Variant A wins with 95% confidence. Absolute difference: +0.08 (+10.1%).
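Under the hood this is the standard unequal-variance t-test; a sketch with scipy, using synthetic per-trace scores since only the summary statistics are shown above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic scores approximating the two variants above.
variant_a = rng.normal(0.87, 0.10, size=342)
variant_b = rng.normal(0.79, 0.10, size=228)

# Welch's t-test: no equal-variance assumption between variants.
t_stat, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
if p_value < 0.05:
    winner = "A" if variant_a.mean() > variant_b.mean() else "B"
    print(f"Variant {winner} wins (p = {p_value:.4f})")
```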
Model Routing
Cut costs with intelligent routing
Route simple queries to smaller, cheaper models automatically. Complexity-based routing reduces LLM spend by 30–50% without sacrificing quality.
- Complexity scoring: length, vocabulary, question depth, structure
- Rule-based model selection by priority with fallback chains
- Per-trace cost tracking across all LLM providers
- No code changes — configure routing rules in the dashboard
Available on Growth and Team plans
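2Signal applies these rules with no code changes; the toy version below just illustrates the kind of scoring involved, with illustrative weights and model names:

```python
def complexity_score(query: str) -> float:
    """Toy heuristic over the signals named above: length, vocabulary,
    question depth, structure. Weights are illustrative assumptions."""
    words = query.split()
    length = min(len(words) / 100, 1.0)
    vocabulary = len(set(w.lower() for w in words)) / max(len(words), 1)
    depth = min(query.count("?") / 3, 1.0)
    structure = min(query.count("\n") / 5, 1.0)
    return 0.4 * length + 0.2 * vocabulary + 0.2 * depth + 0.2 * structure

def route(query: str) -> str:
    # Illustrative rule: simple queries go to a cheaper model.
    if complexity_score(query) < 0.3:
        return "gpt-4o-mini"
    return "gpt-4o"

print(route("How do I reset my password?"))  # likely the cheaper model
```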
Deployment Regression
Compare metrics across deployments
Tag traces with deployment versions and automatically compare error rates, latency, cost, and eval scores between releases.
- Welch's t-test for continuous metrics (latency, cost, scores)
- Two-proportion z-test for rates (error rate, eval pass rate)
- Automatic comparison after 50+ traces on new deployment
- CI/CD integration via REST API for deployment tracking
Available on Growth and Team plans
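In outline, the two-proportion z-test for rates is a few lines of arithmetic (a standalone sketch, not the platform's implementation):

```python
from math import sqrt, erf

def two_proportion_z_test(errors_a: int, n_a: int, errors_b: int, n_b: int):
    """Compare error rates between two deployments."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# e.g. v1: 12 errors in 500 traces; v2: 31 errors in 500 traces
z, p = two_proportion_z_test(12, 500, 31, 500)
print(f"z = {z:.2f}, p = {p:.4f}")
```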
Human Review
Human-in-the-loop quality assurance
Review queues for manual QA and annotation. Approve, reject, or skip traces, then export approved items to golden test datasets.
- Sequential review mode with keyboard shortcuts (1/2/3 for sentiment)
- Approve/reject/skip workflow with notes and labels
- Export approved items to datasets for building golden test sets
- Smart prioritization based on eval scores and error rates
Available on Growth and Team plans
Sessions
Group traces into conversations
Track multi-turn conversations and agent workflows as sessions. See the full context of how users interact with your agent over time.
- Automatic session grouping by session ID
- Session-level aggregation (total cost, duration, trace count)
- Visual timeline of traces within a session
- Works across all SDKs with simple session context
Available on Growth and Team plans
Alerts
Know when things break
Threshold-based alerts via email, Slack, or webhooks. Cooldown periods prevent alert storms. Monitor everything from error rates to eval drift.
- 11 alert metrics: error rate, latency, cost, eval scores, drift, PII, etc.
- Email, Slack, and webhook delivery channels
- Configurable time windows and cooldown periods
- Auto-triggered after evaluation runs complete
Available on Growth and Team plans
Custom Dashboards
Build your own dashboards
Create custom dashboards with configurable widget grids. Track the metrics that matter most to your team with 20+ chart types.
- Configurable widget grids per project
- Custom metric queries for any data in your project
- Shareable dashboard URLs for stakeholders
- Duplicate and customize pre-built templates
Available on Growth and Team plans
Ready to ship reliable AI agents?
Join thousands of developers who trust 2Signal for testing and monitoring their AI agents. Get started in under 5 minutes.
Free forever for small projects. No credit card required.