Features

The full AI reliability platform

From tracing to evaluations to drift detection — everything you need to test, monitor, and ship reliable AI agents.

Tracing

Instrument in 3 lines of code

Add the @observe decorator and 2Signal captures every LLM call, tool invocation, and agent step automatically. Full span trees with latency, cost, and token breakdowns.

  • Hierarchical span trees with parent-child relationships
  • Automatic token counting and cost tracking per span
  • Near-zero latency overhead — traces export asynchronously in the background
  • Works with OpenAI, Anthropic, Cohere, Google, Mistral, Groq
Python (agent.py)

import twosignal as ts
from twosignal import wrap_openai
from openai import OpenAI

ts.init(api_key="ts_...", project="my-agent")
client = wrap_openai(OpenAI())

@ts.observe()
def run_agent(query: str):
    intent = classify_intent(query)
    context = search_knowledge_base(intent)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

run_agent("How do I reset my password?")
Trace Detail — trc_8f2a9b1c (RUNNING, 312ms, $0.0042)

Span Tree:
  CustomerSupport.run()        AGENT      312ms
    classify_intent()          LLM         86ms
    search_knowledge_base()    TOOL        45ms
      retrieve_context()       RETRIEVAL   38ms
    generate_response()        LLM        168ms

Evaluations:
  Groundedness   0.94
  Toxicity       0.02
  PII Detection  0.00
  Relevance      0.88
  Latency SLA    1.00
  Cost           0.42

SDKs

SDKs for every stack

First-class SDKs for Python, TypeScript, Go, Java, and Ruby. Same pattern everywhere: singleton client, background exporter, context propagation, never throws.

  • Python: @observe decorator, auto-patched LLM wrappers
  • TypeScript: @observe() decorator, AsyncLocalStorage context
  • Go: context.Context propagation, Span.Run() scoping
  • Java: ThreadLocal context, SpanHandle.run() pattern
  • Ruby: observe :method class-level decorator

Evaluators

25+ evaluators, zero config

Run deterministic checks, LLM-as-judge assessments, and safety scans on every trace or in batch. Mix and match to build your quality gate.

  • 8 deterministic evaluators (regex, JSON schema, cost, latency SLA, etc.)
  • 5 LLM-based evaluators (judge, groundedness, factual accuracy, etc.)
  • 5 safety evaluators (prompt injection, PII, toxicity, bias, tool call)
  • Webhook evaluator for custom external evaluation logic
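Deterministic evaluators are simply pure functions over a trace's output and metrics. As a minimal sketch of the idea — the function names and the result shape here are illustrative, not the 2Signal API:

```python
import re

def evaluate_regex(output: str, pattern: str) -> dict:
    """Deterministic regex evaluator: score 1.0 on match, 0.0 otherwise."""
    matched = re.search(pattern, output) is not None
    return {"evaluator": "regex_match",
            "score": 1.0 if matched else 0.0,
            "passed": matched}

def evaluate_latency_sla(latency_ms: float, sla_ms: float = 500) -> dict:
    """Deterministic latency SLA evaluator: pass if the span stayed under the SLA."""
    return {"evaluator": "latency_sla",
            "score": latency_ms,
            "passed": latency_ms <= sla_ms}
```

Because these checks are deterministic, they run on every trace at negligible cost, while LLM-based and safety evaluators can be reserved for sampled or batch runs.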

Deterministic

Contains · Regex Match · JSON Schema · Exact Match · Latency SLA · Cost · Similarity · Levenshtein

LLM-based

LLM Judge · Groundedness · Factual Accuracy · Compliance Check · Workflow Adherence

Safety

Prompt Injection · PII Detection · Toxicity · Bias Detection · Tool Call Validation
Evaluation Results
Evaluator         | Type          | Score | Threshold | Status
------------------|---------------|-------|-----------|-------
Groundedness      | LLM-based     | 0.94  | >= 0.8    | PASS
JSON Schema       | Deterministic | 1.00  | valid     | PASS
Toxicity          | Safety        | 0.02  | <= 0.1    | PASS
Regex Match       | Deterministic | 0.00  | match     | FAIL
Prompt Injection  | Safety        | 0.00  | <= 0.05   | PASS
Latency SLA       | Deterministic | 245ms | <= 500ms  | PASS
Bias Detection    | LLM-based     | 0.08  | <= 0.15   | PASS
Compliance        | LLM-based     | 0.67  | >= 0.7    | WARN

Safety

Safety guardrails built in

Detect prompt injection, PII leaks, toxicity, and bias before they reach your users. Automated alerts when thresholds are breached.

  • Hybrid deterministic + LLM prompt injection detection
  • PII detection across names, emails, phone numbers, SSNs
  • Multi-category bias detection (gender, race, age, disability, religion)
  • Regulatory compliance checking with configurable rule sets
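The deterministic side of PII detection boils down to pattern matching over agent output. A minimal sketch, assuming simple regex rules — real coverage needs far broader patterns (international formats, credit cards, addresses) plus the LLM pass described above:

```python
import re

# Illustrative patterns only -- production PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def detect_pii(text: str) -> list[str]:
    """Return the PII categories found in an agent response."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

A response that trips any category can then be blocked, redacted, or routed to an alert before it reaches the user.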

Prompt Injection Detection

Hybrid deterministic + LLM detection catches both known patterns and novel injection attempts.

PII Detection

Automatically flag personally identifiable information leaking through your agent's responses.

Toxicity & Bias

Multi-category bias detection across gender, race, age, disability, and religion with confidence scores.

Compliance Check

Verify agent outputs meet regulatory requirements. Configurable rule sets for your industry.

Drift Detection

Catch drift before your users do

Statistical process control charts monitor your eval scores over time. Shewhart bands, CUSUM analysis, and Welch's t-tests detect when model quality shifts.

  • Shewhart control charts with 2-sigma and 3-sigma bands
  • CUSUM for detecting gradual score drift
  • Welch's t-test comparing baseline vs recent
  • Automatic alerts when drift magnitude exceeds thresholds
Drift Detection — Groundedness
DRIFT DETECTED · Last 24 days
[Control chart showing mean, 2-sigma, and 3-sigma bands]

A/B Testing

A/B test prompts with statistical rigor

Split traffic between prompt variants, collect scores, and let Welch's t-test determine the winner. Auto-complete when statistical significance is reached.

  • Weighted traffic splitting across prompt variants
  • Real-time score tracking with running statistics
  • Welch's t-test with p-values and 95% confidence intervals
  • Auto-complete when all variants reach 30+ scores
A/B Test — System Prompt v2
COMPLETED · p-value: 0.003

Variant A: Concise System Prompt (WINNER)
  60% traffic · Avg Score 0.87 · Impressions 1,247 · Scores 342

Variant B: Detailed System Prompt
  40% traffic · Avg Score 0.79 · Impressions 831 · Scores 228

Variant A wins with 95% confidence. Absolute difference: +0.08 (+10.1%).
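Welch's t-test is the textbook choice here because it does not assume the two variants' scores have equal variance. A self-contained sketch of the statistic and the Welch–Satterthwaite degrees of freedom (a production pipeline would typically call `scipy.stats.ttest_ind(a, b, equal_var=False)` to also get the p-value):

```python
import statistics

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two unequal-variance samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / se2 ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The 30-score minimum per variant mentioned above is a common rule of thumb for the test's normality assumption to hold reasonably well.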

Model Routing

Cut costs with intelligent routing

Route simple queries to smaller, cheaper models automatically. Complexity-based routing reduces LLM spend by 30–50% without sacrificing quality.

  • Complexity scoring: length, vocabulary, question depth, structure
  • Rule-based model selection by priority with fallback chains
  • Per-trace cost tracking across all LLM providers
  • No code changes — configure routing rules in the dashboard
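To make the routing idea concrete, here is a hedged sketch of complexity-based model selection. The weights, the 0.5 threshold, and the model names are illustrative assumptions — 2Signal's actual scoring is configured in the dashboard, not in code:

```python
def complexity_score(query: str) -> float:
    """Heuristic complexity score in [0, 1] from length, vocabulary, and structure."""
    words = query.split()
    length = min(len(words) / 100, 1.0)                               # longer queries
    vocab = len(set(w.lower() for w in words)) / max(len(words), 1)   # lexical variety
    depth = min(query.count("?") / 3, 1.0)                            # multi-question prompts
    return round(0.5 * length + 0.3 * vocab + 0.2 * depth, 3)

def route_model(query: str, threshold: float = 0.5) -> str:
    """Send simple queries to a cheaper model, complex ones to a stronger one."""
    return "gpt-4o" if complexity_score(query) >= threshold else "gpt-4o-mini"
```

Short, single-intent queries land on the cheaper model; long multi-part prompts escalate — which is where the bulk of the 30–50% savings comes from.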
Available on Growth and Team plans

Deployment Regression

Compare metrics across deployments

Tag traces with deployment versions and automatically compare error rates, latency, cost, and eval scores between releases.

  • Welch's t-test for continuous metrics (latency, cost, scores)
  • Two-proportion z-test for rates (error rate, eval pass rate)
  • Automatic comparison after 50+ traces on new deployment
  • CI/CD integration via REST API for deployment tracking
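For rate metrics like error rate, the two-proportion z-test compares the old and new deployment directly. A self-contained sketch using the pooled standard error (the function name is illustrative; the statistic itself is textbook):

```python
from math import sqrt, erf

def two_proportion_z(err_a: int, n_a: int, err_b: int, n_b: int) -> tuple[float, float]:
    """Two-proportion z-test, e.g. error rate on old vs new deployment.

    Returns (z, two-sided p-value) using the pooled-proportion standard error
    and the normal CDF via erf.
    """
    p_a, p_b = err_a / n_a, err_b / n_b
    pooled = (err_a + err_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value
```

With 5 errors in 100 traces on the old deployment versus 20 in 100 on the new one, the test flags a significant regression well below p = 0.05 — exactly the signal you want before rolling forward.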
Available on Growth and Team plans

Human Review

Human-in-the-loop quality assurance

Review queues for manual QA and annotation. Approve, reject, or skip traces, then export approved items to golden test datasets.

  • Sequential review mode with keyboard shortcuts (1/2/3 for sentiment)
  • Approve/reject/skip workflow with notes and labels
  • Export approved items to datasets for building golden test sets
  • Smart prioritization based on eval scores and error rates
Available on Growth and Team plans

Sessions

Group traces into conversations

Track multi-turn conversations and agent workflows as sessions. See the full context of how users interact with your agent over time.

  • Automatic session grouping by session ID
  • Session-level aggregation (total cost, duration, trace count)
  • Visual timeline of traces within a session
  • Works across all SDKs with simple session context
Available on Growth and Team plans

Alerts

Know when things break

Threshold-based alerts via email, Slack, or webhooks. Cooldown periods prevent alert storms. Monitor everything from error rates to eval drift.

  • 11 alert metrics: error rate, latency, cost, eval scores, drift, PII, etc.
  • Email, Slack, and webhook delivery channels
  • Configurable time windows and cooldown periods
  • Auto-triggered after evaluation runs complete
Available on Growth and Team plans

Custom Dashboards

Build your own dashboards

Create custom dashboards with configurable widget grids. Track the metrics that matter most to your team with 20+ chart types.

  • Configurable widget grids per project
  • Custom metric queries for any data in your project
  • Shareable dashboard URLs for stakeholders
  • Duplicate and customize pre-built templates
Available on Growth and Team plans

Ready to ship reliable AI agents?

Join thousands of developers who trust 2Signal for testing and monitoring their AI agents. Get started in under 5 minutes.

Free forever for small projects. No credit card required.