Features
The full AI reliability platform
From tracing to evaluations to drift detection — everything you need to test, monitor, and ship reliable AI agents.
Tracing
Instrument in 3 lines of code
Add the @observe decorator and 2Signal captures every LLM call, tool invocation, and agent step automatically. Full span trees with latency, cost, and token breakdowns.
- Hierarchical span trees with parent-child relationships
- Automatic token counting and cost tracking per span
- Zero latency overhead — async background export
- Works with OpenAI, Anthropic, Cohere, Google, Mistral, Groq
```python
import twosignal as ts
from twosignal import wrap_openai
from openai import OpenAI

ts.init(api_key="ts_...", project="my-agent")
client = wrap_openai(OpenAI())  # every call through this client is traced

@ts.observe()
def run_agent(query: str):
    intent = classify_intent(query)          # your own pipeline steps
    context = search_knowledge_base(intent)  # (placeholders here)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

run_agent("How do I reset my password?")
```
SDKs
SDKs for every stack
First-class SDKs for Python, TypeScript, Go, Java, and Ruby. Same pattern everywhere: singleton client, background exporter, context propagation, never throws.
- Python: @observe decorator, auto-patched LLM wrappers
- TypeScript: @observe() decorator, AsyncLocalStorage context
- Go: context.Context propagation, Span.Run() scoping
- Java: ThreadLocal context, SpanHandle.run() pattern
- Ruby: observe :method class-level decorator
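In Python, context propagation means nested `@ts.observe()` functions are recorded as child spans of their caller. A minimal sketch (the helper body is a placeholder, and the parent-child behavior is inferred from the hierarchical span trees described under Tracing):

```python
import twosignal as ts

ts.init(api_key="ts_...", project="my-agent")

@ts.observe()
def classify_intent(query: str) -> str:
    # Placeholder classifier -- in practice this might call an LLM.
    return "password_reset"

@ts.observe()
def run_agent(query: str) -> str:
    # classify_intent runs inside run_agent's span, so the SDK records
    # it as a child span via context propagation.
    intent = classify_intent(query)
    return f"Handling intent: {intent}"

run_agent("How do I reset my password?")
```

The TypeScript, Go, Java, and Ruby SDKs follow the same shape with their language's native context mechanism.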
Evaluators
25+ evaluators, zero config
Run deterministic checks, LLM-as-judge assessments, and safety scans on every trace or in batch. Mix and match to build your quality gate.
- 8 deterministic evaluators (regex, JSON schema, cost, latency SLA, etc.)
- 5 LLM-based evaluators (judge, groundedness, factual accuracy, etc.)
- 5 safety evaluators (prompt injection, PII, toxicity, bias, tool call)
- Webhook evaluator for custom external evaluation logic
| Evaluator | Type | Score | Threshold | Status |
|---|---|---|---|---|
| Groundedness | LLM-based | 0.94 | >=0.8 | PASS |
| JSON Schema | Deterministic | 1.00 | valid | PASS |
| Toxicity | Safety | 0.02 | <=0.1 | PASS |
| Regex Match | Deterministic | 0.00 | match | FAIL |
| Prompt Injection | Safety | 0.00 | <=0.05 | PASS |
| Latency SLA | Deterministic | 245ms | <=500ms | PASS |
| Bias Detection | LLM-based | 0.08 | <=0.15 | PASS |
| Compliance | LLM-based | 0.67 | >=0.7 | WARN |
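Programmatic access might look like the sketch below; note that `ts.evaluate`, the evaluator slugs, and the result fields are all illustrative assumptions rather than confirmed 2Signal API:

```python
import twosignal as ts

ts.init(api_key="ts_...", project="my-agent")

# HYPOTHETICAL API: `ts.evaluate`, the evaluator slugs, and the result
# shape are illustrative assumptions, not confirmed SDK surface.
result = ts.evaluate(
    trace_id="tr_abc123",
    evaluators=["json_schema", "groundedness", "prompt_injection"],
)
for check in result.checks:
    print(f"{check.name}: {check.score} ({check.status})")
```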
Safety
Safety guardrails built in
Detect prompt injection, PII leaks, toxicity, and bias before they reach your users. Automated alerts when thresholds are breached.
- Hybrid deterministic + LLM prompt injection detection
- PII detection across names, emails, phone numbers, SSNs
- Multi-category bias detection (gender, race, age, disability, religion)
- Regulatory compliance checking with configurable rule sets
Prompt Injection Detection
Hybrid deterministic + LLM detection catches both known patterns and novel injection attempts.
PII Detection
Automatically flag personally identifiable information leaking through your agent's responses.
Toxicity & Bias
Multi-category bias detection across gender, race, age, disability, and religion with confidence scores.
Compliance Check
Verify agent outputs meet regulatory requirements. Configurable rule sets for your industry.
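To make the deterministic half of that hybrid concrete, here is a toy pattern check in plain Python. The patterns are illustrative only; the LLM layer exists precisely to catch novel phrasings that regexes like these would miss.

```python
import re

# Illustrative patterns only -- a real deterministic layer would be far
# more extensive; the LLM layer handles novel injection attempts.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now in developer mode", re.I),
    re.compile(r"reveal (your )?(system prompt|instructions)", re.I),
]

def deterministic_injection_score(text: str) -> float:
    """Return 1.0 if any known injection pattern matches, else 0.0."""
    return 1.0 if any(p.search(text) for p in INJECTION_PATTERNS) else 0.0

print(deterministic_injection_score("Please ignore previous instructions."))  # 1.0
print(deterministic_injection_score("How do I reset my password?"))           # 0.0
```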
Drift Detection
Catch drift before your users do
Statistical process control charts monitor your eval scores over time. Shewhart bands, CUSUM analysis, and Welch's t-tests detect when model quality shifts.
- Shewhart control charts with 2-sigma and 3-sigma bands
- CUSUM for detecting gradual score drift
- Welch's t-test comparing baseline and recent score windows
- Automatic alerts when drift magnitude exceeds thresholds
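For intuition, here is what the Shewhart and CUSUM checks compute over a series of daily mean eval scores (a standalone sketch, not the platform's implementation):

```python
import statistics

scores = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.85, 0.84, 0.83, 0.82]

# Shewhart bands: flag points beyond baseline mean +/- 3 * sigma.
baseline = scores[:6]
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
for i, s in enumerate(scores):
    if abs(s - mu) > 3 * sigma:
        print(f"point {i}: {s:.2f} outside 3-sigma band")

# One-sided CUSUM: accumulates small downward deviations that a
# per-point Shewhart check would miss.
k, h, cusum = 0.5 * sigma, 4 * sigma, 0.0
for i, s in enumerate(scores):
    cusum = max(0.0, cusum + (mu - s) - k)
    if cusum > h:
        print(f"point {i}: CUSUM {cusum:.3f} exceeds threshold -> drift")
```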
A/B Testing
A/B test prompts with statistical rigor
Split traffic between prompt variants, collect scores, and let Welch's t-test determine the winner. Auto-complete when statistical significance is reached.
- Weighted traffic splitting across prompt variants
- Real-time score tracking with running statistics
- Welch's t-test with p-values and 95% confidence intervals
- Auto-complete when all variants reach 30+ scores
| Variant | Mean Score | Traces | Scores |
|---|---|---|---|
| A | 0.87 | 1,247 | 342 |
| B | 0.79 | 831 | 228 |
Variant A wins with 95% confidence. Absolute difference: +0.08 (+10.1%).
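Under the hood this is the standard unequal-variance t-test; a sketch with scipy, using synthetic per-trace scores since only the summary statistics are shown above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic scores approximating the two variants above.
variant_a = rng.normal(0.87, 0.10, size=342)
variant_b = rng.normal(0.79, 0.10, size=228)

# Welch's t-test: no equal-variance assumption between variants.
t_stat, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
if p_value < 0.05:
    winner = "A" if variant_a.mean() > variant_b.mean() else "B"
    print(f"Variant {winner} wins (p = {p_value:.4f})")
```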
Model Routing
Cut costs with intelligent routing
Route simple queries to smaller, cheaper models automatically. Complexity-based routing reduces LLM spend by 30–50% without sacrificing quality.
- Complexity scoring: length, vocabulary, question depth, structure
- Rule-based model selection by priority with fallback chains
- Per-trace cost tracking across all LLM providers
- No code changes — configure routing rules in the dashboard
Available on Growth and Team plans
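2Signal applies these rules with no code changes; the toy version below just illustrates the kind of scoring involved, with illustrative weights and model names:

```python
def complexity_score(query: str) -> float:
    """Toy heuristic over the signals named above: length, vocabulary,
    question depth, structure. Weights are illustrative assumptions."""
    words = query.split()
    length = min(len(words) / 100, 1.0)
    vocabulary = len(set(w.lower() for w in words)) / max(len(words), 1)
    depth = min(query.count("?") / 3, 1.0)
    structure = min(query.count("\n") / 5, 1.0)
    return 0.4 * length + 0.2 * vocabulary + 0.2 * depth + 0.2 * structure

def route(query: str) -> str:
    # Illustrative rule: simple queries go to a cheaper model.
    if complexity_score(query) < 0.3:
        return "gpt-4o-mini"
    return "gpt-4o"

print(route("How do I reset my password?"))  # likely the cheaper model
```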
Deployment Regression
Compare metrics across deployments
Tag traces with deployment versions and automatically compare error rates, latency, cost, and eval scores between releases.
- Welch's t-test for continuous metrics (latency, cost, scores)
- Two-proportion z-test for rates (error rate, eval pass rate)
- Automatic comparison after 50+ traces on new deployment
- CI/CD integration via REST API for deployment tracking
Available on Growth and Team plans
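In outline, the two-proportion z-test for rates is a few lines of arithmetic (a standalone sketch, not the platform's implementation):

```python
from math import sqrt, erf

def two_proportion_z_test(errors_a: int, n_a: int, errors_b: int, n_b: int):
    """Compare error rates between two deployments."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# e.g. v1: 12 errors in 500 traces; v2: 31 errors in 500 traces
z, p = two_proportion_z_test(12, 500, 31, 500)
print(f"z = {z:.2f}, p = {p:.4f}")
```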
Human Review
Human-in-the-loop quality assurance
Review queues for manual QA and annotation. Approve, reject, or skip traces, then export approved items to golden test datasets.
- Sequential review mode with keyboard shortcuts (1/2/3 for sentiment)
- Approve/reject/skip workflow with notes and labels
- Export approved items to datasets for building golden test sets
- Smart prioritization based on eval scores and error rates
Available on Growth and Team plans
Sessions
Group traces into conversations
Track multi-turn conversations and agent workflows as sessions. See the full context of how users interact with your agent over time.
- Automatic session grouping by session ID
- Session-level aggregation (total cost, duration, trace count)
- Visual timeline of traces within a session
- Works across all SDKs with simple session context
Available on Growth and Team plans
Alerts
Know when things break
Threshold-based alerts via email, Slack, or webhooks. Cooldown periods prevent alert storms. Monitor everything from error rates to eval drift.
- 11 alert metrics: error rate, latency, cost, eval scores, drift, PII, etc.
- Email, Slack, and webhook delivery channels
- Configurable time windows and cooldown periods
- Auto-triggered after evaluation runs complete
Available on Growth and Team plans
Custom Dashboards
Build your own dashboards
Create custom dashboards with configurable widget grids. Track the metrics that matter most to your team with 20+ chart types.
- Configurable widget grids per project
- Custom metric queries for any data in your project
- Shareable dashboard URLs for stakeholders
- Duplicate and customize pre-built templates
Available on Growth and Team plans
Ready to ship reliable AI agents?
Join thousands of developers who trust 2Signal for testing and monitoring their AI agents. Get started in under 5 minutes.
Free forever for small projects. No credit card required.