# Cost
Scores traces based on total cost. Set a budget — the score drops linearly as cost approaches the maximum.
## How Cost Is Calculated
The cost evaluator sums the cost fields from all LLM spans in the trace. Each span's cost is calculated by the Python SDK automatically based on the model used and the token counts (input tokens + output tokens).
The SDK maintains an internal pricing table that maps model identifiers (e.g., gpt-4o, claude-sonnet-4-20250514) to per-token rates. When a span is recorded with usage data (prompt tokens, completion tokens), the SDK multiplies by the model's rate and attaches the cost to the span. The evaluator then sums these span-level costs to get the total trace cost.
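The lookup-and-multiply step can be sketched as below. The model names are taken from the examples above, but the per-token rates and the `PRICING` table itself are illustrative assumptions, not the SDK's actual pricing data:

```python
# Hypothetical pricing table: USD per 1M tokens (rates are made-up examples).
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def span_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one LLM span; unrecognized models contribute 0.0."""
    rates = PRICING.get(model)
    if rates is None:
        return 0.0
    return (prompt_tokens * rates["input"]
            + completion_tokens * rates["output"]) / 1_000_000

# span_cost("gpt-4o", 1200, 300) is about $0.006 with these example rates
```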
## Cost Sources
Cost data flows through the system as follows:
- Your agent makes LLM calls. The LLM provider returns token usage in the response.
- The 2Signal SDK captures the `model` name and `usage` (prompt_tokens, completion_tokens) from the LLM response.
- The SDK looks up the model in its pricing table and calculates `cost = (input_tokens * input_rate) + (output_tokens * output_rate)`.
- This cost is attached to the span's metadata and sent to the 2Signal API with the trace.
- The cost evaluator sums all span costs to produce the total trace cost, then scores it against your thresholds.
If a span does not have cost data (e.g., non-LLM spans like tool calls or custom spans), it contributes $0 to the total. Only spans with both a recognized model and token usage are costed.
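The aggregation rule above (costed spans sum, everything else contributes $0) can be sketched like this. The span dictionaries are an illustrative stand-in, not the SDK's actual span schema:

```python
# Sum span-level costs into a trace total. Spans without a "cost" key
# (tool calls, custom spans) contribute 0.0, matching the rule above.
def total_trace_cost(spans: list[dict]) -> float:
    return sum(span.get("cost", 0.0) for span in spans)

trace = [
    {"name": "llm_call_1", "cost": 0.012},
    {"name": "search_tool"},               # non-LLM span: no cost data
    {"name": "llm_call_2", "cost": 0.004},
]
# total_trace_cost(trace) is about $0.016
```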
## Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `max_cost_usd` | number | Yes | — | Max acceptable cost in USD (score = 0 above this) |
| `target_cost_usd` | number | No | max / 2 | Ideal cost in USD (score = 1 below this) |
## Example

```json
{
  "max_cost_usd": 0.05,
  "target_cost_usd": 0.01
}
```

## Scoring

- Below `target_cost_usd`: score = 1.0
- Between target and max: linear interpolation (1.0 → 0.0)
- Above `max_cost_usd`: score = 0.0
## Use Cases
- Budget enforcement — set a per-trace cost ceiling that matches your economics. If a single agent run costs more than your revenue from that interaction, you need to know immediately.
- Cost regression detection — catch when prompt changes, new tools, or model upgrades silently increase costs. A prompt that adds "think step by step" might double your token usage.
- Comparing model efficiency — run the same workload across different models and compare cost scores. A smaller model might score 1.0 on cost while still passing your quality evaluators.
- Optimizing agent loops — agents that retry or loop excessively burn tokens. Low cost scores highlight traces where the agent ran more steps than necessary.
## Choosing Thresholds
Thresholds vary widely depending on the complexity of your agent's tasks:
| Use Case | target_cost_usd | max_cost_usd | Notes |
|---|---|---|---|
| Simple Q&A / classification | 0.001 | 0.01 | Single LLM call, short context |
| RAG with retrieval | 0.005 | 0.03 | Retrieval + generation, moderate context |
| Multi-step reasoning | 0.01 | 0.10 | Chain-of-thought, multiple LLM calls |
| Agents with tool calls | 0.05 | 0.50 | Iterative tool use, potentially many LLM calls |
| Complex multi-agent workflows | 0.10 | 1.00 | Multiple agents collaborating, large contexts |
Start by running without a cost evaluator, then check the cost distribution in your dashboard to set realistic thresholds based on actual data.
## Alerts
Cost evaluator scores integrate with the 2Signal alerting system. When a trace scores below your configured alert threshold, you can receive notifications in the dashboard. This is especially useful for catching cost spikes in production before they impact your bill.
Combine cost alerts with usage monitoring (available on the Usage page in the dashboard) for full budget visibility. Usage monitoring tracks aggregate monthly spend, while the cost evaluator catches individual expensive traces.