# Cost
Scores traces based on total cost. Set a budget — the score drops linearly as cost approaches the maximum.
## How Cost Is Calculated
The cost evaluator sums the cost fields from all LLM spans in the trace. Each span's cost is calculated by the Python SDK automatically based on the model used and the token counts (input tokens + output tokens).
The SDK maintains an internal pricing table that maps model identifiers (e.g., gpt-4o, claude-sonnet-4-20250514) to per-token rates. When a span is recorded with usage data (prompt tokens, completion tokens), the SDK multiplies by the model's rate and attaches the cost to the span. The evaluator then sums these span-level costs to get the total trace cost.
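The lookup-and-multiply step can be sketched as below. The model names are taken from the examples above, but the per-token rates and the `PRICING` table itself are illustrative assumptions, not the SDK's actual pricing data:

```python
# Hypothetical pricing table: USD per 1M tokens (rates are made-up examples).
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def span_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one LLM span; unrecognized models contribute 0.0."""
    rates = PRICING.get(model)
    if rates is None:
        return 0.0
    return (prompt_tokens * rates["input"]
            + completion_tokens * rates["output"]) / 1_000_000

# span_cost("gpt-4o", 1200, 300) is about $0.006 with these example rates
```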
## Cost Sources
Cost data flows through the system as follows:
- Your agent makes LLM calls. The LLM provider returns token usage in the response.
- The 2Signal SDK captures the `model` name and `usage` (prompt_tokens, completion_tokens) from the LLM response.
- The SDK looks up the model in its pricing table and calculates `cost = (input_tokens * input_rate) + (output_tokens * output_rate)`.
- This cost is attached to the span's metadata and sent to the 2Signal API with the trace.
- The cost evaluator sums all span costs to produce the total trace cost, then scores it against your thresholds.
If a span does not have cost data (e.g., non-LLM spans like tool calls or custom spans), it contributes $0 to the total. Only spans with both a recognized model and token usage are costed.
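The aggregation rule above (costed spans sum, everything else contributes $0) can be sketched like this. The span dictionaries are an illustrative stand-in, not the SDK's actual span schema:

```python
# Sum span-level costs into a trace total. Spans without a "cost" key
# (tool calls, custom spans) contribute 0.0, matching the rule above.
def total_trace_cost(spans: list[dict]) -> float:
    return sum(span.get("cost", 0.0) for span in spans)

trace = [
    {"name": "llm_call_1", "cost": 0.012},
    {"name": "search_tool"},               # non-LLM span: no cost data
    {"name": "llm_call_2", "cost": 0.004},
]
# total_trace_cost(trace) is about $0.016
```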
## Config
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `max_cost_usd` | number | Yes | — | Max acceptable cost in USD (score = 0 above this) |
| `target_cost_usd` | number | No | max / 2 | Ideal cost in USD (score = 1 below this) |
## Example

```json
{
  "max_cost_usd": 0.05,
  "target_cost_usd": 0.01
}
```

## Scoring

- Below `target_cost_usd`: score = 1.0
- Between target and max: linear interpolation (1.0 → 0.0)
- Above `max_cost_usd`: score = 0.0
## Use Cases
- Budget enforcement — set a per-trace cost ceiling that matches your economics. If a single agent run costs more than your revenue from that interaction, you need to know immediately.
- Cost regression detection — catch when prompt changes, new tools, or model upgrades silently increase costs. A prompt that adds "think step by step" might double your token usage.
- Comparing model efficiency — run the same workload across different models and compare cost scores. A smaller model might score 1.0 on cost while still passing your quality evaluators.
- Optimizing agent loops — agents that retry or loop excessively burn tokens. Low cost scores highlight traces where the agent ran more steps than necessary.
## Choosing Thresholds
Thresholds vary widely depending on the complexity of your agent's tasks:
| Use Case | target_cost_usd | max_cost_usd | Notes |
|---|---|---|---|
| Simple Q&A / classification | 0.001 | 0.01 | Single LLM call, short context |
| RAG with retrieval | 0.005 | 0.03 | Retrieval + generation, moderate context |
| Multi-step reasoning | 0.01 | 0.10 | Chain-of-thought, multiple LLM calls |
| Agents with tool calls | 0.05 | 0.50 | Iterative tool use, potentially many LLM calls |
| Complex multi-agent workflows | 0.10 | 1.00 | Multiple agents collaborating, large contexts |
Start by running without a cost evaluator, then check the cost distribution in your dashboard to set realistic thresholds based on actual data.
## Alerts
Cost evaluator scores integrate with the 2Signal alerting system. When a trace scores below your configured alert threshold, you can receive notifications in the dashboard. This is especially useful for catching cost spikes in production before they impact your bill.
Combine cost alerts with usage monitoring (available on the Usage page in the dashboard) for full budget visibility. Usage monitoring tracks aggregate monthly spend, while the cost evaluator catches individual expensive traces.