Most AI agents are over-provisioned. They use the most expensive model for every request, regardless of complexity. A simple "What's your return policy?" gets the same GPT-4o treatment as "Compare these three insurance plans across 12 dimensions and recommend the best option."
This is the single biggest source of wasted spend. Here's how to fix it.
Step 1: See Where the Money Goes
Before optimizing, you need visibility. Wrap your LLM client with 2signal to automatically track cost per request:
from twosignal.wrappers import wrap_openai
from openai import OpenAI
client = wrap_openai(OpenAI())
# Every call now tracks:
# - model used
# - prompt tokens / completion tokens
# - cost in USD
Within a day of production traffic, you'll see the cost distribution in your dashboard. In most agents, 80% of requests are simple enough for a cheaper model.
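The dashboard surfaces this distribution for you, but the underlying check is simple. As a rough offline sketch (the threshold and the shape of the exported data are assumptions, not 2signal's format), bucketing per-request costs shows why routing pays off:

```python
# Rough offline check of a cost distribution, assuming you can export
# per-request costs as a plain list of USD floats.
def cost_distribution(costs, threshold_usd=0.005):
    """Return (share of requests below threshold, share of spend above it)."""
    cheap = [c for c in costs if c < threshold_usd]
    total = sum(costs)
    expensive_spend = sum(c for c in costs if c >= threshold_usd)
    return len(cheap) / len(costs), expensive_spend / total

# Illustrative traffic: 80% simple requests, 20% complex ones
costs = [0.001] * 80 + [0.025] * 20
share_cheap, spend_expensive = cost_distribution(costs)
print(share_cheap)                 # 0.8  — 80% of requests are cheap...
print(round(spend_expensive, 2))   # 0.86 — ...but the other 20% drive most spend
```

The asymmetry is the whole story: a minority of requests dominates the bill, so routing just that minority correctly captures most of the savings.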
Step 2: Route by Complexity
2signal's model routing analyzes each request and routes it to the right model. Simple queries go to fast, cheap models. Complex queries go to capable, expensive models.
# Configure routing rules in the dashboard or via API
POST /api/v1/route-model
{
"input": "What's your return policy?",
"configName": "support-agent"
}
# Response: { "model": "gpt-4o-mini", "complexity": { "score": 0.15 } }
# Simple query → cheap model
POST /api/v1/route-model
{
"input": "Compare plan A vs plan B across coverage, deductibles, and out-of-pocket maximums, then recommend based on a family of 4 with...",
"configName": "support-agent"
}
# Response: { "model": "gpt-4o", "complexity": { "score": 0.82 } }
# Complex query → capable model
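From application code, the calls above are a single POST. A minimal sketch using only the standard library — the host and API key are placeholders (check your 2signal settings); the path and payload shape mirror the examples above:

```python
# Calling the route-model endpoint from Python. Host and auth header
# value are placeholders; payload shape matches the API example above.
import json
import urllib.request

def build_payload(user_input: str, config_name: str = "support-agent") -> dict:
    return {"input": user_input, "configName": config_name}

def route_model(user_input: str, config_name: str = "support-agent") -> dict:
    req = urllib.request.Request(
        "https://api.example.com/api/v1/route-model",  # placeholder host
        data=json.dumps(build_payload(user_input, config_name)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder key
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# decision = route_model("What's your return policy?")
# decision["model"] would then be e.g. "gpt-4o-mini" for a low-complexity input
```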
Routing Rule Examples
| Condition | Model | Typical Cost |
|---|---|---|
| Complexity < 0.3 | gpt-4o-mini | ~$0.001 |
| Complexity < 0.6 | gpt-4o-mini | ~$0.002 |
| Complexity >= 0.6 | gpt-4o | ~$0.02 |
| Keywords: "legal", "compliance" | gpt-4o | ~$0.02 |
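The rules above reduce to a few lines of logic. This is a local sketch of the example table, not 2signal's internal routing — the thresholds, keywords, and model names are just the ones from the rows above:

```python
# Local sketch of the routing rules: keyword overrides win, then a
# complexity threshold decides between cheap and capable models.
def pick_model(complexity: float, text: str) -> str:
    if any(kw in text.lower() for kw in ("legal", "compliance")):
        return "gpt-4o"       # keyword override, regardless of score
    if complexity >= 0.6:
        return "gpt-4o"       # complex query → capable model
    return "gpt-4o-mini"      # everything else → cheap model

print(pick_model(0.15, "What's your return policy?"))     # gpt-4o-mini
print(pick_model(0.82, "Compare plan A vs plan B ..."))   # gpt-4o
print(pick_model(0.20, "Review this compliance question"))  # gpt-4o (keyword)
```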
Step 3: Set Cost Guardrails
Add a Cost evaluator to flag requests that exceed your budget. This catches prompt bloat, unnecessary tool calls, and other cost spikes.
# Cost evaluator config
COST: {
max_cost_usd: 0.05, # hard ceiling
target_cost_usd: 0.02 # ideal cost
}
# Scores: 1.0 (at or below target) → 0.0 (at or above max)
# Any score below 0.5 = something is probably wrong
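The config above only pins the endpoints (1.0 at or below target, 0.0 at or above max); assuming a linear ramp between them — an assumption, since the shape isn't specified — the score works out like this:

```python
# Sketch of the cost score: 1.0 at/below target, 0.0 at/above max,
# linear interpolation in between (the interpolation is an assumption).
def cost_score(cost_usd: float, target: float = 0.02, max_cost: float = 0.05) -> float:
    if cost_usd <= target:
        return 1.0
    if cost_usd >= max_cost:
        return 0.0
    return (max_cost - cost_usd) / (max_cost - target)

print(cost_score(0.01))    # 1.0 — under target
print(cost_score(0.05))    # 0.0 — at the hard ceiling
print(cost_score(0.035))   # 0.5 — halfway between target and max
```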
Step 4: Verify Quality Didn't Drop
The risk with cost optimization is degrading quality. Run an LLM Judge evaluator alongside cost checks to ensure cheaper models are still producing good results:
LLM_JUDGE: {
criteria: "Is the response accurate, complete, and helpful?",
scale: "1-5",
model: "gpt-4o-mini" # Meta: use a cheap model to evaluate
}
If quality scores drop after switching to cheaper models for simple queries, adjust your complexity threshold upward.
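That feedback loop can be automated. A minimal sketch, with illustrative numbers (the quality floor, step size, and ceiling are assumptions you'd tune, not 2signal defaults):

```python
# Sketch of the threshold-adjustment loop: if average LLM Judge scores
# on cheap-model responses fall below a floor, raise the complexity
# threshold so fewer queries route to the cheap model.
def adjust_threshold(threshold, judge_scores, floor=4.0, step=0.05, ceiling=0.9):
    """judge_scores: 1-5 LLM Judge scores for cheap-model responses."""
    avg = sum(judge_scores) / len(judge_scores)
    if avg < floor:
        return min(threshold + step, ceiling)  # send more traffic to the big model
    return threshold

print(adjust_threshold(0.6, [4.5, 4.8, 4.2]))  # 0.6  — quality fine, no change
print(adjust_threshold(0.6, [3.1, 3.4, 3.0]))  # 0.65 — quality dipped, raise it
```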
The Math
Here's what this looks like with real numbers:
| Scenario | Avg Cost/Request | Monthly (100K requests) |
|---|---|---|
| All GPT-4o | $0.025 | $2,500 |
| With routing (80/20 split) | $0.006 | $600 |
| Savings | | $1,900 (76%) |
The exact savings depend on your traffic distribution, but we consistently see 50–75% cost reduction with intelligent routing.
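To sanity-check the table yourself, plug in your own per-model costs. Assuming ~$0.001/request for the cheap model and $0.025/request for GPT-4o (the figures used above), the arithmetic is:

```python
# Reproducing the savings math with assumed per-request costs.
requests_per_month = 100_000
cheap_cost, expensive_cost = 0.001, 0.025   # assumed $/request
cheap_share = 0.8                           # the 80/20 split above

all_gpt4o = expensive_cost * requests_per_month                  # $2,500/month
blended = cheap_share * cheap_cost + (1 - cheap_share) * expensive_cost
with_routing = blended * requests_per_month
savings_pct = (all_gpt4o - with_routing) / all_gpt4o * 100

print(round(blended, 4))      # 0.0058 — rounds to the table's ~$0.006
print(round(with_routing))    # 580    — ~$600/month after routing
print(round(savings_pct))     # 77     — roughly the 76% in the table
```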
Common Mistakes
- Routing too aggressively — Start conservative. Route only the obviously simple queries to cheaper models and expand gradually.
- Not measuring quality — Always pair cost optimization with quality evaluation. Saving money on bad responses isn't saving money.
- Ignoring token counts — Sometimes the cost problem isn't the model, it's the prompt. Check if you're sending unnecessary context.
- One-time optimization — Traffic patterns change. Review routing rules monthly.
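On the token-count point: a crude bloat check catches most offenders before you reach for a real tokenizer. This sketch uses the common ~4 characters/token heuristic (an approximation; a proper count would use the model's tokenizer) with an illustrative budget:

```python
# Rough prompt-bloat check using the ~4 chars/token heuristic.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def flag_bloat(messages, budget_tokens=2000):
    """messages: list of strings sent as context. Returns (total, over_budget)."""
    total = sum(estimate_tokens(m) for m in messages)
    return total, total > budget_tokens

total, over = flag_bloat(["x" * 12_000])  # ~3,000 tokens of context
print(total, over)  # 3000 True — this prompt is a routing-independent cost problem
```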
Learn more about model routing or set up cost evaluators to start tracking your spend.