March 8, 2026 · 6 min read

Most AI agents are over-provisioned. They use the most expensive model for every request, regardless of complexity. A simple "What's your return policy?" gets the same GPT-4o treatment as "Compare these three insurance plans across 12 dimensions and recommend the best option."

This is the single biggest source of wasted spend. Here's how to fix it.

Step 1: See Where the Money Goes

Before optimizing, you need visibility. Wrap your LLM client with 2signal to automatically track cost per request:

from twosignal.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

# Every call now tracks:
# - model used
# - prompt tokens / completion tokens
# - cost in USD

Within a day of production traffic, you'll see the cost distribution in your dashboard. In most agents, 80% of requests are simple enough for a cheaper model.
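That 80/20 shape is easy to see in the raw data. Here is a minimal sketch of the analysis, using a hypothetical list of per-request costs (the log format and the $0.005 cutoff are illustrative assumptions, not the wrapper's actual output shape):

```python
# Hypothetical per-request costs in USD, as the wrapper might record them.
cost_log = [0.002, 0.001, 0.003, 0.024, 0.002, 0.001, 0.031, 0.002, 0.001, 0.002]

# How many requests were cheap enough that a smaller model likely sufficed?
cheap = [c for c in cost_log if c < 0.005]
print(f"{len(cheap) / len(cost_log):.0%} of requests cost under $0.005")

# ...and yet the expensive tail dominates total spend:
tail = sum(c for c in cost_log if c >= 0.005)
print(f"the expensive tail is {tail / sum(cost_log):.0%} of total spend")
```

In this toy sample, 80% of requests are cheap, but the two expensive requests account for roughly 80% of the bill. That skew is what makes routing worthwhile.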

Step 2: Route by Complexity

2signal's model routing analyzes each request and routes it to the right model. Simple queries go to fast, cheap models. Complex queries go to capable, expensive models.

# Configure routing rules in the dashboard or via API
POST /api/v1/route-model
{
  "input": "What's your return policy?",
  "configName": "support-agent"
}

# Response: { "model": "gpt-4o-mini", "complexity": { "score": 0.15 } }
# Simple query → cheap model
POST /api/v1/route-model
{
  "input": "Compare plan A vs plan B across coverage, deductibles, and out-of-pocket maximums, then recommend based on a family of 4 with...",
  "configName": "support-agent"
}

# Response: { "model": "gpt-4o", "complexity": { "score": 0.82 } }
# Complex query → capable model

Routing Rule Examples

| Condition | Model | Typical Cost |
|---|---|---|
| Complexity < 0.3 | gpt-4o-mini | ~$0.001 |
| Complexity < 0.6 | gpt-4o-mini | ~$0.002 |
| Complexity >= 0.6 | gpt-4o | ~$0.02 |
| Keywords: "legal", "compliance" | gpt-4o | ~$0.02 |
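The rules above boil down to a small decision function. This local sketch mirrors the table (the thresholds and keyword list come from the table; the function itself is illustrative, not the routing service's implementation):

```python
def route_model(text: str, complexity: float) -> str:
    """Pick a model using the routing rules from the table above."""
    # Keyword rules take priority: legal/compliance always gets the strong model.
    if any(kw in text.lower() for kw in ("legal", "compliance")):
        return "gpt-4o"
    # Otherwise route on the complexity score (0.0-1.0). Both sub-0.6 rows in
    # the table map to gpt-4o-mini, so a single threshold covers them here.
    if complexity < 0.6:
        return "gpt-4o-mini"
    return "gpt-4o"

print(route_model("What's your return policy?", 0.15))      # gpt-4o-mini
print(route_model("Compare plan A vs plan B ...", 0.82))    # gpt-4o
print(route_model("Review our compliance policy", 0.10))    # gpt-4o (keyword)
```

In production you would call the `/api/v1/route-model` endpoint instead, but the shape of the decision is the same.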

Step 3: Set Cost Guardrails

Add a Cost evaluator to flag requests that exceed your budget. This catches prompt bloat, unnecessary tool calls, and other cost spikes.

# Cost evaluator config
COST: {
  max_cost_usd: 0.05,    # hard ceiling
  target_cost_usd: 0.02  # ideal cost
}

# Scores: 1.0 (at or below target) → 0.0 (at or above max)
# Any score below 0.5 = something is probably wrong

Step 4: Verify Quality Didn't Drop

The risk with cost optimization is degrading quality. Run an LLM Judge evaluator alongside cost checks to ensure cheaper models are still producing good results:

LLM_JUDGE: {
  criteria: "Is the response accurate, complete, and helpful?",
  scale: "1-5",
  model: "gpt-4o-mini"  # Meta: use a cheap model to evaluate
}

If quality scores drop after switching to cheaper models for simple queries, adjust your complexity threshold upward.
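That feedback loop can be as simple as a periodic check on the judge scores for cheap-model traffic. A minimal sketch, where the 4.0-of-5 floor is an illustrative choice rather than a product default:

```python
def should_raise_threshold(judge_scores: list[float], floor: float = 4.0) -> bool:
    """Flag a quality regression on traffic served by the cheap model.
    Returns True when the average judge score falls below the floor."""
    avg = sum(judge_scores) / len(judge_scores)
    return avg < floor

# Judge scores (1-5 scale) for responses the cheap model handled:
print(should_raise_threshold([5, 4, 5, 5, 4]))  # False: quality held up
print(should_raise_threshold([3, 4, 2, 3, 4]))  # True: tighten routing
```

When this fires, bump the complexity threshold up (say, 0.6 to 0.7) so borderline queries go back to the stronger model.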

The Math

Here's what this looks like with real numbers:

| Scenario | Avg Cost/Request | Monthly (100K requests) |
|---|---|---|
| All GPT-4o | $0.025 | $2,500 |
| With routing (80/20 split) | $0.006 | $600 |
| Savings | | $1,900 (76%) |

The exact savings depend on your traffic distribution, but we consistently see 50–75% cost reduction with intelligent routing.
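The arithmetic behind those numbers is worth making explicit. The 80/20 split and the per-request costs come from the tables above; everything else is just multiplication:

```python
def monthly_cost(avg_cost_per_request: float, requests: int = 100_000) -> float:
    """Monthly spend at a given average per-request cost."""
    return avg_cost_per_request * requests

# Blended cost with 80% of traffic on gpt-4o-mini (~$0.001)
# and 20% on gpt-4o (~$0.025):
blended = 0.8 * 0.001 + 0.2 * 0.025   # 0.0058, i.e. ~$0.006 per request

baseline = monthly_cost(0.025)             # all-GPT-4o
routed = monthly_cost(round(blended, 3))   # with routing
print(f"savings: ${baseline - routed:,.0f} ({(baseline - routed) / baseline:.0%})")
```

Plug in your own traffic split and per-model costs to estimate your ceiling before turning routing on.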

Common Mistakes

  • Routing too aggressively — Start conservative. Route only the obviously simple queries to cheaper models and expand gradually.
  • Not measuring quality — Always pair cost optimization with quality evaluation. Saving money on bad responses isn't saving money.
  • Ignoring token counts — Sometimes the cost problem isn't the model, it's the prompt. Check if you're sending unnecessary context.
  • One-time optimization — Traffic patterns change. Review routing rules monthly.

Learn more about model routing or set up cost evaluators to start tracking your spend.