Stop Using GPT-4 for Everything
Most AI products send every request to GPT-4. It's expensive, slow, and unnecessary. Here's how semantic routing cuts your OpenAI bill by 60% without losing quality.
The problem
You built an AI product. Users love it. Then you check your OpenAI dashboard: $2,000/month and growing. Every message — whether it's "hi" or a complex analysis request — goes through GPT-4o at $10/M tokens.
80% of queries are simple. Greetings, FAQ, basic lookups. They don't need GPT-4. They need a fast, cheap model that responds in 200ms instead of 2 seconds.
The solution: semantic routing
Route each query to the right model based on complexity:
- Simple queries (greetings, FAQ, yes/no) → GPT-4o-mini ($0.15/M tokens)
- Medium queries (summarization, extraction) → GPT-4o-mini or Gemini Flash
- Complex queries (analysis, reasoning, code) → GPT-4o ($10/M tokens)
How the router works
A lightweight classifier (can be rule-based or a small model) analyzes the incoming query and assigns a complexity score. Based on that score, it routes to the appropriate model.
```python
# Simplified routing logic
def route_query(text: str) -> str:
    complexity = classify_complexity(text)  # score in [0, 1]
    if complexity < 0.3:
        return "gpt-4o-mini"   # $0.15/M tokens
    elif complexity < 0.7:
        return "gemini-flash"  # free tier available
    else:
        return "gpt-4o"        # $10/M tokens
```
The classifier itself can be as simple as keyword matching + message length, or as sophisticated as a fine-tuned embedding model. In our production systems, we use a hybrid approach that adds ~5ms of latency.
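The simple end of that spectrum can be sketched in a few lines. This is a hypothetical rule-based `classify_complexity` (the pattern lists and thresholds are illustrative, not our production values):

```python
import re

# Illustrative rule-based complexity classifier. SIMPLE_PATTERNS,
# COMPLEX_KEYWORDS, and all thresholds are made-up examples.
SIMPLE_PATTERNS = [r"\b(hi|hello|thanks|thank you|bye)\b", r"^\s*(yes|no)\s*[.!?]?\s*$"]
COMPLEX_KEYWORDS = {"analyze", "compare", "explain why", "refactor", "debug", "prove"}

def classify_complexity(text: str) -> float:
    lowered = text.lower()
    # Obvious small talk scores near zero.
    if any(re.search(p, lowered) for p in SIMPLE_PATTERNS):
        return 0.1
    # Reasoning/code keywords push the score toward 1.0.
    if any(kw in lowered for kw in COMPLEX_KEYWORDS):
        return 0.9
    # Otherwise, scale with message length (capped at 400 characters).
    return min(len(text) / 400, 1.0) * 0.6

print(classify_complexity("hi there"))                  # 0.1 -> routes to mini
print(classify_complexity("analyze these error logs"))  # 0.9 -> routes to gpt-4o
```

A fine-tuned embedding model replaces the keyword checks with a learned similarity score, but the routing interface stays identical: text in, a number in [0, 1] out.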
Real numbers
From our production bot (600+ active users):
- Before routing: ~$1,800/month (all GPT-4o)
- After routing: ~$720/month (mixed models)
- Quality difference: undetectable by users
- Latency improvement: 40% faster average response
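Those savings follow directly from the blended per-token price. A back-of-envelope calculation using the figures above (80% of traffic on the $0.15/M tier, the rest on $10/M; the split variable is illustrative):

```python
# Back-of-envelope blended cost per million tokens.
# Prices from the tiers above: gpt-4o-mini $0.15/M, gpt-4o $10/M.
def blended_price(cheap_share: float = 0.8,
                  cheap_price: float = 0.15,
                  strong_price: float = 10.0) -> float:
    return cheap_share * cheap_price + (1 - cheap_share) * strong_price

print(blended_price())  # 2.12 ($/M tokens, vs. 10.0 for all-GPT-4o)
```

The theoretical ceiling is higher than the 60% we measured; in practice some "simple" traffic still escalates, and medium queries cost more than zero.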
The best model for a query is the cheapest model that gives an acceptable answer.
Implementation tips
- Start with rules. Don't overthink the classifier. Short messages + common patterns = mini. Everything else = full model.
- Log everything. Track which model handled each query and whether the user was satisfied. This data trains your router.
- Add fallback. If the cheap model produces a low-confidence answer, automatically escalate to the expensive one.
- A/B test. Route 10% of simple queries to GPT-4o and compare. If users can't tell the difference, your router works.
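The fallback tip is the one piece that needs actual control flow. A minimal sketch, assuming the cheap model reports a confidence score alongside its answer (the stub models and the 0.6 threshold are hypothetical):

```python
from typing import Callable

# Escalate to the expensive model when the cheap model's
# self-reported confidence falls below a threshold.
def answer_with_fallback(
    query: str,
    cheap_model: Callable[[str], tuple[str, float]],  # returns (answer, confidence)
    strong_model: Callable[[str], str],
    threshold: float = 0.6,
) -> str:
    answer, confidence = cheap_model(query)
    if confidence >= threshold:
        return answer
    # Low confidence: pay for a second pass on the stronger model.
    return strong_model(query)

# Stub models for demonstration; real code would call the provider SDKs.
cheap = lambda q: ("short answer", 0.3 if "why" in q else 0.9)
strong = lambda q: "detailed answer"

print(answer_with_fallback("what time does support open", cheap, strong))  # short answer
print(answer_with_fallback("why did revenue drop", cheap, strong))         # detailed answer
```

The escalation path doubles latency for the queries that take it, which is why the threshold is worth tuning against the logs from tip two.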
Beyond cost savings
Routing isn't just about money. It's about speed. GPT-4o-mini responds 3-5x faster than GPT-4o. For chat interfaces, that's the difference between "snappy" and "laggy." Users notice.
It's also about resilience. If OpenAI has an outage (and they do), your router can failover to Gemini or Claude automatically. Multi-model architecture is more robust than single-vendor dependence.
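The failover itself is a short loop. A sketch under the assumption that each provider is wrapped in a callable that raises on outage or rate limit (the provider names and stub functions are illustrative, not real SDK calls):

```python
# Try providers in priority order; fall through to the next on any error.
def call_with_failover(prompt, providers):
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # timeouts, rate limits, outages
            errors.append((name, repr(exc)))
    raise RuntimeError(f"All providers failed: {errors}")

def flaky(prompt):    # simulates an OpenAI outage
    raise TimeoutError("provider unavailable")

def healthy(prompt):  # simulates a responsive fallback provider
    return f"response to: {prompt}"

name, answer = call_with_failover("hello", [("openai", flaky), ("gemini", healthy)])
print(name)  # gemini
```

In production you would also want per-provider timeouts and a circuit breaker so a dead provider stops eating its timeout on every request.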
Want us to audit your AI costs?
Describe your setup on Telegram — we'll tell you where you're overspending. Or book a call if you prefer to talk.