Stop Using GPT-4 for Everything

Most AI products send every request to GPT-4. It's expensive, slow, and unnecessary. Here's how semantic routing cuts your OpenAI bill by 60% without losing quality.

The problem

You built an AI product. Users love it. Then you check your OpenAI dashboard: $2,000/month and growing. Every message — whether it's "hi" or a complex analysis request — goes through GPT-4o at $10/M tokens.

80% of queries are simple. Greetings, FAQ, basic lookups. They don't need GPT-4. They need a fast, cheap model that responds in 200ms instead of 2 seconds.
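As a back-of-envelope sketch of why the split matters (illustrative only — it uses the input-token prices quoted in this post and an assumed 80/20 traffic split, and ignores output tokens, which is why real savings will differ):

```python
def blended_cost_per_million(simple_share: float,
                             cheap_price: float = 0.15,   # gpt-4o-mini, $/M input tokens
                             expensive_price: float = 10.0  # gpt-4o, $/M input tokens
                             ) -> float:
    """Average cost per million input tokens when `simple_share` of traffic
    is routed to the cheap model and the rest to the expensive one."""
    return simple_share * cheap_price + (1 - simple_share) * expensive_price

# With 80% of traffic on the cheap model:
# 0.8 * 0.15 + 0.2 * 10.0 = $2.12/M input tokens, vs $10/M for all-GPT-4o
```

The exact percentage you save depends on your traffic mix and output-token volume, but the shape of the math is the same: most of the bill comes from sending easy queries to the expensive model.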

The solution: semantic routing

Route each query to the right model based on its complexity: simple queries go to a cheap, fast model, and only genuinely hard ones go to GPT-4o.

How the router works

A lightweight classifier (can be rule-based or a small model) analyzes the incoming query and assigns a complexity score. Based on that score, it routes to the appropriate model.

# Simplified routing logic
def classify_complexity(text: str) -> float:
    # Toy heuristic: longer messages and analytical keywords score higher
    score = min(len(text) / 500, 0.5)
    keywords = ("analyze", "compare", "explain why", "step by step")
    if any(kw in text.lower() for kw in keywords):
        score += 0.5
    return score

def route_query(text: str) -> str:
    complexity = classify_complexity(text)
    if complexity < 0.3:
        return "gpt-4o-mini"   # $0.15/M tokens
    elif complexity < 0.7:
        return "gemini-flash"  # Free tier available
    else:
        return "gpt-4o"        # $10/M tokens

The classifier itself can be as simple as keyword matching + message length, or as sophisticated as a fine-tuned embedding model. In our production systems, we use a hybrid approach that adds ~5ms of latency.

Real numbers

From our production bot (600+ active users), one principle held up:

The best model for a query is the cheapest model that gives an acceptable answer.

Implementation tips

  1. Start with rules. Don't overthink the classifier. Short messages + common patterns = mini. Everything else = full model.
  2. Log everything. Track which model handled each query and whether the user was satisfied. This data trains your router.
  3. Add fallback. If the cheap model produces a low-confidence answer, automatically escalate to the expensive one.
  4. A/B test. Route 10% of simple queries to GPT-4o and compare. If users can't tell the difference, your router works.
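Tip 3 can be sketched in a few lines. This is a minimal illustration, not our production code: `call_model` is a hypothetical stand-in for your API client, and `looks_low_confidence` is a deliberately crude heuristic you'd replace with your own signal (logprobs, a verifier model, or explicit refusal detection):

```python
def looks_low_confidence(answer: str) -> bool:
    # Crude heuristic: empty or hedging answers trigger escalation
    hedges = ("i'm not sure", "i can't help", "as an ai")
    return not answer.strip() or any(h in answer.lower() for h in hedges)

def answer_with_fallback(query: str, call_model) -> tuple[str, str]:
    """Try the cheap model first; escalate to the expensive one if the
    answer looks weak. Returns (model_used, answer)."""
    cheap_answer = call_model("gpt-4o-mini", query)
    if looks_low_confidence(cheap_answer):
        return "gpt-4o", call_model("gpt-4o", query)
    return "gpt-4o-mini", cheap_answer
```

Returning the model name alongside the answer makes tip 2 (log everything) free: you already know which model served each response.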

Beyond cost savings

Routing isn't just about money. It's about speed. GPT-4o-mini responds 3-5x faster than GPT-4o. For chat interfaces, that's the difference between "snappy" and "laggy." Users notice.

It's also about resilience. If OpenAI has an outage (and they do), your router can failover to Gemini or Claude automatically. Multi-model architecture is more robust than single-vendor dependence.
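The failover logic is the simplest part of the whole setup. A minimal sketch, assuming each provider is wrapped in a callable that raises on timeout or a 5xx response (the provider names and callables here are placeholders, not real client code):

```python
def call_with_failover(query: str, providers) -> str:
    """Try each (name, call_fn) pair in order; return the first success.

    `providers` is an ordered list, e.g. OpenAI first, Gemini as backup.
    """
    last_error = None
    for name, call in providers:
        try:
            return call(query)
        except Exception as exc:  # e.g. timeout or vendor outage
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Ordering the list by preference means the backup vendor only ever sees traffic during an outage, so the cost impact is negligible.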

Want us to audit your AI costs?

Describe your setup on Telegram — we'll tell you where you're overspending. Or book a call if you prefer to talk.
