Stop Using GPT-4 for Everything

Most AI products send every request to GPT-4. It's expensive, slow, and unnecessary. Here's how semantic routing can cut your OpenAI bill by about 60% without losing quality.

TL;DR

The problem

You built an AI product. Users love it. Then you check your OpenAI dashboard: $2,000/month and growing. Every message, whether it's "hi" or a complex analysis request, goes through GPT-4o at $10 per million output tokens.

About 80% of queries are simple: greetings, FAQs, basic lookups. They don't need GPT-4. They need a fast, cheap model that responds in 200ms instead of 2 seconds.

The solution: semantic routing

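Route each query to the right model based on complexity: simple queries go to GPT-4o-mini, mid-complexity queries go to Gemini Flash, and complex queries go to GPT-4o.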

How the router works

A lightweight classifier (can be rule-based or a small model) analyzes the incoming query and assigns a complexity score. Based on that score, it routes to the appropriate model.

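# Simplified routing logic.
# classify_complexity() is assumed to return a score between 0 and 1.
# Prices are approximate list prices and change over time.
def route_query(text: str) -> str:
    """Return the name of the cheapest model tier suited to the query."""
    complexity = classify_complexity(text)
    if complexity < 0.3:
        return "gpt-4o-mini"   # ~$0.15/M input tokens
    elif complexity < 0.7:
        return "gemini-flash"  # Free tier available
    else:
        return "gpt-4o"        # ~$10/M output tokens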

The classifier itself can be as simple as keyword matching + message length, or as sophisticated as a fine-tuned embedding model. In our production systems, we use a hybrid approach that adds ~5ms of latency.
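As a starting point, here is a minimal sketch of the "keyword matching + message length" end of that spectrum. It is not our production classifier: the patterns, keywords, and saturation points below are hypothetical values picked for illustration, and you would tune them against your own logs.

```python
import re

# Hypothetical patterns: whole messages that are just a greeting or an acknowledgement.
SIMPLE_MESSAGE = re.compile(
    r"^\s*(hi|hello|hey|thanks|thank you|ok|yes|no)\b[\s!.?]*$",
    re.IGNORECASE,
)

# Hypothetical keywords that hint at analysis, reasoning, or code work.
COMPLEX_KEYWORDS = (
    "analyze", "compare", "explain why", "refactor",
    "debug", "step by step", "summarize",
)


def classify_complexity(text: str) -> float:
    """Return a rough complexity score in [0, 1] from length and keywords."""
    if SIMPLE_MESSAGE.match(text):
        return 0.0

    # Length signal: grows linearly and saturates at 100 words.
    length_score = min(len(text.split()) / 100, 1.0)

    # Keyword signal: one hit scores 0.5, two or more hits score 1.0.
    lowered = text.lower()
    hits = sum(1 for keyword in COMPLEX_KEYWORDS if keyword in lowered)
    keyword_score = min(hits / 2, 1.0)

    # Take the stronger of the two signals.
    return max(length_score, keyword_score)
```

With the 0.3 and 0.7 thresholds from the routing logic above, a bare "hi" lands in the cheapest tier, a single-keyword request such as "Summarize this paragraph" lands in the middle tier, and a request with two or more complexity keywords goes to the full model.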

Real numbers

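From our production bot (600+ active users): monthly OpenAI spend dropped from ~$1,800 to ~$720, roughly a 60% reduction, and overall latency fell by about 40% on average.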

The best model for a query is the cheapest model that gives an acceptable answer.

Implementation tips

  1. Start with rules. Don't overthink the classifier. Short messages + common patterns = mini. Everything else = full model.
  2. Log everything. Track which model handled each query and whether the user was satisfied. This data trains your router.
  3. Add a fallback. If the cheap model produces a low-confidence answer, automatically escalate to the expensive one.
  4. A/B test. Route 10% of simple queries to GPT-4o and compare. If users can't tell the difference, your router works.
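
Tips 2 to 4 can be sketched in a few lines. Everything named here is an assumption for illustration: `call_model` is a hypothetical function you supply that returns an answer plus a confidence score in [0, 1] (how you compute that confidence is up to you), the 0.6 threshold is arbitrary, and the log record is just one possible shape.

```python
import hashlib
import json
from typing import Callable, Tuple

# Hypothetical signature: (model_name, prompt) -> (answer, confidence in [0, 1]).
ModelCall = Callable[[str, str], Tuple[str, float]]


def answer_with_fallback(
    text: str,
    call_model: ModelCall,
    cheap_model: str = "gpt-4o-mini",
    strong_model: str = "gpt-4o",
    min_confidence: float = 0.6,  # arbitrary threshold, tune on your own data
    log: Callable[[str], None] = print,
) -> dict:
    """Try the cheap model first, escalate on low confidence, log the outcome."""
    answer, confidence = call_model(cheap_model, text)
    model_used, escalated = cheap_model, False

    if confidence < min_confidence:
        answer, confidence = call_model(strong_model, text)
        model_used, escalated = strong_model, True

    record = {
        "query": text,
        "model": model_used,
        "escalated": escalated,
        "confidence": confidence,
        "answer": answer,
        "user_satisfied": None,  # fill in later from user feedback
    }
    log(json.dumps(record))  # one JSON line per query
    return record


def in_ab_sample(query_id: str, fraction: float = 0.10) -> bool:
    """Deterministically place roughly `fraction` of query IDs in the test group."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return bucket < fraction
```

Hashing the query ID instead of calling a random number generator keeps the A/B assignment stable across retries, which makes the logged comparisons easier to trust.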

Beyond cost savings

Routing isn't just about money. It's about speed. GPT-4o-mini responds 3-5x faster than GPT-4o. For chat interfaces, that's the difference between "snappy" and "laggy." Users notice.

It's also about resilience. If OpenAI has an outage (and it does happen), your router can fail over to Gemini or Claude automatically. Multi-model architecture is more robust than single-vendor dependence.
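
A failover loop can be very small. The sketch below assumes each provider is wrapped in a plain function that takes a prompt and returns text; the wrapper functions and the provider names are hypothetical placeholders, not real SDK calls.

```python
from typing import Callable, List, Sequence, Tuple

# Each provider is a (name, call) pair; `call` is a hypothetical wrapper
# around that vendor's SDK that takes a prompt and returns the reply text.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, reply) from the first that works."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in real code, catch the vendor's specific error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Order the list by preference. In practice you would also add timeouts, so a slow provider counts as a failed one instead of stalling the chat.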

Frequently Asked Questions

What is semantic routing for LLMs?

Semantic routing classifies each incoming query by complexity and sends it to the cheapest model that can handle it. Simple queries (greetings, FAQs, yes/no) go to small models like GPT-4o-mini. Complex queries (analysis, reasoning, code) go to GPT-4o. The classifier itself adds about 5ms of latency.

How much can semantic routing reduce OpenAI costs?

In NeCL's production bot serving 600+ active users, monthly OpenAI spend dropped from ~$1,800 to ~$720, roughly a 60% reduction, with no detectable change in answer quality.
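
To run the same arithmetic on your own bill, a small helper is enough. The function name is made up for this example; the two figures are the approximate monthly amounts quoted above.

```python
def savings_fraction(before: float, after: float) -> float:
    """Return the fraction of spend saved, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before


monthly_before = 1800  # approximate monthly spend before routing, in USD
monthly_after = 720    # approximate monthly spend after routing, in USD

print(f"Saved per month: ${monthly_before - monthly_after}")                 # prints "Saved per month: $1080"
print(f"Reduction: {savings_fraction(monthly_before, monthly_after):.0%}")   # prints "Reduction: 60%"
```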

Will quality drop if I route easy queries to cheaper models?

Not if the router is calibrated. About 80% of real-world queries are simple enough that smaller models handle them as well as GPT-4. A confidence-based fallback automatically escalates low-confidence answers to the larger model.

Does the router add latency?

The classifier adds ~5ms. But routing reduces overall latency by 40% on average, because GPT-4o-mini responds 3–5x faster than GPT-4o for the queries that don't need full reasoning.

How do I build a query complexity classifier?

Start with rules: short messages and common patterns → cheap model; everything else → full model. Log which model handled each query and whether the user was satisfied. That data lets you upgrade later to a fine-tuned embedding classifier if needed.

Can semantic routing also improve reliability?

Yes. A multi-model router can fail over automatically when one provider has an outage, for example by routing to Gemini or Claude if OpenAI is down. Multi-vendor architecture is more robust than single-vendor dependence.

Want us to audit your AI costs?

Describe your setup on Telegram — we'll tell you where you're overspending. Or book a call if you prefer to talk.
