Elena Revicheva

Multi-Model LLM Routing: Why I Send 76% to Groq

Originally published on AIdeazz (aideazz.hashnode.dev) — cross-posted here with canonical link.

Running production agents taught me something counterintuitive: using Claude or GPT-4 for everything is like hiring a surgeon to take blood pressure. After analyzing 50,000+ agent interactions across our Oracle-hosted systems, I found that smart multi-model LLM routing cuts costs by 82% while actually improving response times.

The Economics of Intelligence Overkill

Most developers default to the "best" model for everything. I did too, burning $3,400/month on Claude API calls for a Telegram customer service bot that mostly answered FAQs. The wake-up call came when I instrumented our agents and discovered that 76% of queries were simple pattern matching: order status checks, business hours, pricing questions.

Here's what that looked like in production:

  • Claude 3.5 Sonnet: $15/million output tokens
  • Mixtral 8x7B on Groq: $0.24/million output tokens
  • Average response: 180 tokens
  • Daily volume: 8,000 queries

Do the math: that's $21.60/day for Claude versus $0.35/day for Mixtral on routine queries. For context-heavy tasks requiring 2,000+ token responses, the difference becomes $30 versus $0.48 per thousand queries.
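If you want to re-run that math against your own traffic, here's the back-of-envelope version using the same figures quoted above (swap in your own prices and volumes):

# Back-of-envelope cost comparison for routine queries (figures from above).
CLAUDE_PER_M = 15.00      # USD per million output tokens, Claude 3.5 Sonnet
MIXTRAL_PER_M = 0.24      # USD per million output tokens, Mixtral 8x7B on Groq

def daily_cost(queries_per_day, avg_output_tokens, price_per_million):
    tokens = queries_per_day * avg_output_tokens
    return tokens / 1_000_000 * price_per_million

print(daily_cost(8_000, 180, CLAUDE_PER_M))    # ~21.60 USD/day on Claude
print(daily_cost(8_000, 180, MIXTRAL_PER_M))   # ~0.35 USD/day on Mixtral
print(daily_cost(1_000, 2_000, CLAUDE_PER_M))  # ~30.00 USD per 1k long-form queries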

The real insight wasn't just cost—it was latency. Groq delivers Mixtral responses in 89ms average, while Claude averages 1.8 seconds for similar queries. For WhatsApp agents handling order status checks, that 20x speed difference directly impacts user satisfaction.

Building the Router: Classification Before Generation

Multi-model LLM routing starts with accurate query classification. I tried three approaches before finding one that works at scale:

Approach 1: Regex and Keywords (Failed)
Started with pattern matching—if query contains "order" and "status", route to fast model. This broke immediately. Users ask "what's happening with my thing?" or "¿dónde está mi pedido?" ("where is my order?"). Pattern matching couldn't handle language mixing or creative phrasing.
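For reference, the first version looked roughly like this (a simplified reconstruction, not the exact production patterns):

import re

# Naive keyword routing: breaks on paraphrases and mixed-language queries.
FAST_PATTERNS = [
    re.compile(r"\border\b.*\bstatus\b", re.IGNORECASE),
    re.compile(r"\b(business|opening)\s+hours\b", re.IGNORECASE),
    re.compile(r"\bpric(e|ing)\b", re.IGNORECASE),
]

def route_by_keywords(query: str) -> str:
    if any(p.search(query) for p in FAST_PATTERNS):
        return "groq-mixtral"
    return "claude-3.5-sonnet"

# "what's happening with my thing?" -> claude (expensive false negative)
# "¿dónde está mi pedido?"          -> claude (expensive false negative)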

Approach 2: Embedding Similarity (Partially Failed)
Used OpenAI embeddings to compare queries against example clusters. Better than regex, but required maintaining example sets and struggled with edge cases. A query about "refund policy for damaged items during shipping" would match both "refunds" and "shipping" clusters with similar scores.
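The second version scored each query against per-cluster centroid vectors. A minimal sketch of the idea (the cluster names, route map, and threshold here are illustrative, and the centroids are assumed to be precomputed from labeled examples):

import numpy as np

# Illustrative mapping from example clusters to target models.
CLUSTER_ROUTES = {"order_status": "groq-mixtral", "refunds": "groq-llama-70b"}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route_by_similarity(query_vec, centroids, threshold=0.82):
    # Score the query against each cluster centroid built from labeled examples.
    scores = {name: cosine(query_vec, vec) for name, vec in centroids.items()}
    best, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Ambiguous queries ("refund policy for damaged items during shipping")
    # score almost identically against several clusters, so we punt upward.
    if best_score < threshold:
        return "claude-3.5-sonnet"
    return CLUSTER_ROUTES.get(best, "groq-llama-70b")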

Approach 3: Fast Classification Model (Working)
Now I use a dedicated Mistral-7B instance for routing decisions. It sees the query and returns a JSON classification:

{
  "complexity": "simple|moderate|complex",
  "requires_context": true/false,
  "category": "transactional|analytical|creative",
  "confidence": 0.95
}

This classification layer adds 73ms latency but saves seconds on complex queries by routing them correctly upfront. The classifier runs on a single A10 GPU on Oracle Cloud, handling 200 requests/second.
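The classifier itself is just a constrained prompt against that local Mistral-7B instance. A sketch, assuming the model sits behind an OpenAI-compatible chat completions endpoint (vLLM-style); the URL, model name, and prompt wording are illustrative rather than the exact production values:

import json
import requests

CLASSIFIER_URL = "http://mistral-7b.internal:8000/v1/chat/completions"  # illustrative

SYSTEM_PROMPT = (
    "Classify the user query. Respond with JSON only, using the keys "
    "complexity (simple|moderate|complex), requires_context (bool), "
    "category (transactional|analytical|creative), confidence (0-1)."
)

def classify(query: str, history: list[str]) -> dict:
    resp = requests.post(CLASSIFIER_URL, json={
        "model": "mistral-7b-instruct",
        "temperature": 0,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "\n".join(history[-5:] + [query])},
        ],
    }, timeout=2)
    resp.raise_for_status()
    # The routing layer only trusts well-formed JSON; anything else escalates.
    return json.loads(resp.json()["choices"][0]["message"]["content"])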

The Routing Logic That Actually Ships

Here's the decision tree that powers our production agents:

def route_query(query, conversation_history, user_context):
    classification = classifier.analyze(query, conversation_history)

    if classification['complexity'] == 'simple' and classification['confidence'] > 0.9:
        return 'groq-mixtral'  # 76% of queries land here

    if classification['requires_context'] and len(conversation_history) > 5:
        return 'claude-3.5-sonnet'  # Long conversations need memory

    if classification['category'] == 'analytical' and user_context.get('premium'):
        return 'claude-3.5-sonnet'  # Premium users get premium analysis

    if system_load() > 0.8 and classification['complexity'] != 'complex':
        return 'groq-mixtral'  # Graceful degradation under load

    return 'groq-llama-70b'  # Middle ground: better than Mixtral, cheaper than Claude

The key insight: routing isn't just about query complexity. System load, user tier, conversation depth, and business rules all factor in. During Black Friday traffic spikes, we automatically shift more queries to Groq to maintain sub-second response times.

When Frontier Models Become Essential

After routing thousands of queries daily, clear patterns emerged for when Claude or GPT-4 become necessary:

Multi-turn reasoning: Customer negotiating a bulk discount while referencing previous orders, company policies, and competitor pricing. Mixtral loses context after 3-4 turns.

Document analysis: User uploads a 40-page contract asking about liability clauses. Open-weight models hallucinate legal interpretations.

Creative generation: Marketing agency wants 20 variations of ad copy matching their brand voice. Llama-70B produces generic outputs.

Language edge cases: Code-switching between Spanish, English, and Portuguese in the same conversation. Smaller models default to one language.

High-stakes decisions: Refund authorization over $500, medical advice disclaimers, or legal compliance questions. The cost of errors exceeds API pricing.

I track "escalation rate"—how often agents need to bump up to a stronger model mid-conversation. Currently at 4.2%, mostly for context accumulation. If a Mixtral conversation exceeds 10 turns, we seamlessly hand off to Claude, passing full context.

The Oracle Stack: Why Infrastructure Matters

Running multi-model routing on Oracle Cloud Infrastructure shapes our architecture. Key constraints and advantages:

GPU allocation: Oracle's A10 instances are 40% cheaper than AWS for inference, but require 24-hour commitments. This pushes us toward reserved capacity for base load (Mixtral/Llama) while using serverless for Claude spikes.

Network topology: Oracle's FastConnect gives us 1.2ms latency to Groq's API, versus 18ms to Anthropic. For high-frequency trading bots, this 15x difference justifies local model hosting.

Storage patterns: Oracle Object Storage integrates poorly with vector databases. We cache embeddings in Redis, adding complexity but improving query classification speed by 3x.

Kubernetes overhead: OKE (Oracle Kubernetes Engine) adds 8-12% overhead versus bare metal, but enables zero-downtime model swaps. Critical when updating router logic without dropping conversations.

The production setup:

  • 3x A10 nodes: Mistral-7B classifier, Llama-70B inference, Mixtral backup
  • 2x CPU nodes: Redis, request routing, monitoring
  • 1x GPU node: Emergency overflow for Claude API failures
  • Total monthly cost: $1,847 (versus $4,200 on AWS)

Failure Modes and Recovery Patterns

Multi-model systems fail in ways single-model deployments don't:

Model inconsistency: Mixtral says "refund approved," then Claude reviews and says "refund denied" in the same conversation. Solution: tag responses with model version and maintain decision logs.

Context loss during handoff: User builds complex query with Mixtral, escalates to Claude, which lacks nuanced context. Solution: summarize key decisions before handoff, not just message history.
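What fixed the handoff problem was passing a structured summary of decisions alongside the recent messages, not the raw transcript. Roughly what the payload looks like; the field names and the six-message tail are illustrative:

def build_handoff_payload(conversation_history: list[dict],
                          decisions: list[str],
                          summarize) -> dict:
    """Package context for the stronger model: decisions first, raw tail second.

    `summarize` is whatever cheap model call compresses older turns; here it is
    just a callable that takes a list of messages and returns a short summary.
    """
    older = conversation_history[:-6]
    recent = conversation_history[-6:]
    return {
        "decisions_so_far": decisions,               # e.g. "partial refund offered"
        "summary_of_earlier_turns": summarize(older) if older else "",
        "recent_messages": recent,                   # verbatim last few turns
    }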

Latency cascades: Classifier timeout causes fallback to Claude, which overloads during traffic spike. Solution: circuit breakers with local Mixtral fallback, accept degraded quality over downtime.
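The circuit breaker is a few lines of state, not a framework. A minimal sketch of the pattern (thresholds and the fallback name are illustrative):

import time

class ModelCircuitBreaker:
    """Open the circuit after repeated failures; serve local Mixtral meanwhile."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0

    def pick_model(self, preferred: str) -> str:
        circuit_open = (self.failures >= self.failure_threshold
                        and time.monotonic() - self.opened_at < self.cooldown_s)
        # Degraded quality beats dropped conversations.
        return "local-mixtral" if circuit_open else preferred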

Cost explosion: Bug in classification logic routes 100% to Claude for 6 hours. Burned $420 before alerts fired. Solution: hard spending limits with automatic model downgrade.
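The spending guard that would have caught that bug is equally small: track spend per model per day and downgrade once a hard cap is crossed. A sketch with an illustrative cap (prices are the published ones above):

from collections import defaultdict
from datetime import date

DAILY_CAP_USD = {"claude-3.5-sonnet": 60.0}    # hard ceiling per model per day (illustrative)
PRICE_PER_M_OUTPUT = {"claude-3.5-sonnet": 15.0, "groq-mixtral": 0.24}

_spend = defaultdict(float)   # (date, model) -> USD spent so far

def record_usage(model: str, output_tokens: int):
    _spend[(date.today(), model)] += output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT[model]

def enforce_budget(model: str) -> str:
    cap = DAILY_CAP_USD.get(model)
    if cap is not None and _spend[(date.today(), model)] >= cap:
        return "groq-mixtral"   # automatic downgrade instead of a surprise invoice
    return model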

Real failure from last month: Groq API went down during our peak hours. The system correctly failed over to local Mixtral, but the local model hadn't been updated in 3 weeks. It gave outdated product information until we noticed customer complaints 2 hours later. Now we sync model knowledge daily and test failover paths.

Measuring What Matters: Beyond Cost Per Token

Traditional metrics miss the point of multi-model routing. Here's what I actually track:

Resolution rate by model: Mixtral resolves 84% of simple queries without escalation. Claude resolves 97% but costs 62x more. The 13% difference isn't worth the cost for FAQ-style queries.

Time-to-resolution: Groq-hosted models average 1.4 seconds to a complete response. Claude averages 4.8 seconds. For impatient WhatsApp users, speed beats marginal quality improvements.

Conversation profit margin: Revenue per conversation minus infrastructure cost. Premium tier conversations averaging $12 revenue justify Claude ($0.38 cost). Free tier averaging $0.80 revenue requires Mixtral ($0.006 cost).
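It's one subtraction per conversation, using the tier numbers above:

def conversation_margin(revenue_usd: float, model_cost_usd: float) -> float:
    return revenue_usd - model_cost_usd

print(conversation_margin(12.00, 0.38))    # premium tier on Claude   -> 11.62
print(conversation_margin(0.80, 0.006))    # free tier on Mixtral     -> ~0.79
print(conversation_margin(0.80, 0.38))     # free tier on Claude: margin nearly halved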

Model drift detection: Weekly comparison of model outputs on standard queries. Llama-70B quality improved 12% over 3 months, changing our routing thresholds.

User satisfaction by model: NPS scores show no statistically significant difference between Mixtral and Claude for transactional queries, but an 8-point gap for analytical queries.

The meta-lesson: optimize for business outcomes, not model benchmarks. A fast, cheap response that solves the user's problem beats a slow, expensive response with marginally better prose.

The Tactical Reality of Multi-Model Production

After shipping agents that handle everything from customer support to trading signals, here's what multi-model LLM routing actually requires:

Version lock everything: Models change. Claude 3.5 behaves differently than Claude 3. Your router logic assumes specific model behaviors. Pin versions and test updates in staging.

Log every decision: When a customer complains about inconsistent responses, you need to trace which model said what and why the router chose it. Storage is cheap, debugging production is expensive.
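One structured log line per routing decision is enough. A sketch of roughly what ours captures (field names are illustrative):

import json
import logging
import time

log = logging.getLogger("router")

def log_routing_decision(conversation_id: str, model: str,
                         classification: dict, reason: str):
    # One JSON line per decision makes "which model said that and why?" a grep.
    log.info(json.dumps({
        "ts": time.time(),
        "conversation_id": conversation_id,
        "model": model,
        "classification": classification,
        "reason": reason,              # e.g. "simple+high_confidence", "load_shed"
    }))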

Design for partial degradation: When GPT-4 rate limits hit, can you serve 80% quality at 100% availability? Build graceful degradation into your routing logic.

Monitor cost per conversation, not per token: A conversation that bounces between models might use more tokens but cost less overall. Track end-to-end metrics.

Plan for model deprecation: OpenAI sunset GPT-3. Anthropic will sunset Claude 2. Your router must handle model retirement without breaking production flows.

Multi-model LLM routing isn't about using the cheapest model everywhere. It's about matching computational cost to business value, dynamically, at scale. When you nail it, you serve more users, faster, with better economics. The 76% of queries I send to Groq subsidize the 24% that need frontier model intelligence. That's how we ship AI agents that actually make money.

— Elena Revicheva · AIdeazz · Portfolio
