DEV Community

Are We Using AI at the Wrong Scale?

Kernel Pryanic on April 28, 2026

We open our IDE and let a model running somewhere in the cloud read our entire codebase to add a null check - and track our behaviour along the way...
 
Peter Vivo

The cheapest option, not using AI for a task at all, is always there, but we are lazy. Too lazy to select the perfect model for a specific task. So your idea could be delegated at the AI API level: on the first layer, an AI decides which model to use for the answer.

 
Kernel Pryanic

That's one way to address this: having a router model decide which dedicated model to use. This router model can conveniently be a small one too :) The other part is that there's no mature infrastructure this router can delegate execution to yet. Maybe it can be done the way OpenRouter does it across hosted providers. I hope we'll accelerate in this direction.
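
The router idea reduces to a very small dispatch step. A minimal sketch, not a real implementation: `classify` stands in for the small router model, `call` for whatever client invokes the chosen model, and the model names are placeholders.

```python
# Minimal routing sketch: a small "router" model classifies the request,
# then a task-appropriate model handles it. All names are hypothetical.
ROUTES = {
    "classification": "small-local-7b",   # cheap, fast, good enough
    "extraction":     "small-local-7b",
    "reasoning":      "big-hosted-model", # reserved for hard requests
}

def route(request, classify, call):
    """classify() is the tiny router model; call() invokes the chosen model."""
    task_type = classify(request)                      # e.g. "classification"
    model = ROUTES.get(task_type, "big-hosted-model")  # unknown -> safest tier
    return call(model, request)
```

The nice property is that the routing table, not the app code, encodes the decision of which model gets which call.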

 
Mininglamp

The "wrong scale" framing nails it. There's a pattern emerging in production systems: teams start with GPT-4 for everything, then realize 80% of their calls are simple classification or extraction tasks that a 7B model handles fine. The economic math is brutal — a $0.03/call API for a task a local model does in 200ms for free. The real question isn't "cloud vs local" but "what's the minimum viable model for each task in your pipeline?" Most production agents should probably be running 3-4 different sized models routed by task complexity, not one giant model for everything.
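
The back-of-envelope math behind that looks something like this. The figures are illustrative (the $0.03/call and 80% split from above), not measured benchmarks:

```python
# Back-of-envelope cost comparison: route the simple 80% of calls to a
# free local model instead of a paid API. Figures are illustrative.
calls_per_day = 100_000
simple_fraction = 0.8        # share of calls a small model handles fine
api_cost_per_call = 0.03     # hosted-API price per call

all_big_model = calls_per_day * api_cost_per_call
routed = calls_per_day * (1 - simple_fraction) * api_cost_per_call

daily_saving = all_big_model - routed
```

At that volume it's $3,000/day for the all-big-model setup versus $600/day routed, before even counting the latency win on the local calls.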

 
CapeStart

I like this take because it’s not anti-AI, it’s anti-waste. Matching the tool to the job should be the default.

 
TAMSIV

Solo dev on a voice AI app here, this hits a nerve. We hit the same wall around month 4: Claude or GPT-class for every voice intent, every UUID resolution, every "did the user mean tomorrow morning or this morning". Latency was fine, the bill was not.

What unblocked it: routing per task. Tiny model for intent classification and tokenized search, mid-tier (DeepSeek V4 Flash family) for multi-turn function calling chains, big model only when the user actually needs reasoning. OpenRouter as the swap layer so we can change the routing without touching app code.

The harder part isn't the routing, it's the eval. A small model that's 95% as good is a small model that's 5% catastrophic on the long tail. We had to log every degraded answer and bump the routing tier when patterns repeated. Static "small for cheap, big for hard" doesn't survive contact with users who ask weird things.
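
In sketch form, the escalate-on-repeated-degradation loop looks roughly like this. Tier names and the threshold are invented for illustration; the real version keys on logged failure patterns, not a bare counter:

```python
# Sketch: count degraded answers per intent, bump that intent's routing
# tier once failures repeat. Threshold and tier names are made up.
from collections import Counter

TIERS = ["small", "mid", "big"]
ESCALATE_AFTER = 3   # degraded answers before bumping the tier

class TierRouter:
    def __init__(self):
        self.tier = {}             # intent -> index into TIERS
        self.degraded = Counter()  # intent -> degraded-answer count

    def model_for(self, intent):
        return TIERS[self.tier.get(intent, 0)]

    def report_degraded(self, intent):
        self.degraded[intent] += 1
        if self.degraded[intent] >= ESCALATE_AFTER:
            current = self.tier.get(intent, 0)
            self.tier[intent] = min(current + 1, len(TIERS) - 1)
            self.degraded[intent] = 0  # reset the counter after escalating
```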

The framing in the post is right though. The default of "shove the biggest model into every hole" stops scaling well past a certain volume of requests. The interesting work is in the boring middle layer where you decide which call gets which brain.

 
Kernel Pryanic

That's very useful experience; it shows that in loosely deterministic cases it's often not a simple swap. Very cool that you went down this path to optimize things. Can I ask whether you tried an in-place replacement with smaller models, or fine-tuned them first? Small models have their limits, though you can often sharpen them if data is available.

 
TAMSIV

Good question. Yes, I tested both paths.

In-place smaller model: tried Llama 3.1 8B and Qwen 2.5 7B locally via Ollama for the simpler intents (date parsing, simple "create task X" with no group context). Worked maybe 75% of the time, but the failure mode was bad. Wrong UUIDs for "remind me about Sylvie's appointment" when there are 3 Sylvies in the user's memory. The 25% failure rate destroyed trust faster than the latency win was worth, because each wrong UUID = a task created in the wrong folder, which is harder to spot than no task at all.

Fine-tuning: I haven't gone full SFT yet because I don't have enough labeled data per user (TAMSIV only launched in alpha mid-March, around 100 users actively using the voice). I started collecting (audio, intent, correct UUID) triplets via a thumbs-down feedback loop in the app, with the idea of fine-tuning a 7B once I have a couple thousand corrected pairs per intent class. That's probably 2-3 months out.

What unlocked the cost cut without sacrificing accuracy was routing per task. DeepSeek V4 Flash for the disambiguation-heavy stuff (multi-Sylvie, fuzzy dates), GPT-4o-mini for simple stuff, full Claude only on the rare ambiguous cases. The router itself is a tiny prompt that tags the difficulty class, runs on the cheapest tier, and the rest follows. Cost per call dropped 60% on inference with no quality regression, measured on a 200-sample eval set.
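
Roughly, the dispatch step is just this (a sketch, not the production code; `tag_difficulty` stands in for the cheap tagging prompt):

```python
# Sketch of the difficulty-class router. tag_difficulty() represents the
# tiny prompt that tags the difficulty class; it is hypothetical here.
TIER_TO_MODEL = {
    "simple":       "gpt-4o-mini",        # simple intents
    "disambiguate": "deepseek-v4-flash",  # multi-entity, fuzzy-date cases
    "hard":         "claude",             # rare ambiguous cases
}

def pick_model(utterance, tag_difficulty):
    tier = tag_difficulty(utterance)           # cheapest call in the chain
    return TIER_TO_MODEL.get(tier, "claude")   # unknown tier -> biggest model
```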

Curious about your "loosely deterministic" framing. Is that a term from your team or borrowed? It captures something I've struggled to name when explaining why some tasks feel safe to downsize and others don't.

 
Kernel Pryanic • Edited

Thanks for the insight! Yeah, sounds like routing is the right way to go.

I think fine-tuning could help with small-model issues, but there are also fundamental limits to the capacity small models have, though again those limits can be stretched if you have an abundance of data. When data is lacking, distillation from large models could help. I once had a semi-synthetic dataset for parsing and structuring CVs that worked really well for an 8B model, though I'm not sure that's applicable in your case.
As for the "loosely deterministic" framing, it's just a term I used to describe the complexity and entropy of the environment the model needs to work in.

 
Narnaiezzsshaa Truong

Treating it as a being is what gets us reaching for the largest possible model every time.

If you think you’re “asking a smart assistant,” you reach for the smartest one.
If you think you’re “invoking a tool,” you pick the right tool.

Most users are still in the “assistant” mental model.

 
Juan Torchia

There's a real tension here between scale and trust surface. Sending your entire codebase to a cloud model to add a null check is the kind of thing that would fail a security review at any company handling sensitive data — and yet that's the default workflow most people adopt without thinking. At Lakaut we work with identity validation and digital signatures, so the 'what exactly goes into the context window' question became a hard constraint, not a nice-to-have. Local or sandboxed models aren't just a performance choice at that point.

 
codecraft

The chainsaw to slice bread analogy is the most honest framing of this I've seen. We didn't end up here because anyone made a deliberate decision to over-engineer everything. We ended up here because the cloud paradigm is being pushed hard by everyone with a financial stake in keeping us there, and 'using the biggest available model' became the default before most people thought to question it.

A 5B model, purpose-built for one task, outperforming GPT-class generalists on that task isn't a surprise if you think about it clearly. It's what happens when you match the tool to the job. We just stopped doing that somewhere along the way. The missing piece you're pointing at, the orchestration layer for small-model pipelines, is genuinely the most compelling engineering problem right now. ComfyUI proved the paradigm works for image and video. The equivalent for language tasks, something stable, composable, and not held together by Python, is still waiting to be built. I think that feels like the actual frontier, not the next 100B-parameter announcement.

 
Vic Chen

This resonates a lot. Building AI tools for financial data analysis, I keep seeing the same pattern — teams defaulting to frontier models for tasks a fine-tuned 7B would handle better and cheaper. The "AI as a being" vs "AI as a tool" framing is spot on. Right-sizing models to tasks is probably the most underrated skill in the field right now.

 
Kernel Pryanic • Edited

Yeah, that's the thing: you don't need to chase the latest, largest models. You can stick with a small one, properly fine-tuned, and you possibly won't need to touch it for years.

 
Vic Chen

Same here. Once people see AI as a thinking amplifier instead of a shortcut machine, the outputs get a lot better.

The weird part is that this lesson is simple, but most people only learn it after wasting a lot of time on shallow prompts.

 
Pururva Agarwal

The article nails the AI scale problem. In health, generic models lack cultural depth. A US-trained AI, despite its size, won't reliably interpret "kaaichal" (Tamil for fever) in an Ayurvedic context, and its guardrails forbid cross-verification against "desi ilaaj" (traditional remedies). This is a structural moat.

True utility demands culturally relevant data focus, not just raw parameters. I'm building GoDavaii to tackle these deep contextual challenges.

 
Mykola Kondratiuk

we hit this with agent orchestration - running heavy models for tasks a small classifier could handle. latency and cost stack fast. the tricky part is most teams don’t notice until they’re scaling.

 
Yogesh VK

Rightly said. This is still early adoption phase. We will start fine-tuning this once we understand the true costs better.

 
Varsha Ojha

Good question. Feels like a lot of teams are scaling AI before they’ve figured out where it actually adds value. Bigger usage doesn’t always mean better outcomes.

 
Kernel Pryanic

Yes, in general it's really important to understand the value you can get and the limits it has. With an incorrect assessment, teams can find themselves in deep tech debt and end up letting Claude remove their production DB instances. And unfortunately this rush also feeds the bubble.