AI coding tools are incredible.
They are also starting to feel a bit like ordering room service at a five-star hotel.
You ask for "one small change," blink twice, and suddenly your token bill is wearing a bow tie.
That sounds dramatic, but if you use these tools every day, you know the feeling. Claude Code, OpenCode, Cursor, Copilot, all of it. We are living through this very odd moment where a half-formed thought can turn into a working patch before your coffee cools down. Things that used to take three browser tabs, two Stack Overflow answers, and one quiet crisis at the keyboard can now happen in a single conversation.
The awkward part is that they work.
So you use them constantly.
And once you use them constantly, the economics stop being theoretical.
Token limits. Session limits. Rate limits. Premium models. Long context windows. Repeated retries. Agents writing updates to other agents like they are defending dissertations nobody asked for.
After a while, every prompt had an imaginary CFO sitting next to it.
Are you sure you want to ask that question?
Could this be done by a cheaper model?
Did you really need frontier-model reasoning to rename a variable?
Annoying little guy.
Also, annoyingly, not wrong.
That started changing how I worked.
I asked fewer questions. I tried fewer weird ideas. I caught myself saving prompts for "real" work, which is a strange place to end up with a tool that is supposed to make development feel lighter.
I wanted to keep the magic of AI coding without treating every iteration like a small financial event. I wanted to refactor, experiment, throw things away, ask dumb questions, ask better questions, and generally behave like a developer without feeling like every prompt needed a purchase order.
The Problem Wasn't the Expensive Model
I don't think premium models are the villain.
They are expensive because they are good. Annoying, but fair.
If I am making an architecture decision, I want the big brain in the room. If I am doing security review, yes, please send the adult. If I am debugging something production-shaped that smells faintly of DNS, IAM, and regret, fine. Summon the oracle.
But updating docs?
Routing a task?
Scheduling implementation work?
Making a small mechanical change that is basically "move this thing over there and run the test"?
That probably does not need the most expensive model available.
That is how the paperwork metaphor got stuck in my head. Using a frontier model for every tiny task feels like asking your most senior engineer to staple documents. They can do it. They may do it beautifully. There may even be an elegant staple placement strategy. But financially, spiritually, and operationally, something has gone off the rails.
Quality was not the thing bothering me. Allocation was. We had started treating every AI task as if it needed the same level of reasoning, the same amount of context, and the same budget. Convenient, yes. Also lazy in a very expensive way.
The Open Model Detour
The other thing I could not ignore: cheaper and open-weight models had quietly become good.
Not "replace everything tomorrow" good.
Not "redesign my distributed system while I make lunch" good.
But absolutely "why am I paying premium rates for this README edit?" good.
The price differences can be silly. In some cases, you are not talking about 10% cheaper. You are talking about one or two orders of magnitude cheaper. That changes behavior. When every request feels expensive, you get cautious. You ask fewer questions. You experiment less. You stop doing those weird little exploratory loops where you learn things.
AI turns from a development partner into an emergency resource in a glass box.
Break only in case of billable disaster.
My first attempt at fixing this was not elegant. I patched Claude Code to point at a local Ollama server instead of Anthropic's API. It worked, which was both exciting and slightly cursed. I got decent uninterrupted work done, but it felt clunky. Like taping a jet engine to a bicycle and then insisting the bicycle had achieved flight.
That detour pushed me toward OpenCode, because OpenCode is model-agnostic. You can connect different providers and switch models without changing your whole workflow.
That mattered more than I expected, because model choice stopped feeling like a separate chore and started feeling like part of the workflow.
The Better Question
Once model switching was easy, I kept coming back to the same question: why is one model doing all the jobs?
Not because cheap models should do everything. That would be a different kind of mistake.
What started to make sense was using the right model for the right job, and not dragging unnecessary context into every call.
Some work is coordination. Some work is implementation. Some work is review. Some work is genuinely hard reasoning. Those should not all be priced, routed, or treated the same way.
You would not ask your most expensive lawyer to print the documents, book the meeting room, and staple the contract.
You want them involved when judgment matters. You do not need them guarding the printer tray.
AI models are not exactly lawyers, thank God, but the principle is similar.
Use expensive reasoning where expensive reasoning changes the outcome. Use cheaper models where the task is bounded, mechanical, or mostly coordination.
That split became the first useful rule of thumb.
| Tier | Used For | Why It Helps |
|---|---|---|
| Cheap and fast | Routing, scheduling, bounded tasks | Keeps routine coordination inexpensive |
| Balanced | Everyday implementation and review | Preserves quality where most work happens |
| Premium | Deep architecture, hard problems, security escalation | Saves expensive reasoning for high-leverage moments |
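The tiering above can be sketched as a small lookup from task kind to model tier. Everything here is illustrative, not oowl's actual configuration: the tier names, task kinds, and model IDs are made up for the example.

```python
# Illustrative model router: map task kinds to model tiers.
# Tier names and model IDs are hypothetical, not oowl's real config.
TIERS = {
    "cheap": "provider/small-fast-model",
    "balanced": "provider/mid-model",
    "premium": "provider/frontier-model",
}

# Bounded coordination goes cheap; everyday work goes balanced;
# judgment-heavy tasks get the expensive reasoning.
TASK_TIER = {
    "route": "cheap",
    "schedule": "cheap",
    "docs-update": "cheap",
    "implement": "balanced",
    "review": "balanced",
    "architecture": "premium",
    "security": "premium",
}

def pick_model(task_kind: str) -> str:
    """Return the model for a task, defaulting to the balanced tier."""
    tier = TASK_TIER.get(task_kind, "balanced")
    return TIERS[tier]

print(pick_model("docs-update"))    # cheap tier
print(pick_model("architecture"))   # premium tier
```

The useful property is the default: anything unclassified falls to the balanced tier, so forgetting to categorize a task costs you a mid-tier call, not a frontier one.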
Spend the Big Brain Where It Matters
Cost is where the difference shows up first.
Not in the sad sense of "make it cheaper by making it worse."
Nobody wants a cheap system that creates expensive problems. That is not savings. That is deferred pain with better branding.
Spend money where it matters. Put premium models on judgment-heavy tasks. Put balanced models on normal implementation. Put cheap and fast models on bounded coordination work.
That is a much healthier pattern than sending everything through the same premium model and hoping the invoice develops mercy.
It also changes how you work. Lower marginal cost makes experimentation feel normal again. You ask the extra question. You test the weird idea. You let the machine do the boring cleanup. That matters, because good software work is often iterative and mildly messy before it becomes obvious.
Make the Agents Stop Writing Novellas
The second cost saving is funnier, because it is really about manners.
Agents talk too much.
Not to users. Users deserve clarity. I want the explanation. I want the reasoning. I want to know what changed and why.
I mean agents talking to other agents.
A typical AI handoff can be ridiculously verbose:
After carefully analyzing the implementation surface area, I have determined that the backend engineer should inspect the following files and consider the implications of the existing authentication flow...
Sir, this is a task handoff. Nobody asked for the director's commentary.
The user-facing explanation can stay clear. The machine-to-machine parts can be terse.
Something like:
Need backend fix. Files: X, Y. Run tests: Z. Risk: auth. Do not touch docs.
Beautiful.
Primitive.
Financially responsible.
The gimmick is funny, but the point is real: internal communication is not free. Every extra word passed between agents costs tokens. Every long summary adds context. Every polite essay burns budget without necessarily improving the outcome.
At some point, verbosity becomes a feature you are paying to remove.
Context Is Not Free
The cost part was obvious. The performance part snuck up on me.
Large AI conversations get heavy in a way that is hard to notice until you have lived inside one for a while. When one big agent does everything, it drags a huge amount of context around with it: requirements, file contents, previous decisions, failed attempts, summaries, side quests, emotional baggage, maybe a recipe for banana bread from three prompts ago.
That context has a cost.
It can slow things down. It can increase spend. It can also make the model less focused. Not always dramatically, but enough that you feel the drag.
The more irrelevant material in the context window, the more chances the model has to get distracted, overreach, or "helpfully" modify something nobody asked it to touch. We have all seen that moment where a small task becomes a surprise renovation.
The boring-but-effective version: give each role the context it needs, not the whole attic.
The planner gets planning context.
The frontend engineer gets frontend context.
The backend engineer gets backend context.
The reviewer gets the implementation summary and verification evidence.
We already know this from normal work: you do not invite the entire company into every meeting. At least, you should not. Some calendars are crimes.
Same idea here.
Smaller task-specific context usually means faster calls, lower cost, and better focus. It also makes the system easier to reason about. If something goes wrong, you can usually tell where it happened: design, planning, implementation, review, or escalation.
That is much better than staring at one giant chat transcript and wondering where the raccoon entered the ventilation system.
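The role-scoped context idea can be sketched as tagging context items and filtering per role. The tags and roles below are illustrative assumptions, not how any particular tool stores context.

```python
# Sketch: each role receives only the context items tagged for it.
# Tags, roles, and items are hypothetical examples.
CONTEXT = [
    {"tags": {"planning"}, "text": "Requirements summary"},
    {"tags": {"frontend"}, "text": "Component tree and CSS conventions"},
    {"tags": {"backend"}, "text": "API handlers and DB schema"},
    {"tags": {"review"}, "text": "Implementation summary + test output"},
    {"tags": {"planning", "backend"}, "text": "Service boundaries"},
]

def context_for(role: str) -> list[str]:
    """Return only the context items tagged for this role."""
    return [item["text"] for item in CONTEXT if role in item["tags"]]

print(context_for("backend"))
# ['API handlers and DB schema', 'Service boundaries']
```

The backend engineer never sees the CSS conventions, and the reviewer never drags around failed implementation attempts. Smaller windows, fewer distractions.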
Process Keeps the Magic From Touching Production Unsupervised
Cost and speed are great.
But the thing that made this feel worth turning into a real workflow was control.
AI coding can feel magical, but magic is not a software delivery process. Magic is useful. Magic is fun. Magic should probably not have write access to production-adjacent files without an adult nearby.
A single agent can write a lot of code quickly. That is useful. It is also risky if there are no boundaries.
Sometimes the agent edits unrelated files.
Sometimes it skips tests.
Sometimes it solves the wrong problem confidently.
Sometimes it decides that your small UI tweak is the perfect moment to refactor authentication, redesign the database schema, and spiritually rebrand the company. Inspiring? Maybe. Requested? Absolutely not.
At that point, plain old software process starts looking less boring.
For substantial work, the flow should look more like an actual delivery process.
Yes, there is ceremony in that. I get it. Nobody wants every typo fix to become a government procurement process.
Tiny, safe tasks should have a fast path.
But for substantial work, I want the boring safeguards:
- Before code is written, there is a design.
- Before implementation starts, the user approves the plan.
- Implementation tasks have file locks.
- Sensitive areas like auth, payments, secrets, production config, and personal data trigger extra care.
- Agents must provide verification evidence before calling work complete.
- Review is a real phase, not a vibes-based victory lap.
Not to slow everything down. To stop AI from being confidently chaotic.
Because the problem with AI coding is not that it cannot produce code. It can produce lots of code. Buckets of code. Code with confidence. Code with explanations. Code that looks like it has somewhere important to be.
The hard part is making sure it produces the right code, in the right place, with the right review, for the right reason.
What This Turned Into
That is the shape of oowl: OpenCode Opinionated Workflow Layer.
The name is a mouthful, but the idea is not. It is an open-source workflow layer on top of OpenCode that treats AI coding less like one heroic chat session and more like a small engineering process.
Instead of one giant agent carrying the whole universe, work moves through specialized roles: dispatcher, architect, planner, implementation agents, reviewers, security escalation, low-cost workers for bounded tasks, premium agents when the problem is actually hard.
Basically, a small engineering team.
Except nobody asks to move standup to 4:30 and the backend engineer does not have opinions about espresso.
Not cheap models everywhere. More like: put the expensive thinking where it belongs, and keep the rest lean:
- Model switching reduces cost by matching the model to the task.
- Caveman-Lite, the terse agent-to-agent handoff style, reduces waste by compressing internal communication.
- Smaller task-specific contexts improve speed and focus.
- An opinionated SDLC process makes the workflow safer and more reliable.
The reason I care about this is that AI coding is moving from novelty to infrastructure.
The early question was:
Can AI write useful code?
The more useful question now is:
Can AI help ship software reliably, repeatedly, affordably, and safely?
That changes the shape of the work.
And honestly, that is where the conversation gets more interesting to me.
Model quality matters, obviously. Better models help. But once the models are useful enough, workflow quality starts to matter just as much.
The future of AI development probably is not one chat window with one all-powerful model doing everything. That is a useful demo, but it is a weird operating model.
Orchestration feels more durable.
Different agents. Different responsibilities. Different models. Clear handoffs. Compressed internal communication. Human approval where it actually matters.
That is the piece oowl is trying to explore.
Not perfect. Not universal. Definitely not the final form of anything. But useful enough that I wanted to put it out there.
Still magical.
Just with fewer surprise invoices.
And slightly more caveman.