I was building a small bookmark app last weekend. You send it a URL, Gemini
summarizes and tags the page, the result goes into Postgres. A few hundred lines
of TypeScript.
The first version cost almost nothing. One LLM call per URL, that's it. Then I
added "tools" so the model could fetch pages, look up similar bookmarks, or
check things against Google Search.
My token bill quadrupled.
That's where most people building agents land the first time. Going from a
plain chat call to an agent loop is way more expensive than the docs make it
sound, and the reason isn't obvious until you watch the round trips happen one by one.
Let's do that.
What a plain LLM call costs
Here's the simplest LLM call in TypeScript with @google/genai:
import { GoogleGenAI } from '@google/genai';
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });
const res = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: 'Summarize this article: ...',
});
console.log(res.text);
One request out, one response back. You pay for two things:
- Input tokens for your prompt
- Output tokens for the model's reply
That's it. Two numbers on your bill. If your prompt is 500 tokens and the answer
is 200, you pay for 700 tokens. Done.
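You can see both numbers on every response. A minimal check, assuming the usageMetadata field names from the current @google/genai response shape:
const usage = res.usageMetadata;
console.log({
  input: usage?.promptTokenCount,      // your prompt
  output: usage?.candidatesTokenCount, // the model's reply
  total: usage?.totalTokenCount,
});
Logging this per request is the cheapest observability you'll ever add, and it's how every number later in this post was measured.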
Now add a single tool
Tools are how the model talks to the outside world. Calling an API, querying a
database, fetching a URL, anything. You describe each tool with a small JSON
schema, and the model can ask to "call" one mid-conversation. You actually run
the function, send the result back, and the model writes its final answer using
that result.
The basic version:
import { GoogleGenAI, Type } from '@google/genai';
const tools = [{
functionDeclarations: [{
name: 'getWeather',
description: 'Get the weather of any city',
parameters: {
type: Type.OBJECT,
properties: {
location: { type: Type.STRING },
},
required: ['location'],
},
}],
}];
const first = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: 'What is the weather in Tokyo?',
config: { tools },
});
console.log(first.text);
// → undefined
undefined?
The model didn't answer. It returned a structured request:
first.functionCalls
// → [{ name: 'getWeather', args: { location: 'Tokyo' } }]
This is the part that surprises people. The model got asked a question, and
instead of answering, it asked you to run a function. So you do that and
ship the result back:
const result = getWeather('Tokyo'); // { temperature: 23, condition: 'sunny' }
const second = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: [
{ role: 'user', parts: [{ text: 'What is the weather in Tokyo?' }] },
{ role: 'model', parts: [{ functionCall: { name: 'getWeather', args: { location: 'Tokyo' } } }] },
{ role: 'user', parts: [{ functionResponse: { name: 'getWeather', response: result } }] },
],
config: { tools },
});
console.log(second.text);
// → "It's 23°C and sunny in Tokyo."
Two LLM calls. One question. That's the agent tax.
Why we can't just do it in one call
The first reaction (mine too): why can't the model just answer in one shot?
The reason is simple. The model can't predict what the tool will return. The
temperature in Tokyo isn't in its training data, the API hasn't been hit yet,
the result doesn't exist. You can't write "It's 23°C in Tokyo" before you know
it's 23°C.
So turn 1 is "decide what to do." Turn 2 is "use what you learned." They can't
be merged. The model has no memory between calls.
One exception is worth knowing about: server-side tools. Things like
googleSearch or urlContext in Gemini run inside Google's own servers, and
the API returns one merged response. From your side it looks like a single call.
You lose some control (you can't see exactly what got searched), but you save
a round trip.
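Enabling one looks roughly like this in @google/genai (a sketch; the grounded response shape varies by model and API version):
const grounded = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: 'What is the weather in Tokyo right now?',
  // googleSearch runs on Google's side; you get one merged response
  // instead of a functionCall you have to fulfill yourself.
  config: { tools: [{ googleSearch: {} }] },
});
console.log(grounded.text);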
Counting the actual tokens
Here's where the cost lives. Look at what turn 2 has to send compared to turn 1:
| | Turn 1 in | Turn 1 out | Turn 2 in | Turn 2 out |
|---|---|---|---|---|
| System prompt | yes | | yes, billed again | |
| Tool schemas | yes | | yes, billed again | |
| User question | yes | | yes, billed again | |
| Model's tool call | | yes | yes, as input | |
| Your tool result | | | yes | |
| Final answer | | | | yes |
Your system prompt and tool definitions get sent to the API twice. Turn 1
doesn't free you from re-sending everything in turn 2, because the model is
stateless. It forgets the whole conversation between calls.
Real numbers from my bookmark agent:
- System prompt: ~200 tokens
- 4 tool declarations: ~400 tokens
- User question: ~50 tokens
- Tool result (a few rows from Postgres): ~300 tokens
Plain LLM call: ~650 in + ~200 out = ~850 tokens
One-tool agent: ~1300 in + ~230 out = ~1530 tokens (about 1.8x)
And that's the best case. Exactly one tool call, no follow-ups. Real agents are
worse. A lot worse.
Real agents grow quadratically
The bookmark agent does three things on a new URL:
- Fetch the page (fetchUrl tool)
- Look for similar existing bookmarks in the DB (searchSimilar tool)
- Pick a category from the user's existing taxonomy (getTaxonomy tool)
That's 4 LLM turns total. Ask, get tool calls, send back results, ask again,
get more calls, send results, finally write the summary.
What the cumulative input size looks like each turn:
| Turn | What gets sent | Input tokens |
|---|---|---|
| 1 | system + schemas + URL | 700 |
| 2 | + previous calls + fetchUrl result (~1500 of page) | 2200 |
| 3 | + searchSimilar result | 2400 |
| 4 | + getTaxonomy result | 2600 |
Total input across all turns: about 7900 tokens to summarize one webpage.
For comparison, a plain generateContent({ contents: "summarize this:\n" + pageText })
costs ~1500 input + 200 output. About 1700 tokens.
Same task. Almost 5x the bill.
It gets worse. Cost grows quadratically with the number of turns, because
each turn replays everything that came before. A 10-turn agent isn't 10x the
cost. It's closer to 30x.
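A bare loop makes the mechanism obvious. This is a sketch, not the bookmark agent's real code; it assumes dispatchers is your own map from tool name to an async implementation:
async function runAgent(prompt: string) {
  // The full history gets replayed on every turn; this array only grows.
  const contents: any[] = [{ role: 'user', parts: [{ text: prompt }] }];

  while (true) {
    const resp = await ai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents,
      config: { tools }, // schemas re-sent (and re-billed) every time
    });

    // No tool calls left: the model wrote its final answer.
    if (!resp.functionCalls?.length) return resp.text;

    // Record the model's calls, run them, append the results, go again.
    contents.push({
      role: 'model',
      parts: resp.functionCalls.map((c) => ({ functionCall: c })),
    });
    const parts = await Promise.all(
      resp.functionCalls.map(async (c) => ({
        functionResponse: { name: c.name!, response: await dispatchers[c.name!](c.args) },
      })),
    );
    contents.push({ role: 'user', parts });
  }
}
Every pass through the loop re-sends everything the previous passes produced. That's the whole quadratic story.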
Three ways to stop the bleeding
You're not stuck. Here's what actually works.
1. Prompt caching
The biggest lever by far. Every major provider supports it now: OpenAI,
Anthropic, Google. The system prompt and tool schemas don't change between
turns, so cache them once and pay about 25% of the input cost on every reuse.
With @google/genai:
const cache = await ai.caches.create({
model: 'gemini-2.5-flash',
config: {
systemInstruction: 'You are a bookmark organizer...',
tools, // these never change across turns
},
});
const res = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: history,
config: { cachedContent: cache.name },
});
For my 4-turn flow this cuts input costs by roughly half. Anthropic and OpenAI
do the same thing with different syntax.
Gemini also has implicit caching. It auto-caches recent prefixes for you with
zero code changes, so repeat requests that share a prefix just come back cheaper.
Check whether your provider has it on before reinventing the wheel.
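A quick way to confirm it's kicking in is the usage metadata on a later turn; cachedContentTokenCount is the field Gemini reports cache hits under:
console.log(res.usageMetadata?.cachedContentTokenCount);
// non-zero → that many input tokens were billed at the cached rate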
2. Different model per turn
The "decide which tool to call" turn is dumb work. It barely needs reasoning.
It's pattern matching on a question. The final synthesis turn is where you
actually want a smart model.
// Cheap, fast: decides what to do
const decision = await ai.models.generateContent({
model: 'gemini-2.5-flash-lite',
// ...
});
// Smarter: writes the actual answer
const finalAnswer = await ai.models.generateContent({
model: 'gemini-2.5-pro',
// ...
});
In a 4-turn flow, three of the turns can run on the cheap model. Only the last
one, the user-facing answer, needs the expensive one. For high-volume agents
this saves more than caching does.
3. Parallel tool calls
The model can ask for multiple tools in a single response. Code I see in
tutorials usually does functionCalls[0] and silently drops the rest, turning
what could be one round trip into many.
The fix is one line of Promise.all:
const results = await Promise.all(
resp.functionCalls.map(async (c) => ({
name: c.name!,
response: await dispatchers[c.name!](c.args),
}))
);
For "summarize all my React bookmarks from last month," the model might call
searchBookmarks and getDateRange in parallel. Handle both, and you save a
whole round trip.
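Feeding the results back is one message with multiple functionResponse parts. A sketch, assuming history already holds the conversation up to and including the model's parallel calls:
const followUp = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: [
    ...history,
    {
      role: 'user',
      // One functionResponse part per tool call, all in a single message.
      parts: results.map((r) => ({
        functionResponse: { name: r.name, response: r.response },
      })),
    },
  ],
  config: { tools },
});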
What you can't optimize away
Tools have a real cost, and they buy you real value. The reason you reach for
them is the same reason they're expensive. You're forcing the model to use
facts that exist outside its head instead of making them up.
A plain LLM call will happily tell you the weather in Tokyo. It'll just be
wrong.
Quick way to think about it when picking an architecture:
- Plain LLM is a guess from training data. Cheap, fast, hallucinates.
- Tools / agent is real data. Expensive, slower, honest.
Most apps shouldn't be agents. If your task is "summarize this text I'm pasting
in" or "rewrite this email," you don't need tools. You need one call. A lot of
agent frameworks make it really easy to add tools by default, which makes it
really easy to spend 5x what you should.
Tools earn their cost when you have side effects (writing to a DB, sending a
message), grounded data (today's weather, this user's bookmarks, current docs),
or chained reasoning where intermediate steps actually need verification.
They don't earn it on anything you could solve with one good prompt.
The receipt
Last week I added one tool to a Gemini call and watched the cost go from 850
tokens to 1530 for the same question. Once I started parallelizing calls and
caching the system prompt, I got the bookmark agent down to about 4500 tokens
across all four turns. Still 2.5x a plain call, but way better than the 7900
the naive version was burning.
Your agent isn't a smarter LLM. It's the same LLM with a longer receipt. Once
you can read the receipt, every optimization becomes obvious.
If you like my content, a like and a share 💟 go a long way. Don't forget to follow me on Twitter/X and LinkedIn, and if you want to connect, check out my site. See you in the next one.
