I was building a small bookmark app last weekend. You send it a URL, Gemini
summarizes and tags the page, the result goes into Postgres. A few hundred lines
of TypeScript.
The first version cost almost nothing. One LLM call per URL, that's it. Then I
added "tools" so the model could fetch pages, look up similar bookmarks, or
check things against Google Search.
My token bill quadrupled.
That's where most people building agents land the first time. Going from a
plain chat call to an agent loop is way more expensive than the docs make it
sound, and the reason isn't obvious until you watch the round trips happen one by one.
Let's do that.
What a plain LLM call costs
Here's the simplest LLM call in TypeScript with @google/genai:
import { GoogleGenAI } from '@google/genai';
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });
const res = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: 'Summarize this article: ...',
});
console.log(res.text);
One request out, one response back. You pay for two things:
- Input tokens for your prompt
- Output tokens for the model's reply
That's it. Two numbers on your bill. If your prompt is 500 tokens and the answer
is 200, you pay for 700 tokens. Done.
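You can see both numbers on every response. A minimal check, assuming the usageMetadata field names from the current @google/genai response shape:
const usage = res.usageMetadata;
console.log({
  input: usage?.promptTokenCount,      // your prompt
  output: usage?.candidatesTokenCount, // the model's reply
  total: usage?.totalTokenCount,
});
Logging this per request is the cheapest observability you'll ever add, and it's how every number later in this post was measured.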
Now add a single tool
Tools are how the model talks to the outside world. Calling an API, querying a
database, fetching a URL, anything. You describe each tool with a small JSON
schema, and the model can ask to "call" one mid-conversation. You actually run
the function, send the result back, and the model writes its final answer using
that result.
The basic version:
import { GoogleGenAI, Type } from '@google/genai';
const tools = [{
functionDeclarations: [{
name: 'getWeather',
description: 'Get the weather of any city',
parameters: {
type: Type.OBJECT,
properties: {
location: { type: Type.STRING },
},
required: ['location'],
},
}],
}];
const first = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: 'What is the weather in Tokyo?',
config: { tools },
});
console.log(first.text);
// → undefined
undefined?
The model didn't answer. It returned a structured request:
first.functionCalls
// → [{ name: 'getWeather', args: { location: 'Tokyo' } }]
This is the part that surprises people. The model got asked a question, and
instead of answering, it asked you to run a function. So you do that and
ship the result back:
const result = getWeather('Tokyo'); // { temperature: 23, condition: 'sunny' }
const second = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: [
{ role: 'user', parts: [{ text: 'What is the weather in Tokyo?' }] },
{ role: 'model', parts: [{ functionCall: { name: 'getWeather', args: { location: 'Tokyo' } } }] },
{ role: 'user', parts: [{ functionResponse: { name: 'getWeather', response: result } }] },
],
config: { tools },
});
console.log(second.text);
// → "It's 23°C and sunny in Tokyo."
Two LLM calls. One question. That's the agent tax.
Why we can't just do it in one call
The first reaction (mine too): why can't the model just answer in one shot?
The reason is simple. The model can't predict what the tool will return. The
temperature in Tokyo isn't in its training data, the API hasn't been hit yet,
the result doesn't exist. You can't write "It's 23°C in Tokyo" before you know
it's 23°C.
So turn 1 is "decide what to do." Turn 2 is "use what you learned." They can't
be merged. The model has no memory between calls.
One exception is worth knowing about: server-side tools. Things like
googleSearch or urlContext in Gemini run inside Google's own servers, and
the API returns one merged response. From your side it looks like a single call.
You lose some control (you can't see exactly what got searched), but you save
a round trip.
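Enabling one looks roughly like this in @google/genai (a sketch; the grounded response shape varies by model and API version):
const grounded = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: 'What is the weather in Tokyo right now?',
  // googleSearch runs on Google's side; you get one merged response
  // instead of a functionCall you have to fulfill yourself.
  config: { tools: [{ googleSearch: {} }] },
});
console.log(grounded.text);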
Counting the actual tokens
Here's where the cost lives. Look at what turn 2 has to send compared to turn 1:
| | Turn 1 in | Turn 1 out | Turn 2 in | Turn 2 out |
|---|---|---|---|---|
| System prompt | yes | | yes, billed again | |
| Tool schemas | yes | | yes, billed again | |
| User question | yes | | yes, billed again | |
| Model's tool call | | yes | yes, as input | |
| Your tool result | | | yes | |
| Final answer | | | | yes |
Your system prompt and tool definitions get sent to the API twice. Turn 1
doesn't free you from re-sending everything in turn 2, because the model is
stateless. It forgets the whole conversation between calls.
Real numbers from my bookmark agent:
- System prompt: ~200 tokens
- 4 tool declarations: ~400 tokens
- User question: ~50 tokens
- Tool result (a few rows from Postgres): ~300 tokens
Plain LLM call: ~650 in + ~200 out = ~850 tokens
One-tool agent: ~1300 in + ~230 out = ~1530 tokens (about 1.8x)
And that's the best case. Exactly one tool call, no follow-ups. Real agents are
worse. A lot worse.
Real agents grow quadratically
The bookmark agent does three things on a new URL:
- Fetch the page (fetchUrl tool)
- Look for similar existing bookmarks in the DB (searchSimilar tool)
- Pick a category from the user's existing taxonomy (getTaxonomy tool)
That's 4 LLM turns total. Ask, get tool calls, send back results, ask again,
get more calls, send results, finally write the summary.
What the cumulative input size looks like each turn:
| Turn | What gets sent | Input tokens |
|---|---|---|
| 1 | system + schemas + URL | 700 |
| 2 | + previous calls + fetchUrl result (~1500 of page) | 2200 |
| 3 | + searchSimilar result | 2400 |
| 4 | + getTaxonomy result | 2600 |
Total input across all turns: about 7900 tokens to summarize one webpage.
For comparison, a plain generateContent({ contents: "summarize this:\n" + pageText })
costs ~1500 input + 200 output. About 1700 tokens.
Same task. Almost 5x the bill.
It gets worse. Cost grows quadratically with the number of turns, because
each turn replays everything that came before. A 10-turn agent isn't 10x the
cost. It's closer to 30x.
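A bare loop makes the mechanism obvious. This is a sketch, not the bookmark agent's real code; it assumes dispatchers is your own map from tool name to an async implementation:
async function runAgent(prompt: string) {
  // The full history gets replayed on every turn; this array only grows.
  const contents: any[] = [{ role: 'user', parts: [{ text: prompt }] }];

  while (true) {
    const resp = await ai.models.generateContent({
      model: 'gemini-2.5-flash',
      contents,
      config: { tools }, // schemas re-sent (and re-billed) every time
    });

    // No tool calls left: the model wrote its final answer.
    if (!resp.functionCalls?.length) return resp.text;

    // Record the model's calls, run them, append the results, go again.
    contents.push({
      role: 'model',
      parts: resp.functionCalls.map((c) => ({ functionCall: c })),
    });
    const parts = await Promise.all(
      resp.functionCalls.map(async (c) => ({
        functionResponse: { name: c.name!, response: await dispatchers[c.name!](c.args) },
      })),
    );
    contents.push({ role: 'user', parts });
  }
}
Every pass through the loop re-sends everything the previous passes produced. That's the whole quadratic story.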
Three ways to stop the bleeding
You're not stuck. Here's what actually works.
1. Prompt caching
The biggest lever by far. Every major provider supports it now: OpenAI,
Anthropic, Google. The system prompt and tool schemas don't change between
turns, so cache them once and pay about 25% of the input cost on every reuse.
With @google/genai:
const cache = await ai.caches.create({
model: 'gemini-2.5-flash',
config: {
systemInstruction: 'You are a bookmark organizer...',
tools, // these never change across turns
},
});
const res = await ai.models.generateContent({
model: 'gemini-2.5-flash',
contents: history,
config: { cachedContent: cache.name },
});
For my 4-turn flow this cuts input costs by roughly half. Anthropic and OpenAI
do the same thing with different syntax.
Gemini also has implicit caching. It auto-caches recent prefixes for you with
zero code changes, so repeat requests that share a prefix just come back cheaper.
Check whether your provider has it on before reinventing the wheel.
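A quick way to confirm it's kicking in is the usage metadata on a later turn; cachedContentTokenCount is the field Gemini reports cache hits under:
console.log(res.usageMetadata?.cachedContentTokenCount);
// non-zero → that many input tokens were billed at the cached rate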
2. Different model per turn
The "decide which tool to call" turn is dumb work. It barely needs reasoning.
It's pattern matching on a question. The final synthesis turn is where you
actually want a smart model.
// Cheap, fast: decides what to do
const decision = await ai.models.generateContent({
model: 'gemini-2.5-flash-lite',
// ...
});
// Smarter: writes the actual answer
const finalAnswer = await ai.models.generateContent({
model: 'gemini-2.5-pro',
// ...
});
In a 4-turn flow, three of the turns can run on the cheap model. Only the last
one, the user-facing answer, needs the expensive one. For high-volume agents
this saves more than caching does.
3. Parallel tool calls
The model can ask for multiple tools in a single response. Code I see in
tutorials usually does functionCalls[0] and silently drops the rest, turning
what could be one round trip into many.
The fix is one line of Promise.all:
const results = await Promise.all(
resp.functionCalls.map(async (c) => ({
name: c.name!,
response: await dispatchers[c.name!](c.args),
}))
);
For "summarize all my React bookmarks from last month," the model might call
searchBookmarks and getDateRange in parallel. Handle both, and you save a
whole round trip.
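Feeding the results back is one message with multiple functionResponse parts. A sketch, assuming history already holds the conversation up to and including the model's parallel calls:
const followUp = await ai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: [
    ...history,
    {
      role: 'user',
      // One functionResponse part per tool call, all in a single message.
      parts: results.map((r) => ({
        functionResponse: { name: r.name, response: r.response },
      })),
    },
  ],
  config: { tools },
});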
What you can't optimize away
Tools have a real cost, and they buy you real value. The reason you reach for
them is the same reason they're expensive. You're forcing the model to use
facts that exist outside its head instead of making them up.
A plain LLM call will happily tell you the weather in Tokyo. It'll just be
wrong.
Quick way to think about it when picking an architecture:
- Plain LLM is a guess from training data. Cheap, fast, hallucinates.
- Tools / agent is real data. Expensive, slower, honest.
Most apps shouldn't be agents. If your task is "summarize this text I'm pasting
in" or "rewrite this email," you don't need tools. You need one call. A lot of
agent frameworks make it really easy to add tools by default, which makes it
really easy to spend 5x what you should.
Tools earn their cost when you have side effects (writing to a DB, sending a
message), grounded data (today's weather, this user's bookmarks, current docs),
or chained reasoning where intermediate steps actually need verification.
They don't earn it on anything you could solve with one good prompt.
The receipt
Last week I added one tool to a Gemini call and watched the cost go from 850
tokens to 1530 for the same question. Once I started parallelizing calls and
caching the system prompt, I got the bookmark agent down to about 4500 tokens
across all four turns. Still 2.5x a plain call, but way better than the 7900
the naive version was burning.
Your agent isn't a smarter LLM. It's the same LLM with a longer receipt. Once
you can read the receipt, every optimization becomes obvious.
If you like my content, a like and a share 💟 go a long way. Don't forget to follow me on Twitter/X and LinkedIn, and if you want to connect, check out my site. See you in the next one.
