
The Most Expensive Waste in the Agent Era: GPUs Waiting on CPUs

Recently I've been using Agents to run AI Infra experiments. When I finally tallied the numbers, across more than seven hundred rounds of experiments on real hardware, just waiting for environments to spin up had consumed thirty-five hours.

This made me start questioning a common assumption: with Agents being deployed at scale today, where exactly is the bottleneck?

At First I Thought the Model Was Slow

When GPT-5.4 first came out, it felt sluggish to me. At the time, fast mode cost double the quota for 1.5x throughput. I turned it on for a while; it felt a bit faster, but nowhere near 1.5x. On paper, a slight loss. But OpenAI was running promotions back then and handing out credits generously, so I didn't think too much about it.

After 5.5 launched, fast mode became even more expensive: 2.5x quota for 1.5x speed. I tried it for a few more days and concluded it was almost completely useless: credits burned through at breakneck speed, while the perceived speed improvement was essentially zero.

I thought about why. 5.5 was already far more precise than 5.4 in its operations, cutting out most of the wasted motion and getting more done per unit of time. Spending more than double the money to squeeze out a bit more token-generation speed simply couldn't produce a noticeable difference.

What surprised me even more was something else: after monitoring my own workflow for a few days, I realized the model wasn't generating most of the time—it was waiting.

Waiting for what?

Seven Hundred Experiments, Thirty-Five Hours Devoured by Environment Setup

My area is AI inference Infra, so a typical experiment round looks like this: set up the environment, run the data, collect results, write logs.

The data-running phase is mainly about GPU performance; that time is dictated by the hardware, with little room for optimization. Only after staring at logs for a few days did I realize that the hard, blocked time in every round was in environment startup: installing dependencies, configuring CUDA, starting services, compiling. That segment averaged three minutes per round.

Seven hundred rounds times three minutes is over two thousand minutes. Roughly thirty-five hours.

And my end-to-end time per round—designing the experiment, writing code, running data, doing analysis, writing notes—averaged only five to six minutes total. That means a substantial portion of the total time was the model sitting there idle, waiting for the CPU to finish setting up the environment.

Reading this data felt a bit surreal. The so-called intelligent agent doing experiments sounds very sophisticated, yet for half that time it was literally just sitting there doing nothing.
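
If you want the same breakdown for your own workflow, per-phase timing is cheap to add. A minimal sketch (the phase names and the one-line-per-round log format here are just illustrative):

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def phase(name, record):
    """Time one phase of an experiment round and store it in `record`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record[name] = round(time.perf_counter() - start, 1)

timings = {}
with phase("env_setup", timings):
    ...  # install dependencies, configure CUDA, start services, compile
with phase("run_data", timings):
    ...  # the GPU-bound part
with phase("analyze", timings):
    ...  # collect results, write logs

print(json.dumps(timings))  # one JSON line per round; grep and sum later
```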

Not Unique to Experimental Scenarios

Put this into everyday work scenarios, and the problem becomes even more glaring.

Take a simple example: you ask an Agent to do something in PowerPoint. The model generates the tool-calling instruction in a few seconds, blazing fast. Then what? It waits for PowerPoint to launch. If the machine is a bit older or PowerPoint is cold-starting, just opening it takes a minute. Once it's open, issuing instructions, drawing, and adjusting positions take another 30 seconds.

And almost none of this can be parallelized. Every step waits on the previous step's result; if that result isn't good enough, it has to be revised before the next step can proceed. Every interaction is a serial dependency. On the surface the Agent is automating work; in reality it is constantly blocking and waiting.

I did the math. On the model side, reading the feedback you provide: at a prefill speed of 2,000 tokens/s, roughly 10,000 Chinese characters take five or six seconds to read. Outputting instructions at today's common rate of 30 tokens/s, 1K tokens takes about 35 seconds. Reading plus writing comes to around 40 seconds per round.

But the tool side also takes 30 to 40 seconds. The two segments are roughly equal. After the model finishes generating, it just sits there waiting. It waits for the result to come back, then analyzes the next step.
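
Spelled out as arithmetic (the rates are just the assumptions above, nothing more):

```python
# Back-of-envelope round time, using the rates above as assumptions.
prefill_tps = 2_000       # tokens/s reading context
decode_tps = 30           # tokens/s generating instructions
context_tokens = 10_000   # ~10k Chinese characters of feedback
output_tokens = 1_000     # one round's worth of instructions

model_s = context_tokens / prefill_tps + output_tokens / decode_tps
tool_s = 35               # midpoint of the 30-40s tool-side wait

print(f"model {model_s:.0f}s vs tool {tool_s}s per round")  # model 38s vs tool 35s
```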

This is the real rhythm of current Agent workflows.

No Matter How Fast the Model Is, a Slow CPU Means Waste

So let's assume we push model inference speed from 30 tokens/s to 100 tokens/s.

Sounds great: triple the speed. But the cost isn't just triple. You need to stack more GPUs, use higher-frequency cards, and run more aggressive parallelization strategies; total cost goes up far more than threefold.

What's the payoff? That 35-second model time shrinks to 10 seconds. A round goes from 40 seconds to 15 seconds.

But the tool side is still 30 seconds. The whole round becomes execution 15 seconds + waiting 30 seconds = 45 seconds.

Note this ratio: for more than 60% of the time, the GPU is idle. You spent a fortune to triple GPU speed, and two-thirds of the time it's just waiting on the CPU.
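
The same arithmetic with decode pushed to 100 tokens/s makes the idle fraction explicit:

```python
# Decode at 100 tokens/s instead of 30; everything else unchanged.
model_s = 10_000 / 2_000 + 1_000 / 100   # prefill 5s + decode 10s = 15s
tool_s = 30                              # the tool side didn't get any faster
round_s = model_s + tool_s               # 45s per round

print(f"GPU idle {tool_s / round_s:.0%} of every {round_s:.0f}s round")
# GPU idle 67% of every 45s round
```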

This waste is greatly amplified in Agent scenarios. Because a workflow doesn't call just one tool. An Agent might open PowerPoint, then open Word to grab something, then open a browser to search for an image, then come back to edit. Every step is cold start plus execution plus waiting. Strung together, it's all CPU-intensive and IO-intensive work.

This Year, the Hardware World Is Working on the Same Thing

At first I thought this was just my own subjective feeling. Later I realized the entire hardware world has been revolving around this issue all year.

Intel has publicly stated that the server CPU:GPU ratio has already tightened from 1:8 to 1:4, and will eventually reach 1:1. Nvidia directly called out the CPU as the bottleneck for agentic workloads at GTC this March, then followed up at CES 2026 by launching the Vera CPU, with 88 Olympus cores and 1.2 TB/s of bandwidth, purpose-built for agentic orchestration and tool scheduling. Arm is entering the server CPU space itself for the first time. And in that massive OpenAI–Nvidia deal, it was written plainly: hundreds of thousands of GPUs, plus tens of millions of CPUs.

Server CPU prices have already risen 20% since March; analysts expect another 8 to 10 points in the second half. Intel is shifting capacity from consumer CPUs to Xeon.

Why is everyone moving in sync? Nvidia laid out the workflow clearly in its own technical blog: take the task, pull context, call the model, parse the output, decide the next step, call tools, wait on IO, process results, assemble the next round's prompt, call the model again. Apart from the two moments when the model is called, which are GPU jobs, almost everything else runs on the CPU.
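
Written as a loop (the step names and helper functions here are mine, not Nvidia's):

```python
# Skeleton of one agentic workflow; every helper here is a stand-in.
def pull_context(task): return task                    # CPU/IO: retrieval, file reads
def parse_output(plan): return None                    # CPU: extract the tool call, if any
def assemble_prompt(ctx, action, result): return ctx   # CPU: build the next round's input

def agent_loop(task, model, tools):
    context = pull_context(task)
    while True:
        plan = model.generate(context)   # GPU: one of the two model calls
        action = parse_output(plan)      # CPU
        if action is None:
            return plan                  # no tool call left: final answer
        result = tools.call(action)      # CPU/IO: cold start, execute, wait
        context = assemble_prompt(context, action, result)  # CPU
```

Everything except the `model.generate` line leaves the GPU idle.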

Looked at Another Way, AI Has Exploded Demand for Traditional Compute

We used to think that with the rise of AI, demand for underlying infrastructure would be reshaped around it: GPUs, HBM, and interconnects would become more valuable, while traditional components like CPUs would fade into the background.

Now it looks completely inverted. AI isn't replacing traditional compute; it's exploding demand for traditional compute. The more complex the Agent, the more tool calls, the deeper the serial dependencies—the higher the requirements for CPU, memory bandwidth, and IO.

Underneath it is a simple cost logic. Previously, if PowerPoint took an extra minute to start, a human waited, and the loss was just that person's annoyance. Now, if an Agent is waiting, the GPU behind it is burning money every second; a minute of waiting costs dozens of times more than a human's minute.

Whoever can minimize GPU waiting time wins on end-to-end cost-performance. This will eventually propagate to data center topology, scheduling strategies, inference service design philosophy, and even how every company selects hardware and signs contracts. This propagation chain has only just begun.

For Those Working on Infra and Agent Engineering

A few things to think about ahead of time.

Don't equate fast models with fast end-to-end. A workflow's real bottleneck may not be the model at all, but the tool cold start you've been ignoring. Next time someone tells you "our model does X TPS," ask them back: how long is your tool chain's average cold start?

Don't cut budgets for CPU, memory, and IO. Run real workflows for a while and you'll know: it's not the CPU that's idle, it's the GPU. Before buying GPUs, take a look at whether the CPU is already getting crushed.

Take a look back at your tool chain's cold start times. Pre-warm what can be pre-warmed, reuse what can be reused. Agent workflows are mostly serial, but the startup step almost always has room for optimization.
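
For the pre-warming point, a minimal warm-pool sketch, assuming plain Docker and an image (here called `exp-env`, hypothetical) with dependencies and CUDA config already baked in:

```python
import subprocess
from queue import Queue

IMAGE = "exp-env"   # hypothetical image with deps + CUDA baked in at build time
POOL_SIZE = 4

def start_container() -> str:
    """Start an idle container that sleeps until a round leases it."""
    out = subprocess.run(
        ["docker", "run", "-d", "--gpus", "all", IMAGE, "sleep", "infinity"],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()   # container id

pool = Queue()
for _ in range(POOL_SIZE):      # pay the cold start once, up front
    pool.put(start_container())

def run_round(cmd: str) -> None:
    """Lease a warm container, run one experiment round, refill the pool."""
    cid = pool.get()
    try:
        subprocess.run(["docker", "exec", cid, "sh", "-c", cmd], check=True)
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=True)
        pool.put(start_container())   # refill; do this asynchronously in real use
```

With an image like that, the per-round setup collapses to roughly the cost of a `docker exec`, provided refills keep pace with rounds.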

I'll be focusing heavily on this myself over the coming period. First, I'll reclaim those thirty-five hours.


Original post: https://guanjiawei.ai/en/blog/agent-cpu-bottleneck

Top comments (1)

Max Quimby

The 700-round measurement is the kind of unglamorous data we need more of in this space — thank you for actually instrumenting it. The CPU:GPU ratio shift is real, but I'd argue most teams will hit a software ceiling well before they hit a hardware one. Three patterns that bought us back a surprising amount of GPU utilization without changing nodes:

  1. Speculative tool execution — when the agent's next tool call is predictable from the partial decode (file reads, well-formed search queries), kick it off before the model finishes generating (sketched after this list). Roughly 15–20% wall-clock win for read-heavy agents.
  2. Concurrent agents on a shared serving pool — one agent's CPU phase is another agent's GPU phase. A batch scheduler that interleaves N agents on M GPUs hides the gap better than tightening the CPU:GPU ratio per node.
  3. Pre-warmed sandboxes — your "35 hours of env setup" line jumped out. Container start dominates if you're spinning up per session. A warm pool with copy-on-write filesystems cut our setup tail from ~12s to ~400ms.
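
Pattern 1 in sketch form (simplified; assumes `model.stream` is an async token stream, and `predict_tool_call` / `parse_tool_call` are whatever heuristic and parser you already have):

```python
import asyncio

async def speculative_round(model, tools, prompt):
    """Overlap a predicted tool call with the tail of the model's decode."""
    partial, guess, spec = "", None, None
    async for token in model.stream(prompt):      # GPU: still decoding
        partial += token
        if spec is None:
            guess = predict_tool_call(partial)    # cheap heuristic on the partial decode
            if guess is not None:                 # confident enough: start the tool now
                spec = asyncio.create_task(tools.call(guess))
    actual = parse_tool_call(partial)             # the call the model actually made
    if spec is not None and guess == actual:
        return await spec                         # hit: the result is already (nearly) in
    if spec is not None:
        spec.cancel()                             # miss: discard the speculative work
    return await tools.call(actual)               # fall back to the serial path
```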

The Vera CPU story is interesting but I suspect Nvidia is partly responding to scheduling problems that better orchestration software could solve first. Curious what fraction of your 35h was actually CPU-bound vs IO-bound on cold caches.