We used to worry about training costs. Now the bill for checking if the model works is becoming the line item that kills budgets.
The Holistic Agent Leaderboard recently spent $40,000 to run 21,730 agent rollouts across nine models and nine benchmarks. A single GAIA run on a frontier model can hit $2,829 before you even think about caching. Exgentic's sweep across agent configurations found a 33x cost spread on identical tasks.
Static benchmarks could be compressed: Flash-HELM showed that a 100-200x compute reduction preserved model rankings. Agent benchmarks broke that assumption. When your evaluation is a multi-turn rollout with tool calls and stateful interaction, each item is the expensive object.
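To make the contrast concrete, here is a minimal sketch of the subsample-and-check idea behind benchmark compression. It is not Flash-HELM's actual procedure, just the generic version: score each model on a small random subset of items and see how often the full-benchmark ranking survives. For static benchmarks this is cheap because each item is cheap; for agent benchmarks, every sampled item still costs a full multi-turn rollout.

```python
import random

# scores maps model name -> {item_id: score}, e.g. {"model-a": {0: 1.0, 1: 0.0, ...}, ...}
def ranking(scores, item_ids):
    """Rank models by mean score over the given benchmark items, best first."""
    means = {m: sum(per_item[i] for i in item_ids) / len(item_ids)
             for m, per_item in scores.items()}
    return sorted(means, key=means.get, reverse=True)

def rank_preservation_rate(scores, frac=0.01, trials=200, seed=0):
    """Fraction of random item subsamples whose model ranking matches the full benchmark."""
    rng = random.Random(seed)
    items = list(next(iter(scores.values())))
    full = ranking(scores, items)
    k = max(1, int(len(items) * frac))
    hits = sum(ranking(scores, rng.sample(items, k)) == full for _ in range(trials))
    return hits / trials
```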
On HAL's Online Mind2Web benchmark, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. That is nine times the cost for two percentage points less accuracy.
Agent benchmarks measure a model x scaffold x token-budget product. CLEAR found that accuracy-optimal configurations cost 4.4 to 10.8x more than Pareto-efficient alternatives. The best result on a leaderboard is often just the most expensive configuration someone was willing to pay for.
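Here is what Pareto-efficient means in this setting, as a rough sketch: a configuration stays on the frontier only if no other configuration is both cheaper and at least as accurate. The first two rows reuse the Mind2Web numbers above; the third is a made-up cheap baseline for illustration.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    cost_usd: float   # total benchmark spend for this configuration
    accuracy: float   # fraction of tasks solved

def pareto_frontier(configs):
    """Keep configurations no other configuration strictly dominates on cost and accuracy."""
    frontier = []
    for c in configs:
        dominated = any(
            o.cost_usd <= c.cost_usd and o.accuracy >= c.accuracy
            and (o.cost_usd < c.cost_usd or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost_usd)

configs = [
    Config("Browser-Use + Claude Sonnet 4", 1577.0, 0.40),
    Config("SeeAct + GPT-5 Medium", 171.0, 0.42),
    Config("hypothetical cheap baseline", 35.0, 0.31),   # invented for illustration
]

for c in pareto_frontier(configs):
    print(f"{c.name}: ${c.cost_usd:,.0f} at {c.accuracy:.0%}")
```

The $1,577 configuration never prints: it is dominated by one that is both cheaper and more accurate, which is exactly the pattern CLEAR is pointing at.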
The democratization narrative in AI has always been fragile. Open weights helped. Open datasets helped. But open evaluation is becoming a luxury good.
A grad student can download Llama 4 and fine-tune it on a single GPU. They cannot reproduce the HAL leaderboard without institutional backing. The verification layer of the scientific process is being priced out of reach.
What we need is transparency about costs alongside scores. A leaderboard that shows dollars per point of accuracy. Until then, evaluation will continue its drift from quality control to capital allocation. And the people best positioned to know which models actually work will be the ones with the deepest pockets, not the sharpest insights.
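A minimal sketch of the column such a leaderboard could add, using the Mind2Web numbers from earlier. The metric and layout are my own; this is not anything HAL publishes today.

```python
def dollars_per_point(cost_usd, accuracy_pct):
    """Total evaluation spend divided by accuracy in percentage points."""
    return cost_usd / accuracy_pct

# The Online Mind2Web numbers from earlier in the post.
rows = [
    ("Browser-Use + Claude Sonnet 4", 1577.0, 40.0),
    ("SeeAct + GPT-5 Medium", 171.0, 42.0),
]

for name, cost, acc in sorted(rows, key=lambda r: dollars_per_point(r[1], r[2])):
    print(f"{name}: ${dollars_per_point(cost, acc):.2f} per point of accuracy")
```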
Top comments (1)
The $40k for 21,730 rollouts number is brutal. The cost structure of eval is getting attention, but the harder problem is that cheaper proxies are often wrong. You can approximate a $2 LLM-judge run with a $0.001 heuristic, but you'll miss the cases that matter. How are teams handling this trade-off in practice?