We Tested 10 Untested LLMs on Agent Coding — The Results Are In

Yesterday I promised to benchmark 10 LLMs that have never been tested on real agent coding tasks. I ran all 10 overnight. Some surprised me. Some embarrassed themselves.

The board

10 models. 10 tasks each. Tasks are real agent work: parse JSON, write regex, fix a bug, query SQL, handle errors. Full pass requires correct, working code.
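
The harness itself isn't in this post, but to make "full pass requires correct, working code" concrete, here's a minimal sketch of how one task (the JSON parsing one) could be graded. The prompt, the scoring cutoffs, and the helper names are my own illustration, not the actual benchmark code.

```python
# Minimal sketch of grading one task -- illustrative only, not the real harness.
# The prompt, rubric, and function names here are assumptions.
TASK_PROMPT = (
    "Write a Python function parse_users(raw) that returns the list of "
    "user names from a JSON string."
)

def grade_parse_json(model_code: str) -> float:
    """1.0 = full pass, 0.5 = partial (runs but wrong output), 0.0 = fail."""
    namespace = {}
    try:
        exec(model_code, namespace)  # load the model-written function
        result = namespace["parse_users"](
            '{"users": [{"name": "Ada"}, {"name": "Linus"}]}'
        )
    except Exception:
        return 0.0                   # doesn't run at all -> fail
    return 1.0 if result == ["Ada", "Linus"] else 0.5
```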

| Model | Score | Pass / Partial / Fail | Cost/task |
|---|---|---|---|
| Grok 4.20 | 75.0% | 6 / 3 / 1 | $0.0003 |
| Grok 4.1 Fast | 74.9% | 6 / 2 / 2 | $0.0009 |
| Xiaomi MiMo V2.5 Pro | 68.2% | 7 / 0 / 3 | $0.001 |
| Ring 2.6 (free) | 65.0% | 6 / 1 / 3 | free |
| DeepSeek V4 Flash | 60.0% | 4 / 3 / 3 | $0.0001 |
| GPT-5.4 Pro | 51.6% | 5 / 1 / 4 | $0.06 |
| GPT-5.5 Pro | 43.3% | 4 / 1 / 5 | $0.065 |
| DeepSeek V4 Pro | 38.3% | 4 / 0 / 6 | $0.001 |
| Google Lyria 3 Pro | 8.3% | 1 / 0 / 9 | free (preview) |
| Google Lyria 3 Clip | 0.0% | 0 / 0 / 10 | free (preview) |

Total cost: $1.37 for the entire run.

What jumped out

Grok 4.20 won, though only by a hair (75.0% vs 74.9%), and it's the fastest by far: 14.5 seconds for all 10 tasks. Grok 4.1 Fast scored nearly the same but took 225 seconds. Same family, wildly different speed profiles.

The "Pro" suffix is a trap. GPT-5.4 Pro scored 51.6%. Regular GPT-5.4 scored 76.6% on the same tasks. GPT-5.5 Pro scored 43.3%. Regular GPT-5.5 scored 60%. The Pro variants are slower, more expensive, and worse at this specific workload. If you're building agents, the base models are better.

DeepSeek V4 Flash beat DeepSeek V4 Pro — 60% vs 38%. Flash is also cheaper. For agent coding, smaller/faster beats bigger/slower again.

Ring 2.6 is free and beats paid models. Six passes, one partial, $0.00. It outperforms both GPT Pro variants, DeepSeek V4 Pro, and both Lyria previews.

Google Lyria 3 is not ready. Clip failed every single task with 502 errors. Pro barely scored. Both are marked "preview" on OpenRouter. Fair enough — but worth knowing before you build on them.
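
If you still want to experiment with preview models, it's worth wrapping calls in a retry that treats 5xx responses as transient. A minimal sketch against OpenRouter's chat completions endpoint (the model slug is illustrative, not a verified identifier):

```python
# Sketch: defensive retry around a preview model on OpenRouter.
# The model slug below is illustrative; check OpenRouter's catalog for the real one.
import time
import requests

def call_with_retry(prompt: str, model: str = "google/lyria-3-clip", retries: int = 3) -> str | None:
    for attempt in range(retries):
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        if resp.status_code == 200:
            return resp.json()["choices"][0]["message"]["content"]
        if resp.status_code >= 500:      # 502s from preview models land here
            time.sleep(2 ** attempt)     # back off and try again
            continue
        resp.raise_for_status()          # 4xx: don't retry, surface the error
    return None                          # give up after `retries` server errors
```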

Raw scores with context

For comparison, here's where these new models land against the existing leaderboard:

  • Claude Sonnet 4 — 85.0%
  • Mistral Large 3 — 79.6%
  • Gemma 4 31B — 78.3%
  • Gemma 4 26B A4B — 78.3%
  • Qwen 3.6 Plus — 76.6%
  • GPT-5.4 — 76.6%
  • Gemini 2.5 Flash — 76.4%
  • Kimi K2.6 — 75.0%
  • Grok 4.20 — 75.0% ← new
  • Grok 4.1 Fast — 74.9% ← new
  • MiniMax M2.7 — 69.9%
  • Xiaomi MiMo V2.5 Pro — 68.2% ← new
  • Ring 2.6 — 65.0% ← new (free)
  • GPT-5.5 — 60.0%
  • DeepSeek V4 Flash — 60.0% ← new
  • GPT-5.4 Pro — 51.6% ← new
  • GPT-5.5 Pro — 43.3% ← new
  • DeepSeek V4 Pro — 38.3% ← new
  • Lyria 3 Pro — 8.3% ← new
  • Lyria 3 Clip — 0.0% ← new

What this means

If I were building an agent today and had to pick a model:

For reliability: Claude Sonnet 4 (85%) or Mistral Large 3 (79.6%). These aren't new — they've been at the top since the first benchmark.

For speed at good quality: Grok 4.20. 75% score in 14.5 seconds. That's under 2 seconds per task.

For free: Ring 2.6 if you qualify for OpenRouter's free tier. 65% at $0 is hard to beat.

What to avoid: the "Pro" suffix on GPT models, Google's Lyria previews, and DeepSeek V4 Pro, since Flash is both cheaper and better.
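
If you want those picks as code, here's a minimal routing sketch. The model slugs are illustrative placeholders; swap in the real identifiers from your provider.

```python
# Sketch: route each request to a model based on what you care about.
# Slugs are illustrative OpenRouter-style names, not verified identifiers;
# the scores in the comments come from the tables above.
MODELS = {
    "reliability": "anthropic/claude-sonnet-4",  # 85.0% on this benchmark
    "speed": "x-ai/grok-4.20",                   # 75.0%, roughly 1.5 s per task
    "free": "ring/ring-2.6",                     # 65.0% at $0
}

def pick_model(priority: str = "reliability") -> str:
    """Fall back to the reliability pick if the priority is unknown."""
    return MODELS.get(priority, MODELS["reliability"])
```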

All results are live at workswithagents.dev/benchmarks — updated daily. Full interactive dashboard with local models at benchmarks.workswithagents.dev.

One thing I'm watching

The Pro variants of GPT-5.4 and GPT-5.5 should theoretically be better. They're not. This might mean OpenAI optimized these for something other than quick-turn agent coding. Or it might mean the base models are just better tuned. Either way — don't assume Pro means better. Test it.
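
Testing it doesn't need a big harness. A minimal sketch, assuming you already have some way to run a task against a model and score it:

```python
# Sketch: a tiny A/B check before assuming a "Pro" variant beats the base model.
# run_task() is a placeholder for whatever harness you already use; it should
# return 1.0 for a pass, 0.5 for a partial, 0.0 for a fail.
def compare(models: list[str], tasks: list[str], run_task) -> dict[str, float]:
    """Average score per model over your own task set."""
    return {
        model: sum(run_task(model, task) for task in tasks) / len(tasks)
        for model in models
    }

# Example (model slugs are illustrative):
# compare(["openai/gpt-5.4", "openai/gpt-5.4-pro"], my_tasks, run_task)
```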
