
Vilius

Benchmark Results: SmolLM3 3B, Phi-4-mini, DeepSeek V4, Grok 4.20 — Agent Coding Tested

The second round of the Works With Agents agent coding benchmark is in — 32 models tested this time, up from 10. And the results are not what anyone expected.

The headline: tiny models won

| Rank | Model | Score |
| --- | --- | --- |
| 🥇 | SmolLM3 3B | 93.3 |
| 🥈 | Phi-4-mini | 90.0 |
| 🥉 | Claude Sonnet 4 | 85.0 |
| 4 | Qwen2.5 1.5B | 85.0 |
| 5 | Qwen2.5 3B | 85.0 |
| 6 | Granite 3.2 2B | 82.5 |
| 7 | Ministral 3B | 81.7 |
| 8 | Mistral Large 3 | 79.6 |
| 9 | Gemma 4 31B | 78.3 |
| 10 | Gemma 4 26B A4B | 78.3 |

A 3-billion-parameter model from Hugging Face scored 93.3, a full 8.3 points ahead of Claude Sonnet 4. Phi-4-mini, another tiny model, took second at 90.0. Qwen2.5's 1.5B and 3B variants tied Claude at 85.0.

Frontier model results

| Model | Score |
| --- | --- |
| Claude Sonnet 4 | 85.0 |
| GPT-5.4 | 76.6 |
| Gemini 2.5 Flash | 76.4 |
| Kimi K2.6 | 75.0 |
| Grok 4.20 | 75.0 |
| MiniMax M2.7 | 69.9 |
| DeepSeek V4 Flash | 60.0 |
| GPT-5.5 | 60.0 |
| GPT-5.4 Pro | 51.6 |
| GPT-5.5 Pro | 43.3 |
| DeepSeek V4 Pro | 38.3 |

Grok 4.20 debuted at 75.0, tied with Kimi K2.6 and just ahead of its Fast sibling (74.9, not shown in the table above). DeepSeek V4 Pro scored 38.3, well below its Flash variant's 60.0. The Pro tiers of GPT-5.5 and GPT-5.4 both underperformed their base models substantially (43.3 vs. 60.0, and 51.6 vs. 76.6).

What the benchmark tests

The benchmark evaluates real agent coding over 12 rounds:

  • Multi-file edits (Python, shell scripts)
  • Git operations (clone, branch, commit)
  • Shell command execution
  • Bash scripting with pipes and redirects
  • Recovering from errors (sketched below)
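To make those categories concrete, here is a minimal sketch of the kind of retry-on-failure tool loop the error-recovery rounds exercise. This is not the benchmark's actual harness; `run_tool`, `attempt_with_recovery`, and the toy command list are all hypothetical:

```python
import subprocess

def run_tool(cmd: str) -> tuple[int, str]:
    # Execute one shell command, capturing exit code and output.
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.returncode, result.stdout + result.stderr

def attempt_with_recovery(commands: list[str], max_retries: int = 2) -> bool:
    # Run a command sequence in order; retry each failing step a few
    # times before giving up on the whole task.
    for cmd in commands:
        for attempt in range(max_retries + 1):
            code, output = run_tool(cmd)
            if code == 0:
                break
            print(f"step failed (attempt {attempt + 1}): {cmd}\n{output}")
        else:
            return False  # retries exhausted on this step
    return True

# A toy round mixing git operations, shell execution, and a redirect.
ok = attempt_with_recovery([
    "git init demo",
    "git -C demo checkout -b feature",
    "echo 'print(\"hello\")' > demo/main.py",
])
```

A model that needs the retries can still complete the round, but every failed attempt costs it points under the scoring described next.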

Score = weighted average of correctness (70%) and efficiency (30%). Models lose points for failed tool calls, wrong commands, and unnecessary steps.
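In plain arithmetic, that weighting works out as follows (the function and the 0-100 normalization are my assumptions, not the published formula):

```python
def benchmark_score(correctness: float, efficiency: float) -> float:
    # Stated 70/30 split; both inputs assumed to be on a 0-100 scale.
    return 0.7 * correctness + 0.3 * efficiency

# Even perfect correctness can't carry a sloppy, inefficient run:
print(benchmark_score(100.0, 50.0))  # 85.0
```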

The bottom of the table

| Model | Score |
| --- | --- |
| DeepSeek-R1 1.5B | 27.5 |
| Qwen3.5 0.8B | 26.0 |
| Google Lyria 3 Pro | 8.3 |
| Google Lyria 3 Clip | 0.0 |

The smallest models (sub-2B reasoning models) couldn't complete basic tool sequences. Google's Lyria models in particular struggled — Lyria 3 Clip scored zero, unable to produce any working output.

What this means

Small models are getting dangerously good at agentic coding. SmolLM3 3B — a model you can run on a laptop — outperformed every frontier model by a wide margin. The benchmark suggests model size isn't the bottleneck for agent coding ability.

Full results and methodology: benchmarks.workswithagents.dev

The benchmark runs continuously — new models are added as they become available. If you're building a model that should be tested, the API is open.
