The second round of the Works With Agents agent coding benchmark is in — 32 models tested this time, up from 10. And the results are not what anyone expected.
The headline: tiny models won
| Rank | Model | Score |
|---|---|---|
| 🥇 | SmolLM3 3B | 93.3 |
| 🥈 | Phi-4-mini | 90.0 |
| 🥉 | Claude Sonnet 4 | 85.0 |
| 4 | Qwen2.5 1.5B | 85.0 |
| 5 | Qwen2.5 3B | 85.0 |
| 6 | Granite 3.2 2B | 82.5 |
| 7 | Ministral 3B | 81.7 |
| 8 | Mistral Large 3 | 79.6 |
| 9 | Gemma 4 31B | 78.3 |
| 10 | Gemma 4 26B A4B | 78.3 |
A 3-billion-parameter model from Hugging Face scored 93.3, a full 8.3 points ahead of Claude Sonnet 4. Phi-4-mini (also a tiny model) took second at 90.0. Qwen2.5's 1.5B and 3B variants tied Claude at 85.0.
Frontier model results
| Model | Score |
|---|---|
| Claude Sonnet 4 | 85.0 |
| GPT-5.4 | 76.6 |
| Gemini 2.5 Flash | 76.4 |
| Kimi K2.6 | 75.0 |
| Grok 4.20 | 75.0 |
| MiniMax M2.7 | 69.9 |
| DeepSeek V4 Flash | 60.0 |
| GPT-5.5 | 60.0 |
| GPT-5.4 Pro | 51.6 |
| GPT-5.5 Pro | 43.3 |
| DeepSeek V4 Pro | 38.3 |
Grok 4.20 debuted at 75.0, tied with Kimi K2.6 and narrowly ahead of its Fast sibling (74.9). DeepSeek V4 Pro scored 38.3, well below its Flash variant. GPT-5.5 Pro and GPT-5.4 Pro both underperformed their base models substantially.
What the benchmark tests
The benchmark evaluates real agent coding over 12 rounds (a sketch of one such round follows the list):
- Multi-file edits (Python, shell scripts)
- Git operations (clone, branch, commit)
- Shell command execution
- Bash scripting with pipes and redirects
- Error recovery
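To make that list concrete, here is a rough sketch of the kind of tool-call sequence a single round might require. It is illustrative only; the repository URL, branch name, and file paths are hypothetical, not taken from the benchmark harness.

```python
import subprocess

def run(cmd: str) -> str:
    """Run one shell command; a non-zero exit raises (failed tool calls cost points)."""
    result = subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)
    return result.stdout

# Hypothetical round: clone a repo, branch, edit a file, verify with a pipe, commit.
run("git clone https://example.com/sample-repo.git workdir")
run("git -C workdir checkout -b fix/logging")

# Multi-file edit step: append a setting to a Python module (file name is invented).
with open("workdir/app/logger.py", "a") as f:
    f.write("\nLOG_LEVEL = 'INFO'\n")

# Bash with pipes and redirects: confirm the change and keep a log of the check.
run("grep -n LOG_LEVEL workdir/app/logger.py | tee workdir/check.log")

# Git operations: stage and commit the edits.
run("git -C workdir add -A && git -C workdir commit -m 'Set default log level'")
```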
Score = weighted average of correctness (70%) and efficiency (30%). Models lose points for failed tool calls, wrong commands, and unnecessary steps.
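As a back-of-the-envelope sketch of that formula: the 70/30 weighted average is straightforward, but the per-mistake penalty sizes below are assumptions for illustration, not published constants.

```python
def benchmark_score(correctness: float, efficiency: float,
                    failed_tool_calls: int = 0, wrong_commands: int = 0,
                    unnecessary_steps: int = 0) -> float:
    """70/30 weighted score on a 0-100 scale; penalty sizes are assumed, not official."""
    base = 0.7 * correctness + 0.3 * efficiency
    penalty = 2.0 * (failed_tool_calls + wrong_commands) + 1.0 * unnecessary_steps
    return max(0.0, base - penalty)

# Example: strong correctness, middling efficiency, one wasted step -> 86.0
print(benchmark_score(correctness=90, efficiency=80, unnecessary_steps=1))
```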
The bottom of the table
| Model | Score |
|---|---|
| DeepSeek-R1 1.5B | 27.5 |
| Qwen3.5 0.8B | 26.0 |
| Google Lyria 3 Pro | 8.3 |
| Google Lyria 3 Clip | 0.0 |
The smallest models (sub-2B reasoning models) couldn't complete basic tool sequences. Google's Lyria models in particular struggled — Lyria 3 Clip scored zero, unable to produce any working output.
What this means
Small models are getting dangerously good at agentic coding. SmolLM3 3B — a model you can run on a laptop — outperformed every frontier model by a wide margin. The benchmark suggests model size isn't the bottleneck for agent coding ability.
Full results and methodology: benchmarks.workswithagents.dev
The benchmark runs continuously — new models are added as they become available. If you're building a model that should be tested, the API is open.