The second round of the Works With Agents agent coding benchmark is in — 32 models tested this time, up from 10. And the results are not what anyone expected.
The headline: tiny models won
| Rank | Model | Score |
|---|---|---|
| 🥇 | SmolLM3 3B | 93.3 |
| 🥈 | Phi-4-mini | 90.0 |
| 🥉 | Claude Sonnet 4 | 85.0 |
| 4 | Qwen2.5 1.5B | 85.0 |
| 5 | Qwen2.5 3B | 85.0 |
| 6 | Granite 3.2 2B | 82.5 |
| 7 | Ministral 3B | 81.7 |
| 8 | Mistral Large 3 | 79.6 |
| 9 | Gemma 4 31B | 78.3 |
| 10 | Gemma 4 26B A4B | 78.3 |
A 3-billion-parameter model from Hugging Face scored 93.3, a full 8.3 points ahead of Claude Sonnet 4. Phi-4-mini (also a tiny model) took second at 90.0. Qwen2.5's 1.5B and 3B variants tied Claude at 85.0.
Frontier model results
| Model | Score |
|---|---|
| Claude Sonnet 4 | 85.0 |
| GPT-5.4 | 76.6 |
| Gemini 2.5 Flash | 76.4 |
| Kimi K2.6 | 75.0 |
| Grok 4.20 | 75.0 |
| MiniMax M2.7 | 69.9 |
| DeepSeek V4 Flash | 60.0 |
| GPT-5.5 | 60.0 |
| GPT-5.4 Pro | 51.6 |
| GPT-5.5 Pro | 43.3 |
| DeepSeek V4 Pro | 38.3 |
Grok 4.20 debuted at 75.0, tied with Kimi K2.6 and narrowly ahead of its Fast sibling (74.9). DeepSeek V4 Pro scored 38.3, well below its Flash variant. GPT-5.5 Pro and GPT-5.4 Pro both underperformed their base models substantially.
What the benchmark tests
The benchmark evaluates real agent coding over 12 rounds (a sketch of one such round follows the list):
- Multi-file edits (Python, shell scripts)
- Git operations (clone, branch, commit)
- Shell command execution
- Bash scripting with pipes and redirects
- Error recovery
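To make that list concrete, here is a rough sketch of the kind of tool-call sequence a single round might require. It is illustrative only; the repository URL, branch name, and file paths are hypothetical, not taken from the benchmark harness.

```python
import subprocess

def run(cmd: str) -> str:
    """Run one shell command; a non-zero exit raises (failed tool calls cost points)."""
    result = subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True)
    return result.stdout

# Hypothetical round: clone a repo, branch, edit a file, verify with a pipe, commit.
run("git clone https://example.com/sample-repo.git workdir")
run("git -C workdir checkout -b fix/logging")

# Multi-file edit step: append a setting to a Python module (file name is invented).
with open("workdir/app/logger.py", "a") as f:
    f.write("\nLOG_LEVEL = 'INFO'\n")

# Bash with pipes and redirects: confirm the change and keep a log of the check.
run("grep -n LOG_LEVEL workdir/app/logger.py | tee workdir/check.log")

# Git operations: stage and commit the edits.
run("git -C workdir add -A && git -C workdir commit -m 'Set default log level'")
```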
Score = weighted average of correctness (70%) and efficiency (30%). Models lose points for failed tool calls, wrong commands, and unnecessary steps.
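As a back-of-the-envelope sketch of that formula: the 70/30 weighted average is straightforward, but the per-mistake penalty sizes below are assumptions for illustration, not published constants.

```python
def benchmark_score(correctness: float, efficiency: float,
                    failed_tool_calls: int = 0, wrong_commands: int = 0,
                    unnecessary_steps: int = 0) -> float:
    """70/30 weighted score on a 0-100 scale; penalty sizes are assumed, not official."""
    base = 0.7 * correctness + 0.3 * efficiency
    penalty = 2.0 * (failed_tool_calls + wrong_commands) + 1.0 * unnecessary_steps
    return max(0.0, base - penalty)

# Example: strong correctness, middling efficiency, one wasted step -> 86.0
print(benchmark_score(correctness=90, efficiency=80, unnecessary_steps=1))
```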
The bottom of the table
| Model | Score |
|---|---|
| DeepSeek-R1 1.5B | 27.5 |
| Qwen3.5 0.8B | 26.0 |
| Google Lyria 3 Pro | 8.3 |
| Google Lyria 3 Clip | 0.0 |
The smallest models (sub-2B reasoning models) couldn't complete basic tool sequences. Google's Lyria models in particular struggled — Lyria 3 Clip scored zero, unable to produce any working output.
What this means
Small models are getting dangerously good at agentic coding. SmolLM3 3B — a model you can run on a laptop — outperformed every frontier model by a wide margin. The benchmark suggests model size isn't the bottleneck for agent coding ability.
Full results and methodology: benchmarks.workswithagents.dev
The benchmark runs continuously — new models are added as they become available. If you're building a model that should be tested, the API is open.