1-bit, 545 megabytes, zero API keys — local AI that beats GPT-5.4
By Vilius Vystartas | May 2026
I ran the same 10 agent coding tasks against 8 locally running models on my Mac. No cloud, no API keys, no per-token billing. The results surprised me enough that I ran them twice.
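For context, here's a minimal sketch of the kind of harness this implies, assuming a local OpenAI-compatible endpoint (llama.cpp's server, Ollama, or similar) on localhost. The task list, the pass/fail check, and the model name are hypothetical; the post doesn't publish its exact setup.

```python
# Hypothetical benchmark harness: send each task to a local OpenAI-compatible
# server (no cloud, no API key) and score pass/fail plus wall-clock time.
import time
import requests

# Placeholder tasks; the real suite and its checks are not published here.
TASKS = [
    {"prompt": "Write a Python function that reverses a linked list.",
     "check": lambda out: "def " in out},
]

def run_model(base_url: str, model: str) -> tuple[float, float]:
    passed, start = 0, time.time()
    for task in TASKS:
        resp = requests.post(f"{base_url}/v1/chat/completions", json={
            "model": model,
            "messages": [{"role": "user", "content": task["prompt"]}],
        })
        output = resp.json()["choices"][0]["message"]["content"]
        passed += task["check"](output)
    return passed / len(TASKS), time.time() - start

score, elapsed = run_model("http://localhost:8080", "bonsai-4b-1bit")
print(f"{score:.0%} in {elapsed:.0f}s")
```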
The leaderboard
| Model | Bits | Size | Score | Time |
|---|---|---|---|---|
| Qwen 3.5 9B | 4-bit | ~5GB | 83% | 190s |
| AgenticQwen 8B | 4-bit | ~5GB | 82% | 189s |
| Bonsai 4B | 1-bit | 545MB | 80% | 18s |
| Ternary Bonsai 1.7B | 2-bit | 442MB | 80% | 10s |
| Bonsai 8B | 1-bit | 1.1GB | 80% | 15s |
| Ternary Bonsai 4B | 2-bit | 1.0GB | 80% | 20s |
| Ternary Bonsai 8B | 2-bit | 2.1GB | 78% | 22s |
| Bonsai 1.7B | 1-bit | 237MB | 73% | 8s |
A 545MB model beats GPT-5.4
Bonsai 4B at 1-bit quantization scores 80% on the same tasks where GPT-5.4 scored 75%. Half a gigabyte. No data center. Every request is processed locally on your laptop, with no network round trip. It's also roughly ten times faster than the Qwen models (18s vs ~190s) because there's far less to compute.
The 4-bit baselines tie Claude
The 4-bit Qwen models at ~5GB score 82-83% — matching Claude Sonnet 4's cloud performance. On a Mac. These aren't toys.
1-bit vs 2-bit (ternary): the extra bit is dead weight
At the 1.7B size, ternary helps: 80% vs 73%. But at 4B the two tie at 80%, and at 8B the 2-bit version actually scores slightly lower (78% vs 80%). The extra bit costs roughly double the disk (1.0GB vs 545MB, 2.1GB vs 1.1GB) for zero gain. At these sizes, 1-bit quantization has already captured everything the model can offer.
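The disk numbers line up with simple back-of-the-envelope math. A rough sketch, assuming weight storage dominates the file (embeddings, quantization scales, and metadata account for the remaining overhead):

```python
# Rough on-disk size: parameter count x bits per weight, converted to GB.
# Assumes weights dominate; real files add embeddings, scales, and metadata.
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params in (1.7, 4, 8):
    for bits in (1, 2):
        print(f"{params}B @ {bits}-bit ≈ {weight_size_gb(params, bits):.2f} GB")
# 4B @ 1-bit ≈ 0.50 GB and 4B @ 2-bit ≈ 1.00 GB,
# in line with the 545MB and 1.0GB in the leaderboard above.
```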
What this means
You can run an agent coding model that beats GPT-5.4 on a laptop with no internet. For regulated industries — healthcare, finance, government — this removes the compliance headache. No data leaves the device. No vendor API agreement to negotiate. No per-request billing to track.
The Bonsai findings are also on benchmarks.workswithagents.dev, refreshed with each run, alongside the cloud models for direct comparison.
I didn't expect a 545MB quantized model to beat a cutting-edge cloud API. But here we are.
Top comments (1)
The quantization story is genuinely impressive. The part that gets complicated next is chaining these models together.
A 545MB Bonsai model running locally for code generation, a local clinical NER model for entity extraction, a local compliance checker - each one has a completely different input format, output schema, and field naming conventions. The coding model returns structured diffs. The NER model returns wordpiece tokens that need reassembly. The compliance checker expects a flat obligations array with specific field names none of the other models use.
Right now the answer is custom connector code between every pair. Three models means three connectors. Ten models means 45. Each one breaks when any model updates its schema. For regulated industries running entirely local pipelines - which is exactly the use case this post describes - that maintenance burden compounds fast.
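A quick sketch of the arithmetic behind those counts, assuming one connector per pair of models:

```python
# Custom connector code between every pair of n models: n * (n - 1) / 2 connectors.
def connectors(n_models: int) -> int:
    return n_models * (n_models - 1) // 2

print(connectors(3))   # 3
print(connectors(10))  # 45
```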
The quantization problem is solved. The schema translation problem between chained local specialist models is the next frontier.