1-bit, 545 megabytes, zero API keys — local AI that beats GPT-5.4


By Vilius Vystartas | May 2026

I ran the same 10 agent coding tasks against 8 locally-running models on my Mac. No cloud, no API keys, no per-token billing. The results surprised me enough that I ran them twice.
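To give a feel for the setup, here is a minimal harness sketch of that kind of run. The task list, checks, and the `ask` callable (which would wrap whatever local runtime serves the model) are all placeholders, not the actual suite used for these numbers.

```python
import time

def run_suite(tasks, ask):
    """Score one local model: fraction of tasks passed, plus wall-clock time.

    tasks: list of {"prompt": str, "check": callable} dicts, where check
           inspects the model's reply and returns True on success.
    ask:   any callable that sends a prompt to the locally running model
           and returns its text reply (e.g. a wrapper around a local HTTP
           endpoint) -- injected so the harness stays runtime-agnostic.
    """
    start = time.perf_counter()
    passed = sum(1 for t in tasks if t["check"](ask(t["prompt"])))
    elapsed = time.perf_counter() - start
    return passed / len(tasks), elapsed
```

Running each model through the same `tasks` list twice, as described above, is then just two calls to `run_suite` per model.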

The leaderboard

| Model | Bits | Size | Score | Time |
| --- | --- | --- | --- | --- |
| Qwen 3.5 9B | 4-bit | ~5GB | 83% | 190s |
| AgenticQwen 8B | 4-bit | ~5GB | 82% | 189s |
| Bonsai 4B | 1-bit | 545MB | 80% | 18s |
| Ternary Bonsai 1.7B | 2-bit | 442MB | 80% | 10s |
| Bonsai 8B | 1-bit | 1.1GB | 80% | 15s |
| Ternary Bonsai 4B | 2-bit | 1.0GB | 80% | 20s |
| Ternary Bonsai 8B | 2-bit | 2.1GB | 78% | 22s |
| Bonsai 1.7B | 1-bit | 237MB | 73% | 8s |

A 545MB model beats GPT-5.4

Bonsai 4B at 1-bit quantization scores 80% on the same tasks where GPT-5.4 scored 75%. Half a gigabyte. No data center. Your laptop processes every request locally, with no network latency. And it's roughly 10x faster than the Qwen models by wall clock (18s vs 190s) because there's far less to compute.

4-bit models tie Claude

The 4-bit Qwen models at ~5GB score 82-83% — matching Claude Sonnet 4's cloud performance. On a Mac. These aren't toys.

1-bit vs 2-bit (ternary): the extra bit is dead weight

At the 1.7B size, ternary helps — 80% vs 73%. But at 4B and 8B, 1-bit and 2-bit perform identically (80%). That extra bit costs double the disk (1.0GB vs 545MB, 2.1GB vs 1.1GB) for zero gain. At larger model sizes, 1-bit quantization has already captured everything the model can offer.
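The disk costs above are almost pure weight-storage arithmetic: parameter count times bits per weight, divided by eight. A quick back-of-the-envelope check (ignoring metadata and any tensors kept at higher precision, which is why the real files run slightly larger):

```python
def quantized_size_mb(params_billion: float, bits: int) -> float:
    """Approximate on-disk size of a quantized model in megabytes.

    Pure weight math: params * bits / 8 bytes. Real files are a bit
    larger (tokenizer, metadata, some higher-precision tensors).
    """
    return params_billion * 1e9 * bits / 8 / 1e6

# 4B params at 1 bit -> ~500 MB (table: 545MB); at 2 bits -> ~1000 MB (table: 1.0GB)
```

The table tracks this closely: Bonsai 4B at 1-bit is 545MB against a theoretical ~500MB, and every 2-bit variant lands at roughly double its 1-bit sibling.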

What this means

You can run an agent coding model that beats GPT-5.4 on a laptop with no internet. For regulated industries — healthcare, finance, government — this removes the compliance headache. No data leaves the device. No vendor API agreement to negotiate. No per-request billing to track.

The Bonsai findings are also on benchmarks.workswithagents.dev, refreshed with each run, alongside the cloud models for direct comparison.

I didn't expect a 545MB quantized model to beat a cutting-edge cloud API. But here we are.

Top comments (1)

Chris Widmer

The quantization story is genuinely impressive. The part that gets complicated next is chaining these models together.
A 545MB Bonsai model running locally for code generation, a local clinical NER model for entity extraction, a local compliance checker - each one has a completely different input format, output schema, and field naming conventions. The coding model returns structured diffs. The NER model returns wordpiece tokens that need reassembly. The compliance checker expects a flat obligations array with specific field names none of the other models use.
Right now the answer is custom connector code between every pair. Three models means three pairwise connectors. Ten models means 45. Each one breaks when any model updates its schema. For regulated industries running entirely local pipelines - which is exactly the use case this post describes - that maintenance burden compounds fast.
The quantization problem is solved. The schema translation problem between chained local specialist models is the next frontier.