I got GPT-2 running on a $3 Arduino. No cloud. No subscription. Just quantization.
The Problem:
Local LLMs are great until you try to run them on real hardware. GPT-2 needs 500MB+ of memory at inference time, and the KV cache grows with every generated token. On an embedded device? Forget it.
The Solution: KVQuant
I compressed the KV cache from full precision down to 1 bit per value using per-channel symmetric quantization, with mixed INT8 kept for attention scores, where precision matters more.
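To make the idea concrete, here's a minimal NumPy sketch of per-channel symmetric 1-bit quantization. The scale choice (per-channel mean absolute value) and the tensor layout are my assumptions for illustration, not KVQuant's actual kernels:

```python
import numpy as np

def quantize_1bit(kv: np.ndarray):
    # kv: (seq_len, n_channels); each channel gets its own scale.
    # Mean |x| per channel minimizes L2 error for {-1, +1} codes.
    scale = np.abs(kv).mean(axis=0, keepdims=True)   # (1, n_channels)
    codes = np.where(kv >= 0, 1, -1).astype(np.int8) # 1-bit sign codes
    return codes, scale

def dequantize_1bit(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct: each entry becomes +/- its channel's scale.
    return codes * scale

kv = np.random.randn(128, 64).astype(np.float32)
codes, scale = quantize_1bit(kv)
recon = dequantize_1bit(codes, scale)
```

Symmetric quantization means no zero-point: codes are just signs, so storage drops from 16 bits to 1 bit per value plus one float scale per channel.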
Results:
- 3.2x faster inference
- 73% memory reduction
- Runs on ESP32-class hardware
Code:

```python
from kvquant import QuantizedModel

# Load GPT-2 with the KV cache quantized to 1 bit per value
model = QuantizedModel("gpt2", bits=1)
model.generate("Hello world")
```
Benchmark:
| Model | Memory | Latency |
|-------|--------|---------|
| FP16 GPT-2 | 520MB | 2.1s |
| KVQuant-1b | 140MB | 0.65s |
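The headline numbers follow directly from the table; a quick arithmetic check:

```python
# Sanity-check the claimed figures against the benchmark table.
fp16_mem, quant_mem = 520, 140   # MB
fp16_lat, quant_lat = 2.1, 0.65  # seconds

reduction = 1 - quant_mem / fp16_mem  # ~73% memory reduction
speedup = fp16_lat / quant_lat        # ~3.2x faster inference

print(f"{reduction:.0%} memory reduction, {speedup:.1f}x speedup")
```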
GitHub: https://github.com/AmSach/kvquant
This isn't a demo — it's a real quantization library with INT8 kernels and hardware-aware optimizations. Pull the repo, run the examples, see for yourself.
Top comments (4)
Wow, nice job! It would be interesting to see if you could then use this for an Open Claw agent. If you upgraded to a Raspberry Pi ($15-$20), you could unlock things like browser interactions plus a serious RAM and storage upgrade.
Interesting idea, I'll try to simulate it in a similar Docker container.
An actual Raspberry Pi is too expensive in my country 😭
Always curious to hear about real-world Arduino uses beyond POCs and hobby projects.
What would be a real-world use case that could be commercialized with AI here? Any thoughts?
The easiest answer would be smarter robots working independently without cloud support.
And industrial sensors for things like predictive maintenance and anomaly detection.