DEV Community

Aman Sachan

I Compressed GPT-2 to Run on an Arduino ($3 Microcontroller) — Here's How

I got GPT-2 running on a $3 Arduino. No cloud. No subscription. Just quantization.

The Problem:
Local LLMs are great until you try to run them on real hardware. GPT-2 takes 500MB+ just for the KV cache. On an embedded device? Forget it.
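
For a sense of scale, KV cache size follows from a simple formula: every layer stores one key vector and one value vector of hidden-size width per token. The sketch below assumes the published GPT-2 configurations (12 layers with hidden size 768 for the smallest model, 48 layers with hidden size 1600 for GPT-2 XL, context length 1024). The post does not say which variant, precision, or batch size its figure refers to, so treat the output as illustrative. The helper name `kv_cache_bytes` is made up for this example.

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, bits_per_value, batch_size=1):
    """Estimate KV cache size: keys + values for every layer and token."""
    n_values = 2 * n_layers * hidden_size * seq_len * batch_size
    # Ignores any per-channel scale factors a quantized format also stores.
    return n_values * bits_per_value // 8


MIB = 1024 * 1024
# Assumed configs: (layers, hidden size); context length 1024 for both.
configs = {"gpt2 (124M)": (12, 768), "gpt2-xl (1.5B)": (48, 1600)}

for name, (layers, hidden) in configs.items():
    for label, bits in [("FP32", 32), ("FP16", 16), ("1-bit", 1)]:
        size = kv_cache_bytes(layers, hidden, 1024, bits)
        print(f"{name:15s} {label:5s} {size / MIB:8.2f} MiB")
```

The cache grows linearly with context length and batch size, so the same model can land at very different memory figures depending on how it is run.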

The Solution: KVQuant
I compressed the KV cache from full precision to 1 bit per value using per-channel symmetric quantization. Attention scores, where precision matters more, are kept in INT8.
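
As a rough illustration of the technique named above, here is a minimal pure-Python sketch of per-channel symmetric quantization. It is not taken from the kvquant repository: the function names, the max-abs INT8 scale, and the choice of mean absolute value as the 1-bit scale are all assumptions made for this example.

```python
def quantize_int8_per_channel(rows):
    """Symmetric INT8: one scale per channel (row), zero-point fixed at 0."""
    quantized, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def quantize_1bit_per_channel(rows):
    """1-bit: keep only the sign, plus one scale (mean |x|) per channel."""
    signs, scales = [], []
    for row in rows:
        scales.append(sum(abs(v) for v in row) / len(row))
        signs.append([1 if v >= 0 else -1 for v in row])
    return signs, scales


def dequantize(codes, scales):
    """Map integer codes back to floats using each channel's scale."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]


# Toy "key" tensor: 2 channels x 4 tokens (hypothetical values).
keys = [[0.5, -1.0, 0.25, 2.0], [-0.1, 0.3, -0.2, 0.4]]

q8, s8 = quantize_int8_per_channel(keys)
q1, s1 = quantize_1bit_per_channel(keys)
print(dequantize(q8, s8))  # close to the original values
print(dequantize(q1, s1))  # only sign and average magnitude survive
```

"Symmetric" means the zero-point is fixed at 0, so each channel needs only a single scale factor. The 1-bit variant discards everything except the sign, which is why it saves so much memory and also why it loses so much detail.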

Results:

  • 3.2x faster inference
  • 73% memory reduction
  • Runs on ESP32-class hardware

Code:

```python
from kvquant import QuantizedModel

# Load GPT-2 with the KV cache quantized to 1 bit per value
model = QuantizedModel("gpt2", bits=1)
model.generate("Hello world")
```

Benchmark:
| Model | Memory | Latency |
|-------|--------|---------|
| FP16 GPT-2 | 520MB | 2.1s |
| KVQuant-1b | 140MB | 0.65s |
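
The headline numbers in the results list follow directly from this table, as the short check below shows. The variable names are just for illustration.

```python
# Values copied from the benchmark table above.
fp16_memory_mb, quant_memory_mb = 520, 140
fp16_latency_s, quant_latency_s = 2.1, 0.65

memory_reduction = 1 - quant_memory_mb / fp16_memory_mb
speedup = fp16_latency_s / quant_latency_s

print(f"Memory reduction: {memory_reduction:.0%}")  # 73%
print(f"Speedup: {speedup:.1f}x")                   # 3.2x
```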

GitHub: https://github.com/AmSach/kvquant

This isn't a demo — it's a real quantization library with INT8 kernels and hardware-aware optimizations. Pull the repo, run the examples, see for yourself.

Top comments (4)

bingkahu (Matteo)

Wow, nice job! It would be interesting to see whether you could use this for an Open Claw agent. If you upgraded to a Raspberry Pi ($15-$20), you could unlock things like browser interactions and a serious RAM and storage upgrade.

Aman Sachan

Interesting idea. I'll try to simulate it in a similar Docker setup.
An actual Raspberry Pi is too expensive in my country 😭

Atomlit Labs

Always curious to know about real-world Arduino uses other than POCs or projects.

What real-world use case could be commercialized with AI here? Any thoughts?

Aman Sachan

The easiest thing would be smarter robots that work independently without cloud support.
Also industrial sensors and similar devices for predictive maintenance and anomaly detection.