I got GPT-2 running on a $3 Arduino. No cloud. No subscription. Just quantization.
The Problem:
Local LLMs are great until you try to run them on real hardware. GPT-2 needs 500MB+ of memory at inference time, and the KV cache grows with every generated token. On an embedded device? Forget it.
The Solution: KVQuant
I compressed the KV cache from full precision down to 1 bit per value using per-channel symmetric quantization, with mixed INT8 kept for attention scores, where precision matters more.
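To make the idea concrete, here's a minimal NumPy sketch of per-channel symmetric 1-bit quantization. The scale choice (per-channel mean absolute value) and the tensor layout are my assumptions for illustration, not KVQuant's actual kernels:

```python
import numpy as np

def quantize_1bit(kv: np.ndarray):
    # kv: (seq_len, n_channels); each channel gets its own scale.
    # Mean |x| per channel minimizes L2 error for {-1, +1} codes.
    scale = np.abs(kv).mean(axis=0, keepdims=True)   # (1, n_channels)
    codes = np.where(kv >= 0, 1, -1).astype(np.int8) # 1-bit sign codes
    return codes, scale

def dequantize_1bit(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct: each entry becomes +/- its channel's scale.
    return codes * scale

kv = np.random.randn(128, 64).astype(np.float32)
codes, scale = quantize_1bit(kv)
recon = dequantize_1bit(codes, scale)
```

Symmetric quantization means no zero-point: codes are just signs, so storage drops from 16 bits to 1 bit per value plus one float scale per channel.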
Results:
- 3.2x faster inference
- 73% memory reduction
- Runs on ESP32-class hardware
Code:

```python
from kvquant import QuantizedModel

# Load GPT-2 with the KV cache quantized to 1 bit per value
model = QuantizedModel("gpt2", bits=1)
model.generate("Hello world")
```
Benchmark:
| Model | Memory | Latency |
|-------|--------|---------|
| FP16 GPT-2 | 520MB | 2.1s |
| KVQuant-1b | 140MB | 0.65s |
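The headline numbers follow directly from the table; a quick arithmetic check:

```python
# Sanity-check the claimed figures against the benchmark table.
fp16_mem, quant_mem = 520, 140   # MB
fp16_lat, quant_lat = 2.1, 0.65  # seconds

reduction = 1 - quant_mem / fp16_mem  # ~73% memory reduction
speedup = fp16_lat / quant_lat        # ~3.2x faster inference

print(f"{reduction:.0%} memory reduction, {speedup:.1f}x speedup")
```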
GitHub: https://github.com/AmSach/kvquant
This isn't a demo — it's a real quantization library with INT8 kernels and hardware-aware optimizations. Pull the repo, run the examples, see for yourself.
Top comments (4)
Wow, nice job! It would be interesting to see if you could then use this for an Open Claw agent. If you upgraded to a Raspberry Pi ($15-$20), you could unlock things like browser interactions plus a serious RAM and storage upgrade.
Interesting idea, I'll try to simulate it in a similar Docker container.
An actual Raspberry Pi is too expensive in my country 😭
Always curious to hear about real-world Arduino uses beyond POCs and hobby projects.
What would be a real-world use case that could be commercialized with AI here? Any thoughts?
The easiest answer would be smarter robots working independently without cloud support.
And industrial sensors for things like predictive maintenance and anomaly detection.