Dhruv Sood
Building a Voice-Controlled Local AI Agent (End-to-End)

I recently built a fully local, voice-controlled AI agent that can listen to audio, understand user intent, and execute real actions like creating files or generating code. Here’s a quick breakdown of how it works and what I learned along the way.

🧠 Architecture Overview

The system follows a clean pipeline:

Audio → Speech-to-Text → Intent Detection → Tool Execution → UI Output

Speech-to-Text (STT): I used Faster-Whisper running locally for accurate and fast transcription.
Intent Detection: A lightweight LLM (phi3:latest via Ollama) classifies what the user wants.
Tool Execution: Based on intent, the system triggers actions like:
Creating files
Writing code
Summarizing text
General chat
Frontend: Built with Flask + simple UI to visualize each stage of the pipeline.
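Wired together, the stages above look roughly like this. The stage functions below are placeholders standing in for Faster-Whisper, the Ollama classifier, and the tool layer; the names are illustrative, not the project's actual API:

```python
def transcribe(audio_path: str) -> str:
    # Placeholder: the real stage runs Faster-Whisper on the audio file.
    return "create a file called notes.txt"

def detect_intent(text: str) -> str:
    # Placeholder: the real stage asks the LLM to classify the request.
    return "create_file" if "create a file" in text else "chat"

def execute(intent: str, text: str) -> str:
    # Placeholder: the real stage dispatches to a tool based on intent.
    handlers = {
        "create_file": lambda t: f"created file from: {t}",
        "chat": lambda t: f"chat reply to: {t}",
    }
    return handlers[intent](text)

def run_pipeline(audio_path: str) -> str:
    # One pass through: Audio -> STT -> Intent -> Tool -> output string.
    text = transcribe(audio_path)
    intent = detect_intent(text)
    return execute(intent, text)
```

Keeping each stage behind a plain function boundary like this is what made it easy to swap models later without touching the rest of the pipeline.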
🤖 Why These Models?
🎤 Faster-Whisper
Works locally (no API dependency)
Handles multiple audio formats
Good balance of speed and accuracy on CPU
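For reference, the transcription step has roughly this shape. The `WhisperModel`/`transcribe` calls match the faster-whisper package's documented API, but the model size and `compute_type` here are assumptions tuned for CPU use, not necessarily my exact settings:

```python
# The model call itself (requires the faster-whisper package):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("small", device="cpu", compute_type="int8")
#   segments, info = model.transcribe("input.wav")
#   text = join_segments(segments)

def join_segments(segments) -> str:
    # Faster-Whisper yields segment objects with a .text attribute;
    # stitch them into a single transcript string.
    return " ".join(seg.text.strip() for seg in segments)
```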
🧠 Phi-3 (via Ollama)
Lightweight (~2–4GB runtime)
Fast inference → avoids timeouts
Reliable for structured outputs (JSON)

Initially, I tried larger models like Qwen, but they caused latency issues and frequent timeouts on my hardware. Switching to Phi-3 made the system much more responsive.
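The classification step looks roughly like the sketch below, assuming Ollama's local HTTP API on its default port. The prompt wording and intent labels are illustrative; the parsing helper is kept pure and defensive, because LLM output is never guaranteed to be well-formed JSON:

```python
import json

INTENTS = ["create_file", "write_code", "summarize", "chat"]

def build_prompt(user_text: str) -> str:
    # Ask the model for a single JSON object so it can be parsed reliably.
    return (
        f"Classify the request into one of {INTENTS}. "
        'Reply with JSON like {"intent": "chat"}.\n'
        f"Request: {user_text}"
    )

def parse_intent(raw: str) -> str:
    # Fall back to "chat" on malformed JSON or an unknown label.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "chat"
    intent = data.get("intent") if isinstance(data, dict) else None
    return intent if intent in INTENTS else "chat"

# The actual call (not run here) would look roughly like:
#   import requests
#   r = requests.post("http://localhost:11434/api/generate",
#                     json={"model": "phi3:latest",
#                           "prompt": build_prompt(text),
#                           "format": "json", "stream": False})
#   intent = parse_intent(r.json()["response"])
```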

⚙️ Key Features
Supports both mic input and file upload
Multi-intent handling (e.g., “create a file and write code”)
Safety sandbox (output/ folder)
Chat history memory
Streaming responses for better UX
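Of these, the safety sandbox is the easiest to get subtly wrong. A minimal sketch of the idea, assuming a plain `output/` folder and a helper name of my own choosing: resolve the candidate path and reject anything that escapes the sandbox, so a misheard or malicious filename like `../../etc/passwd` can't climb out.

```python
from pathlib import Path

SANDBOX = Path("output").resolve()

def safe_path(name: str) -> Path:
    # Resolve symlinks and ".." segments, then verify the result is
    # still inside the sandbox; relative_to() raises ValueError if not.
    candidate = (SANDBOX / name).resolve()
    candidate.relative_to(SANDBOX)
    return candidate
```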
😵 Challenges I Faced

1. ❌ Model Timeouts

Large models were too slow for real-time interaction. Requests would simply hang or return empty responses.

👉 Fix: Switched to a smaller model and capped the number of generated tokens.

2. 🎤 Speech-to-Text Errors

Whisper sometimes misheard:

hello.py → hello.5
dot py → .5

👉 Fix: Added preprocessing rules to normalize filenames.
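The normalization rules look roughly like this. The substitutions mirror the mishearings above ("hello.5" for "hello.py", "dot py" for ".py"); the rule list is illustrative, not exhaustive, and broad rules like the ".5" one can over-correct, so they're worth scoping carefully:

```python
import re

# (pattern, replacement) pairs applied in order to the raw transcript.
RULES = [
    (r"\bdot\s+py\b", ".py"),  # "dot py"  -> ".py"
    (r"\.5\b", ".py"),         # Whisper hears "py" as "5"
    (r"\s+\.py\b", ".py"),     # "hello .py" -> "hello.py"
]

def normalize_transcript(text: str) -> str:
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text
```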

3. 🧠 Incorrect Intent Detection

The model often classified everything as “chat,” even when the user clearly wanted to create a file.

👉 Fix: Added rule-based overrides (hybrid system = rules + LLM).
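The hybrid scheme is simple: deterministic keyword rules run first and override the LLM, and the model's answer is only used when no rule fires. The keyword lists below are illustrative, not my full rule set:

```python
# Ordered (intent, trigger phrases) pairs; first match wins.
RULE_INTENTS = [
    ("create_file", ("create a file", "make a file", "new file")),
    ("write_code", ("write code", "write a script", "generate code")),
    ("summarize", ("summarize", "summary of")),
]

def classify(text: str, llm_intent: str = "chat") -> str:
    lowered = text.lower()
    for intent, keywords in RULE_INTENTS:
        if any(k in lowered for k in keywords):
            return intent
    return llm_intent  # no rule fired; trust the model's guess
```

Because the rules are cheap string checks, they also run before the LLM call, which shaves latency off the common cases.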

4. 🔄 Streaming Bugs

Enabling streaming broke responses because I was still parsing them like normal JSON.

👉 Fix: Switched to chunk-based parsing for streaming responses.
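Concretely: with `stream: true`, Ollama emits one JSON object per line (e.g. `{"response": "...", "done": false}`) rather than a single JSON body, so each line has to be decoded separately. A minimal sketch, where `lines` stands in for something like `response.iter_lines()`:

```python
import json

def collect_stream(lines) -> str:
    # Accumulate the "response" fragment from each chunk until the
    # server signals completion with "done": true.
    parts = []
    for line in lines:
        if not line:
            continue  # skip keep-alive / blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

In the real app the fragments are forwarded to the UI as they arrive instead of being joined at the end, but the per-line decoding is the same.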

What I Learned
Smaller, faster models are often better for real-time systems.
LLMs alone are not reliable for control logic — rules are essential.
Preprocessing (especially for speech input) is critical.
Good UX (like streaming) makes a huge difference in perception.

Final Thoughts
This project taught me how to build a practical AI system—not just a model, but a full pipeline that works reliably in real-world conditions.

If I were to extend this further, I’d add:
Real-time voice streaming
Persistent memory (vector DB)
Better UI with live token streaming

Thanks for reading! 🚀
