Introduction
In this project, I built a voice-controlled AI agent that processes audio input, converts it into text, detects user intent, and performs actions such as file creation, code generation, summarization, and chat.
System Architecture
The system follows a simple pipeline:
Audio Input → Speech-to-Text → Intent Detection → Action Execution → UI Output
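The stages of this pipeline can be sketched as a chain of functions. This is a minimal illustration rather than the project's actual code: `run_pipeline`, `PipelineResult`, and the three stage callables are hypothetical names, and each stage is passed in as a parameter so the flow can be shown without committing to a particular model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    """Everything the UI needs to display after one voice command."""
    transcript: str
    intent: str
    output: str


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    detect_intent: Callable[[str], str],
    execute: Callable[[str, str], str],
) -> PipelineResult:
    """Run the stages in order: audio -> text -> intent -> action."""
    transcript = transcribe(audio_path)
    intent = detect_intent(transcript)
    output = execute(intent, transcript)
    return PipelineResult(transcript, intent, output)
```

Keeping the stages as plain callables means any one of them can be swapped or stubbed out without touching the others.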
Speech-to-Text
I used OpenAI Whisper, running as a local model, to convert audio into text. Whisper was trained on a large and varied speech dataset, so it stays fairly accurate across different accents and with background noise.
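A local transcription step with the `openai-whisper` package might look like the sketch below. The `"base"` model size and the `transcribe` wrapper are assumptions for illustration; the post does not say which model size the project uses. The optional `model` parameter is also an assumption, added so an already loaded model can be reused between calls.

```python
from typing import Any, Optional


def transcribe(audio_path: str, model: Optional[Any] = None,
               model_name: str = "base") -> str:
    """Return the transcript of an audio file using a local Whisper model."""
    if model is None:
        # Imported lazily so this module loads even where Whisper is missing.
        import whisper  # pip install openai-whisper (also requires ffmpeg)
        model = whisper.load_model(model_name)  # "base" is an assumed size
    result = model.transcribe(audio_path)  # dict with "text", "segments", ...
    return result["text"].strip()
```

Loading the model is the slow part, so in a long-running app it is usually worth loading it once and passing it in on every call.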
Intent Detection
The system analyzes the transcribed text and classifies it into:
- Create File
- Write Code
- Summarize Text
- General Chat
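The post does not describe how the classification is implemented, so the following is only one possible approach: a keyword-rule sketch that maps a transcript to one of the four intents. The rule table, the label strings, and `detect_intent` are all hypothetical; the real project may well use an LLM or another classifier instead.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
INTENT_RULES = [
    ("write_code", re.compile(r"\b(code|function|script|program)\b")),
    ("create_file", re.compile(r"\b(create|make|new)\b.*\bfile\b")),
    ("summarize", re.compile(r"\b(summarize|summarise|summary)\b")),
]


def detect_intent(text: str) -> str:
    """Classify a transcript, falling back to general chat."""
    lowered = text.lower()
    for intent, pattern in INTENT_RULES:
        if pattern.search(lowered):
            return intent
    return "chat"
```

Rules like these are easy to debug but brittle: a request phrased without any of the keywords falls through to chat, which is a safe default because chat has no side effects.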
Actions
Based on the detected intent, the system performs:
- File creation inside a safe output directory
- Code generation and saving into files
- Text summarization
- Chat responses
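For the file-related actions, the key detail is keeping every write inside the safe output directory. A minimal sketch of that check is shown below, assuming the directory is called `output`; that name, `safe_path`, and `create_file` are illustrative and not taken from the project.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # assumed name for the safe directory


def safe_path(filename: str, base: Path = OUTPUT_DIR) -> Path:
    """Resolve filename inside base, rejecting anything that escapes it."""
    base = base.resolve()
    target = (base / filename).resolve()
    # Rejects "../" traversal, absolute paths, and the directory itself.
    if target == base or not target.is_relative_to(base):  # Python 3.9+
        raise ValueError(f"refusing to write outside {base}: {filename!r}")
    return target


def create_file(filename: str, content: str = "", base: Path = OUTPUT_DIR) -> Path:
    """Create (or overwrite) a file under base and return its full path."""
    target = safe_path(filename, base)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target
```

Resolving the path before comparing it matters here, because the filename comes from transcribed speech or generated text and a name such as `../../notes.txt` would otherwise escape the directory.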
User Interface
I used Streamlit to build a simple and interactive UI that displays:
- Transcribed text
- Detected intent
- Action results
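A Streamlit page that shows these three items could be laid out roughly as follows. This is a sketch under assumptions: the project's real layout is not shown in the post, and `render_app` and its `handle_audio` callback (which is expected to return the transcript, intent, and result) are hypothetical names.

```python
from typing import Callable, Tuple


def render_app(handle_audio: Callable[[bytes], Tuple[str, str, str]]) -> None:
    """Draw a one-page UI; handle_audio returns (transcript, intent, result)."""
    import streamlit as st  # pip install streamlit

    st.title("Voice AI Agent")
    uploaded = st.file_uploader("Upload a voice command", type=["wav", "mp3", "m4a"])
    if uploaded is None:
        st.info("Upload an audio file to get started.")
        return

    audio_bytes = uploaded.getvalue()
    st.audio(audio_bytes)  # let the user play back what was uploaded
    transcript, intent, result = handle_audio(audio_bytes)

    st.subheader("Transcribed text")
    st.write(transcript)
    st.subheader("Detected intent")
    st.write(intent)
    st.subheader("Action result")
    st.write(result)
```

To try it, call `render_app(...)` with your own handler at the bottom of a script and launch that script with `streamlit run`.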
Challenges Faced
- Handling speech recognition errors
- Managing file safety using a restricted output directory
- Designing a clean UI pipeline
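The post does not say how recognition errors were handled. One possible approach, sketched here as an assumption, is to check the confidence fields that Whisper attaches to each segment of its result (`avg_logprob` and `no_speech_prob`) and ask the user to repeat the command when they look poor. The `is_reliable` helper and its thresholds are hypothetical and would need tuning on real recordings.

```python
def is_reliable(result: dict, min_logprob: float = -1.0,
                max_no_speech: float = 0.6) -> bool:
    """Heuristic check on a Whisper result dict; thresholds are assumptions."""
    segments = result.get("segments", [])
    if not result.get("text", "").strip() or not segments:
        return False  # nothing was transcribed
    avg_logprob = sum(s["avg_logprob"] for s in segments) / len(segments)
    worst_no_speech = max(s["no_speech_prob"] for s in segments)
    return avg_logprob >= min_logprob and worst_no_speech <= max_no_speech
```

Gating actions on a check like this keeps a misheard command from creating or overwriting files; an unreliable transcript can simply be shown to the user for confirmation.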
Conclusion
This project demonstrates how to build a local AI agent that integrates speech processing, NLP, and automation into a single system.
Links
- GitHub Repository: https://github.com/Vedant-Jagtap/voice-ai-agent.git
- Demo Video: https://youtu.be/KwK0PrQG9Z4?si=bKxDWaHV6tQPZEwH