Hopkins Jesse

I Built a Local RAG Agent in 48 Hours — Here's Why It Matters

I spent last weekend building a local retrieval-augmented generation (RAG) agent.

It took 46 of those 48 hours to get from concept to a working prototype.

Most developers are still obsessed with cloud-based LLMs and massive API bills.

They ignore the quiet revolution happening on our own machines.

Local models have gotten good enough for serious work.

I used Llama-3-8B quantized to 4-bit precision.

It runs on my MacBook Pro M2 with 16GB RAM.

No GPU cluster required. No monthly subscription fees.

Responses feel instant once the model loads.

Per-token latency dropped from roughly 800 ms with my old cloud setup to 120 ms locally.

This isn't just about saving money.

It is about data sovereignty and privacy.

I can feed it proprietary codebases without fear.

Nobody is talking about this shift because it lacks hype.

There are no venture capital firms funding "offline AI" yet.

But the developer experience is superior for specific tasks.

The Problem With Cloud Dependencies

In early 2025, I migrated a legacy documentation system to a cloud vector store.

It worked well until the API provider changed their pricing tier.

My monthly bill jumped from $45 to $320 overnight.

That was a wake-up call.

I realized my entire workflow depended on external uptime.

When their status page turned red, my productivity stopped.

I couldn't query my own notes.

I couldn't search my codebase.

I was locked out of my own knowledge.

Cloud providers offer convenience, but they rent you access.

They do not sell you ownership.

For sensitive client data, sending snippets to a third-party server is a liability.

Legal teams hate it. Developers tolerate it.

I wanted a solution that lived entirely on my disk.

Zero network calls after the initial download.

Complete isolation from internet outages.

Building The Local Stack

I chose a simple stack to minimize complexity.

Ollama handles the model serving.

LangChain manages the orchestration logic.

ChromaDB stores the vector embeddings locally.

The setup process was surprisingly smooth.

I installed Ollama via Homebrew.

brew install ollama
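
One note: the CLI needs the Ollama server running before it can pull or serve anything. Starting it manually works:

ollama serve

(Homebrew should also be able to keep it running in the background via brew services start ollama.)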

Then I pulled the Llama-3 model.

ollama pull llama3:8b-instruct-q4_K_M

The quantization level matters here.

Q4_K_M offers the best balance of speed and accuracy for consumer hardware.

I tested Q8_0 first, but it consumed too much memory.

My system started swapping to disk, killing performance.

Q4_K_M kept memory usage under 6GB.
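
That figure squares with a rough back-of-envelope estimate: 8 billion parameters at roughly 4.5 to 5 bits per weight is just under 5GB of weights, and the KV cache plus runtime overhead account for the rest.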

This left plenty of room for the operating system and other apps.
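
On the Python side, the script below assumes roughly these packages (exact versions vary; the unstructured extra is what DirectoryLoader relies on to parse markdown):

pip install langchain langchain-community langchain-text-splitters chromadb "unstructured[md]"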

Here is the core ingestion script I wrote:

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

def ingest_docs(path: str):
    # Load every markdown file under the target directory.
    loader = DirectoryLoader(path, glob="**/*.md")
    docs = loader.load()

    # Split into overlapping chunks so retrieval returns
    # focused passages instead of whole files.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50
    )
    chunks = splitter.split_documents(docs)

    # Embed each chunk with a local Ollama model and write
    # the vectors to disk.
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./local_db"
    )
    # No-op on Chroma 0.4+, which persists automatically;
    # kept for compatibility with older versions.
    vectorstore.persist()
    print(f"Ingested {len(chunks)} chunks")

ingest_docs("./docs")

This script processes markdown files from a directory.

It splits them into manageable chunks.

Nomic-embed-text is a lightweight embedding model.

It runs locally alongside the main LLM.
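
Like the chat model, it has to be pulled once before the script can use it:

ollama pull nomic-embed-text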

The entire indexing process for 500 files took 12 minutes.

That is a one-time cost.

Subsequent queries are near-instant.
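
The query side is just the ingestion stack in reverse. Here is a minimal sketch, assuming the same persist directory and model names as above (RetrievalQA is one of several chain styles LangChain offers for this):

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain.chains import RetrievalQA

# Reopen the persisted index with the same embedding model.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(
    persist_directory="./local_db",
    embedding_function=embeddings
)

# Fetch the top 4 chunks and let the local LLM answer from them.
llm = ChatOllama(model="llama3:8b-instruct-q4_K_M", temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

answer = qa.invoke({"query": "How do we handle authentication in the payment module?"})
print(answer["result"])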

Performance Reality Check

Local models are not magic.

They have limitations compared to GPT-4o or Claude 3.5 Sonnet.

I ran a benchmark suite against three scenarios.

| Scenario | Cloud Model (GPT-4o) | Local Model (Llama-3-8B) | Verdict |
| --- | --- | --- | --- |
| Code Generation | 9/10 accuracy | 7/10 accuracy | Cloud wins for complex logic |
| Summarization | 8/10 quality | 8/10 quality | Tie for basic summaries |
| Data Extraction | 95% precision | 88% precision | Cloud slightly better |
| Latency (avg) | 1.2 s | 0.4 s | Local is 3x faster |
| Cost per 1k requests | $0.03 | $0.00 | Local is free |

The local model struggles with nuanced reasoning.

If you ask it to refactor a complex React hook, it might miss edge cases.

But for searching documentation or answering factual questions, it is excellent.

I use it primarily as a semantic search engine.

I ask: "How do we handle authentication in the payment module?"

It retrieves the relevant code snippets instantly.

Then I paste those snippets into Cursor or VS Code.

I let the heavier cloud model handle the actual refactoring.

This hybrid approach saves me most of my old API bill while keeping retrieval local, private, and fast.

💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.
