The frustrating problem
You set up a local model runner, downloaded a decent 7B or 13B, pointed it at a folder of your personal notes... and the answers are garbage. It either hallucinates wildly or returns "I don't have information about that" when the answer is literally in the documents you fed it.
I've been down this rabbit hole for the past few months trying to build a personal knowledge base for non-coding life stuff — medical history, financial records, journal entries, recipe notes, household maintenance logs. The promise is great: local, private, no API costs, no data going to a vendor. The reality is that most "just point LLM at folder" setups produce frustratingly bad results.
The issue is almost never the model. It's the retrieval layer.
Root cause: garbage in, garbage out
When you ask a local LLM about your documents, you're not actually feeding it all your documents at once. The context window can't hold them. Instead, a retrieval pipeline does this (a minimal code sketch follows the list):
- Documents → chunks
- Chunks → embeddings (vectors)
- Query → embedding
- Find similar chunks via vector search
- Stuff retrieved chunks into the prompt
- LLM generates answer from that context
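Here's what that looks like end to end, as a minimal sketch using chromadb and sentence-transformers (the same libraries used later in this post). The collection name, embedding model, and sample chunk are placeholders, and the final LLM call is left to whatever runner you use:

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
client = chromadb.PersistentClient(path="./kb")
collection = client.get_or_create_collection("personal_notes")

# Ingest: documents -> chunks -> embeddings (chunking is covered in Step 1)
chunks = ["Tuesday - fixed the leaky pipe under the sink. PTFE tape, about $40 in parts."]
collection.add(
    documents=chunks,
    embeddings=embedder.encode(chunks, normalize_embeddings=True).tolist(),
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

# Query: embed the question, find similar chunks, stuff them into the prompt
query = "how much did I spend on plumbing?"
results = collection.query(
    query_embeddings=embedder.encode([query], normalize_embeddings=True).tolist(),
    n_results=5,
)
context = "\n\n".join(results["documents"][0])
prompt = f"Answer from the notes below.\n\nNotes:\n{context}\n\nQuestion: {query}"
# `prompt` then goes to whatever local model you run (Ollama, llama.cpp, etc.)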
Every step here can sabotage you. The three most common failures I've debugged in my own setup:
- Bad chunking — splitting mid-sentence or grouping unrelated content together
- Wrong embedding model — using one that doesn't capture semantic similarity for your domain
- Insufficient retrieval — returning the top 3 chunks when you need top 10 to reconstruct context
Let me walk through how to fix each.
Step 1: Fix your chunking
Default text splitters chunk by character count. For a journal entry like "Tuesday — finally fixed the leaky pipe under the sink. Used PTFE tape. Cost about $40 in parts," a naive 500-char split might leave "Cost about" in one chunk and "$40 in parts" in the next. Now neither chunk cleanly answers "how much did I spend on plumbing last month?"
Use a recursive splitter with overlap and semantic boundaries:
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=80,  # critical — bleed context across boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # try paragraph breaks first
    length_function=len,
)
chunks = splitter.split_text(document)
The overlap matters more than people realize. If a key sentence sits on a chunk boundary, overlap means retrieval can find it from either side.
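A quick way to convince yourself: split the journal example with and without overlap, using toy sizes just to force a boundary through the sentence (the numbers here are for demonstration only, not production settings):

from langchain_text_splitters import RecursiveCharacterTextSplitter

entry = ("Tuesday - finally fixed the leaky pipe under the sink. "
         "Used PTFE tape. Cost about $40 in parts.")

no_overlap = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=0, separators=[" ", ""])
with_overlap = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=20, separators=[" ", ""])

print(no_overlap.split_text(entry))    # the boundary can land mid-sentence
print(with_overlap.split_text(entry))  # adjacent chunks repeat up to 20 chars across the boundary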
Step 2: Pick the right embedding model
Most tutorials default to whatever's first in the docs. For mixed personal content (journals, receipts, medical notes, recipes), generic models often miss subtle semantic links between how you wrote something and how you'd later ask about it. Test a few against your actual data:
from sentence_transformers import SentenceTransformer
# General purpose — solid baseline
model_a = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Tuned for question-answering style queries
model_b = SentenceTransformer("intfloat/e5-base-v2")
# Run the same query through both and compare retrieved chunks
query = "what did I take for headaches last year?"
emb_a = model_a.encode(query, normalize_embeddings=True)
emb_b = model_b.encode("query: " + query, normalize_embeddings=True)  # e5 expects a "query: " prefix on queries
I keep a small eval set of ~20 real questions I'd actually ask my knowledge base, then run them through each pipeline and grade the retrieved chunks by hand. Takes an afternoon. Saves weeks.
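The grading loop doesn't need to be fancy. A sketch of the idea, with questions and chunks standing in for your own eval set and chunked notes:

import numpy as np

questions = ["what did I take for headaches last year?"]
chunks = ["...your chunked notes..."]

for name, model, q_prefix, d_prefix in [
    ("bge", model_a, "", ""),
    ("e5", model_b, "query: ", "passage: "),  # e5 wants these prefixes on queries and passages
]:
    chunk_embs = model.encode([d_prefix + c for c in chunks], normalize_embeddings=True)
    for q in questions:
        q_emb = model.encode(q_prefix + q, normalize_embeddings=True)
        scores = chunk_embs @ q_emb            # cosine similarity (vectors are normalized)
        top = np.argsort(scores)[::-1][:5]     # indices of the 5 closest chunks
        print(f"[{name}] {q}")
        for i in top:
            print("   ", chunks[i][:80])       # grade these by hand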
Step 3: Don't trust top-k=3
The default in most RAG examples is to retrieve the top 3 chunks. For knowledge spread across multiple documents — like figuring out the timeline of when you switched insurance providers — three chunks won't cut it.
Combine wider retrieval with reranking:
import chromadb
from sentence_transformers import CrossEncoder
client = chromadb.PersistentClient(path="./kb")
collection = client.get_collection("personal_notes")
# Cast a wider net first
results = collection.query(
    query_texts=["timeline of insurance changes"],
    n_results=20,  # over-retrieve, then prune
)
# Rerank with a cross-encoder — slower but much more accurate
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [["timeline of insurance changes", doc] for doc in results["documents"][0]]
scores = reranker.predict(pairs)
# Take the top 8 after reranking
ranked = sorted(zip(scores, results["documents"][0]), reverse=True)[:8]
final_context = "\n\n".join(doc for _, doc in ranked)
The first pass is fast and approximate. The reranker is slow but actually reads each chunk against the query. This two-stage pattern fixed roughly 70% of the wrong answers I was getting on my own data. The cost is latency — expect another 200-500ms depending on hardware — but for a personal tool that's fine.
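The last step, stuffing the reranked chunks into the prompt, is the boring part. A rough template (the wording is just an example, and the local model call is whatever you already run):

prompt = (
    "Answer the question using only the notes below. "
    "If the notes don't contain the answer, say so.\n\n"
    f"Notes:\n{final_context}\n\n"
    "Question: timeline of insurance changes\nAnswer:"
)
# send `prompt` to your local model (Ollama, llama.cpp, LM Studio, ...)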
Step 4: Add metadata filtering
Personal knowledge bases have natural categories — date, source, document type. Without filtering, your query "what was my blood pressure last winter" might pull a chunk from a 2019 doctor visit because the language is similar. Tag everything as you ingest:
collection.add(
    documents=[chunk_text],
    metadatas=[{
        "source": "journal",
        "date": "2025-11-14",   # human-readable
        "date_int": 20251114,   # Chroma's range operators ($gte, $lte) only work on numbers
        "topic": "health",
    }],
    ids=[chunk_id],
)
# Then constrain retrieval at query time
results = collection.query(
    query_texts=["blood pressure readings"],
    n_results=10,
    where={"$and": [
        {"topic": "health"},
        {"date_int": {"$gte": 20241201}},
    ]},
)
For dates especially, this is night and day. The LLM stops conflating events from different years.
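If you store a numeric date field alongside the readable one, a tiny helper at ingest time keeps it consistent. The name date_to_int is just illustrative:

def date_to_int(iso_date: str) -> int:
    # "2025-11-14" -> 20251114, so Chroma's numeric range filters ($gte, $lte) can use it
    return int(iso_date.replace("-", ""))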
Prevention tips
A few habits that keep the system useful long-term:
- Re-embed when you switch models. Vectors from one embedder are not comparable to vectors from another. If you upgrade, rebuild the whole index.
- Test with real queries, not synthetic ones. "What does the document say about X?" is not how you actually use it. Write down what you'd really ask.
- Log every query and its retrieved chunks. When an answer is wrong, you need to know whether retrieval failed or generation failed. They have different fixes (a minimal logging sketch follows this list).
- Don't over-engineer ingestion. Markdown files in a folder, watched by a small script, is enough. I wasted two weeks building a fancy ingestion pipeline before realizing my actual problem was reranking, not ingestion.
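For the logging habit, something as simple as an append-only JSONL file is enough. A sketch (log_query is a hypothetical helper, not part of any library):

import json, time

def log_query(query, retrieved_ids, answer, path="query_log.jsonl"):
    # Append one record per question so you can later tell retrieval failures from generation failures
    with open(path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "query": query,
            "retrieved_ids": retrieved_ids,  # which chunks the retriever picked
            "answer": answer,                # what the model said
        }) + "\n")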
The "I just want to chat with my notes" use case sounds simple, but it's actually a stack of small problems stacked on top of each other. Get the retrieval right and even a 7B model gives surprisingly useful answers. Get retrieval wrong and even a 70B model looks dumb.
I'm still iterating on mine. Currently testing whether a hybrid BM25 + dense retrieval gives better recall for proper nouns — names, places, specific medications — where pure embeddings sometimes miss the exact match. I haven't tested this thoroughly yet, but the early results look promising.
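If you want to try the same experiment, here's a sketch of what hybrid retrieval can look like. It uses rank_bm25 for the sparse side (my assumption; any BM25 implementation works) and reciprocal rank fusion to merge the two rankings. Here chunks is the chunk list and dense_ranking is the index ranking from the dense retriever:

from rank_bm25 import BM25Okapi

# Sparse side: BM25 over whitespace-tokenized chunks, so proper nouns match exactly
tokenized = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenized)

def hybrid_search(query, dense_ranking, k=10):
    # Fuse BM25 and dense rankings with reciprocal rank fusion (RRF)
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_ranking = sorted(range(len(chunks)), key=lambda i: bm25_scores[i], reverse=True)
    fused = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)  # the usual RRF constant of 60
    return sorted(fused, key=fused.get, reverse=True)[:k]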