The Knowledge Base Boundary Problem
Previous articles optimized retrieval quality — better chunking, more precise ranking, smarter query formulation. But one fundamental problem was always sidestepped:
What if the knowledge base simply doesn't contain the answer?
When a user asks something outside the knowledge base's coverage, vector retrieval still returns "the most similar" documents — but these documents may have nothing to do with what the user actually wants to know. The LLM, handed these documents, either hallucinates an answer grounded in irrelevant content, or says "I can't answer based on the provided context." Neither outcome is acceptable.
This is traditional RAG's blind spot: it never questions the quality of what it retrieved. It blindly passes documents to the LLM regardless of relevance.
CRAG (Corrective RAG), proposed in 2024, adds a self-correction step: evaluate retrieved document quality, and when it falls short, actively trigger a web search as a supplement or replacement — rather than generating answers from low-quality context.
The CRAG Flow
User question
↓
Vector retrieval (knowledge base)
↓
Relevance scoring: score each retrieved doc from 0.0 to 1.0
↓
Three-way verdict
├─ CORRECT (avg ≥ 0.7) → use knowledge base docs directly
├─ INCORRECT (avg ≤ 0.3) → discard KB, trigger web search
└─ AMBIGUOUS (in between) → merge KB docs + web search results
↓
(Web search results are refined by LLM to extract key information)
↓
Generate final answer from assembled docs
The key distinction from Self-RAG:
- Self-RAG answers "should we retrieve?" — decides before retrieval
- CRAG answers "are the retrieval results good enough?" — evaluates after retrieval, corrects when needed
LangGraph Implementation
State
from typing import TypedDict

from langchain_core.documents import Document

class CRAGState(TypedDict):
    question: str
    retrieved_docs: list[Document]
    doc_scores: list[float]           # relevance score per document
    overall_score: float              # average score
    retrieval_verdict: str            # "correct" | "ambiguous" | "incorrect"
    web_results: str                  # raw web search output
    refined_web_docs: list[Document]  # LLM-refined web search content
    final_docs: list[Document]        # final docs sent to LLM
    answer: str
    path: list[str]
Key Node: Relevance Scoring (score)
This is CRAG's core — scoring each retrieved document independently:
from langchain_core.prompts import ChatPromptTemplate

RELEVANCE_SCORE_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Rate how relevant the following document is to the question, "
     "on a scale from 0.0 to 1.0.\n"
     "- 1.0: directly and completely answers the question\n"
     "- 0.5: partially relevant but incomplete\n"
     "- 0.0: completely unrelated\n\n"
     "Output only a float number, no explanation."),
    ("human", "Question: {question}\n\nDocument: {document}"),
])
from langchain_core.output_parsers import StrOutputParser

def make_score_node(llm):
    chain = RELEVANCE_SCORE_PROMPT | llm | StrOutputParser()

    def score_docs(state):
        scores = []
        for doc in state["retrieved_docs"]:
            raw = chain.invoke({
                "question": state["question"],
                "document": doc.page_content[:400],
            })
            try:
                score = float(raw.strip())
            except ValueError:
                score = 0.0  # treat unparseable LLM output as irrelevant
            scores.append(max(0.0, min(1.0, score)))
        overall = sum(scores) / len(scores) if scores else 0.0
        verdict = ("correct" if overall >= 0.7
                   else "incorrect" if overall <= 0.3
                   else "ambiguous")
        return {**state, "doc_scores": scores, "overall_score": overall,
                "retrieval_verdict": verdict}

    return score_docs
Key Node: Web Search + Refinement (web_search)
Raw search results are noisy. CRAG adds an LLM refinement step to extract signal:
REFINE_PROMPT = ChatPromptTemplate.from_messages([
("system",
"From the following web search results, extract the key information "
"most relevant to the question. Remove noise, keep core facts."),
("human", "Question: {question}\n\nSearch results:\n{search_results}\n\nExtract:"),
])
def make_web_search_node(search_tool, llm):
    refine_chain = REFINE_PROMPT | llm | StrOutputParser()

    def web_search(state):
        try:
            raw_results = search_tool.invoke(state["question"])
            refined = refine_chain.invoke({
                "question": state["question"],
                "search_results": raw_results[:2000],
            })
            web_doc = Document(page_content=refined,
                               metadata={"source": "web_search"})
            return {**state, "web_results": raw_results,
                    "refined_web_docs": [web_doc]}
        except Exception:
            # Graceful fallback when network is unavailable
            return {**state, "web_results": "", "refined_web_docs": []}

    return web_search
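Here search_tool can be any LangChain tool whose invoke(query) returns a string. A minimal sketch using the DuckDuckGo tool from langchain-community (an assumption for illustration; the repo may wire up a different backend), with llm being the same chat model used in the other nodes:

from langchain_community.tools import DuckDuckGoSearchRun

search_tool = DuckDuckGoSearchRun()  # no API key needed; .invoke(query) returns one string
web_search_node = make_web_search_node(search_tool, llm)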
Key Node: Document Assembly (assemble)
Decides which documents to use based on the verdict:
def make_assemble_node():
    def assemble(state):
        verdict = state["retrieval_verdict"]
        if verdict == "correct":
            scored = sorted(zip(state["retrieved_docs"], state["doc_scores"]),
                            key=lambda x: x[1], reverse=True)
            final = [doc for doc, s in scored if s >= 0.3] or [scored[0][0]]
        elif verdict == "incorrect":
            # Prefer web search; fall back to the best KB doc if unavailable
            final = state.get("refined_web_docs", [])
            if not final and state["retrieved_docs"]:
                scored = sorted(zip(state["retrieved_docs"], state["doc_scores"]),
                                key=lambda x: x[1], reverse=True)
                final = [scored[0][0]]
        else:  # ambiguous: merge both sources
            scored = zip(state["retrieved_docs"], state["doc_scores"])
            kb_docs = [doc for doc, s in scored if s >= 0.3]
            final = kb_docs + state.get("refined_web_docs", [])
        return {**state, "final_docs": final}
    return assemble
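A quick sanity check of the fallback path in the incorrect branch, using a dummy state (illustrative only):

assemble = make_assemble_node()
state = {
    "retrieval_verdict": "incorrect",
    "retrieved_docs": [Document(page_content="off-topic KB doc")],
    "doc_scores": [0.1],
    "refined_web_docs": [],  # simulate an unavailable web search
}
print(assemble(state)["final_docs"])  # falls back to the best-scored KB doc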
Graph Structure
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "score")
graph.add_conditional_edges(
"score",
lambda s: "web_search" if s["retrieval_verdict"] != "correct" else "assemble",
{"web_search": "web_search", "assemble": "assemble"},
)
graph.add_edge("web_search", "assemble")
graph.add_edge("assemble", "generate")
graph.add_edge("generate", END)
Experimental Results
Execution Path Details
CRAG execution paths:
Q1: retrieve → score(ambiguous, 0.62) → web_search(ok) → assemble(5docs) → generate
"What is RAG and what problem does it solve?"
Q2: retrieve → score(incorrect, 0.12) → web_search(ok) → assemble(1docs) → generate
"Which vector database should I use for enterprise apps?"
Q3: retrieve → score(ambiguous, 0.45) → web_search(ok) → assemble(4docs) → generate
"Which embedding model is recommended for Chinese text?"
Q4: retrieve → score(incorrect, 0.20) → web_search(ok) → assemble(1docs) → generate
"What chunk size is recommended for document splitting?"
Q5: retrieve → score(ambiguous, 0.38) → web_search(ok) → assemble(3docs) → generate
"What are the four core metrics in the RAGAS framework?"
Q6: retrieve → score(incorrect, 0.00) → web_search(ok) → assemble(1docs) → generate
"What is the formula for the RRF fusion algorithm?"
Q7: retrieve → score(incorrect, 0.17) → web_search(ok) → assemble(1docs) → generate
"How does HyDE query optimization work?"
Q8: retrieve → score(ambiguous, 0.33) → web_search(ok) → assemble(3docs) → generate
"How do production RAG systems implement multi-tenant isolation?"
Verdict distribution: correct=0, ambiguous=4, incorrect=4
Not a single question was scored as correct. 4 were ambiguous, 4 were incorrect. The scoring model is applying strict standards about whether knowledge base documents actually answer the question.
Q6 (RRF formula, score 0.00) and Q7 (HyDE principles, score 0.17) are particularly telling — these are concepts introduced in later articles in this series, so the knowledge base genuinely has no coverage. The scorer correctly identified this and triggered web search.
RAGAS Metrics
======================================================================
RAGAS Metrics Comparison (Always Retrieve vs CRAG)
======================================================================
Metric Always Retrieve CRAG Delta
──────────────────────────────────────────────────────────
context_recall 0.625 0.625 →+0.000
context_precision 0.444 0.875 ↑+0.431 ◀
faithfulness 0.810 0.907 ↑+0.097
answer_relevancy 0.402 0.368 ↓-0.033
======================================================================
context_precision +0.431 — the single largest improvement across all experiments in this series.
Why Such a Large Jump?
context_precision measures whether relevant documents are ranked above irrelevant ones.
Always-retrieve feeds all top-4 documents to the LLM equally, regardless of relevance. CRAG's score node assigns a precise relevance score to each document; the assemble node filters by score and sorts by quality. The document set the LLM receives is cleaner and more targeted.
More critically: for incorrect queries, CRAG completely discards the low-quality knowledge base documents and substitutes web search results that directly address the question. These results land at the top of the ranking by definition — context_precision approaches 1.0 for those queries.
The baseline's context_precision of 0.444 (below 0.5!) tells the story clearly: in many queries, relevant documents were actually ranked below irrelevant ones. CRAG's scoring and filtering mechanism inverts this completely.
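To see why filtering moves this metric so much, here is a toy computation in the spirit of RAGAS context_precision (a rank-weighted average of precision@k at each relevant position; simplified for illustration):

def toy_context_precision(relevance: list[bool]) -> float:
    # Average of precision@k taken at every rank k that holds a relevant doc
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(toy_context_precision([False, False, True, True]))  # ~0.42: relevant docs buried
print(toy_context_precision([True, True]))                # 1.0: filtered and re-sorted

Dropping the two irrelevant documents and promoting the relevant ones takes the toy score from roughly 0.42 to 1.0, which mirrors the jump in the table above.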
CRAG vs Self-RAG
| Dimension | Self-RAG | CRAG |
|---|---|---|
| Decision timing | Before retrieval (should we retrieve?) | After retrieval (are results good enough?) |
| Core problem | Avoiding unnecessary retrieval | Correcting poor retrieval quality |
| Fallback | Direct generation (no retrieval) | Web search (external knowledge) |
| Best scenario | Mixed-intent system; some queries need no retrieval | Knowledge base with limited coverage |
| Key nodes | decide → route | score → assemble |
The two are complementary, not alternatives. Self-RAG can decide whether to retrieve; CRAG can evaluate and correct what was retrieved. In a production system they can be combined.
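A hypothetical wiring of the two (node names and the needs_retrieval flag are illustrative, not from either implementation): a Self-RAG-style decide node routes first, and CRAG's score/correct loop runs only when retrieval actually happens.

combined = StateGraph(CRAGState)          # state would also need a needs_retrieval flag
combined.add_node("decide", decide_node)  # Self-RAG-style: "is retrieval needed at all?"
for name, node in [("retrieve", retrieve_node), ("score", make_score_node(llm)),
                   ("web_search", make_web_search_node(search_tool, llm)),
                   ("assemble", make_assemble_node()), ("generate", generate_node)]:
    combined.add_node(name, node)
combined.set_entry_point("decide")
combined.add_conditional_edges(
    "decide",
    lambda s: "retrieve" if s["needs_retrieval"] else "generate",
    {"retrieve": "retrieve", "generate": "generate"},
)
combined.add_edge("retrieve", "score")
combined.add_conditional_edges(
    "score",
    lambda s: "assemble" if s["retrieval_verdict"] == "correct" else "web_search",
    {"assemble": "assemble", "web_search": "web_search"},
)
combined.add_edge("web_search", "assemble")
combined.add_edge("assemble", "generate")
combined.add_edge("generate", END)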
When to Use CRAG
CRAG is a strong fit when:
- Your knowledge base has limited or uneven coverage — users regularly ask questions outside its scope
- Your knowledge base updates slowly while users need current information
- Answer quality is the top priority and you can absorb the extra cost of scoring + web search
Caveats to consider:
- Score threshold calibration: The 0.7 / 0.3 thresholds need tuning per knowledge base. Our thresholds were strict enough that zero queries were "correct"; a production deployment might use softer thresholds (see the sketch after this list)
- Web search reliability: Search results vary in quality; the LLM refinement step is important, and the system needs to degrade gracefully when the network is unavailable
- Cost increase: Scoring (4 LLM calls per query) + possible web search + refinement adds significant token overhead compared to basic retrieval
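One way to make the cutoffs tunable per knowledge base (a hypothetical refactor of the verdict logic in the score node; upper/lower are illustrative names):

def make_verdict(overall: float, upper: float = 0.7, lower: float = 0.3) -> str:
    # Map the average relevance score to a CRAG verdict with tunable cutoffs
    if overall >= upper:
        return "correct"
    if overall <= lower:
        return "incorrect"
    return "ambiguous"

print(make_verdict(0.62))                        # "ambiguous" under the defaults
print(make_verdict(0.62, upper=0.6, lower=0.2))  # "correct" with softer cutoffs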
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/15-crag
Key file:
- crag.py: full CRAG implementation with LangGraph
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/15-crag
cp .env.example .env
pip install -r requirements.txt
python crag.py
Summary
This article implemented CRAG with LangGraph. Key findings:
- Score-driven filtering is the root cause of the dramatic context_precision improvement (+0.431). Scoring documents, filtering by quality, and sorting by relevance is far more effective than blindly using top-k results.
- Web search fallback patches knowledge base coverage gaps — when the KB genuinely can't answer, CRAG fetches and refines external content rather than hallucinating.
- CRAG and Self-RAG are complementary: Self-RAG handles "should we retrieve?", CRAG handles "is what we retrieved good enough?". In production systems they can be layered.
One insight worth highlighting: the scoring and filtering mechanism itself is independently valuable. Even without connecting a web search backend, grafting CRAG's score node onto a standard RAG pipeline would significantly improve context_precision.
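As a closing sketch of that graft (hypothetical helper; the only requirements are a retriever and the score node defined earlier):

score_docs = make_score_node(llm)

def filtered_retrieve(question, retriever, min_score=0.3):
    # Standard top-k retrieval, then CRAG-style scoring, filtering, and re-sorting
    docs = retriever.invoke(question)
    state = score_docs({"question": question, "retrieved_docs": docs})
    scored = sorted(zip(docs, state["doc_scores"]), key=lambda x: x[1], reverse=True)
    return [doc for doc, s in scored if s >= min_score] or docs[:1]

Nothing else in the pipeline has to change, which is what makes the score node independently useful.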