The Knowledge Base Boundary Problem
Previous articles optimized retrieval quality — better chunking, more precise ranking, smarter query formulation. But one fundamental problem was always sidestepped:
What if the knowledge base simply doesn't contain the answer?
When a user asks something outside the knowledge base's coverage, vector retrieval still returns "the most similar" documents — but these documents may have nothing to do with what the user actually wants to know. The LLM, handed these documents, either hallucinates an answer grounded in irrelevant content, or says "I can't answer based on the provided context." Neither outcome is acceptable.
This is traditional RAG's blind spot: it never questions the quality of what it retrieved. It blindly passes documents to the LLM regardless of relevance.
CRAG (Corrective RAG), proposed in 2024, adds a self-correction step: evaluate retrieved document quality, and when it falls short, actively trigger a web search as a supplement or replacement — rather than generating answers from low-quality context.
The CRAG Flow
User question
↓
Vector retrieval (knowledge base)
↓
Relevance scoring: score each retrieved doc from 0.0 to 1.0
↓
Three-way verdict
├─ CORRECT (avg ≥ 0.7) → use knowledge base docs directly
├─ INCORRECT (avg ≤ 0.3) → discard KB, trigger web search
└─ AMBIGUOUS (in between) → merge KB docs + web search results
↓
(Web search results are refined by LLM to extract key information)
↓
Generate final answer from assembled docs
The key distinction from Self-RAG:
- Self-RAG answers "should we retrieve?" — decides before retrieval
- CRAG answers "are the retrieval results good enough?" — evaluates after retrieval, corrects when needed
LangGraph Implementation
State
from typing import TypedDict

from langchain_core.documents import Document

class CRAGState(TypedDict):
    question: str
    retrieved_docs: list[Document]
    doc_scores: list[float]           # relevance score per document
    overall_score: float              # average score
    retrieval_verdict: str            # "correct" | "ambiguous" | "incorrect"
    web_results: str                  # raw web search output
    refined_web_docs: list[Document]  # LLM-refined web search content
    final_docs: list[Document]        # final docs sent to LLM
    answer: str
    path: list[str]
Key Node: Relevance Scoring (score)
This is CRAG's core — scoring each retrieved document independently:
from langchain_core.prompts import ChatPromptTemplate

RELEVANCE_SCORE_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Rate how relevant the following document is to the question, "
     "on a scale from 0.0 to 1.0.\n"
     "- 1.0: directly and completely answers the question\n"
     "- 0.5: partially relevant but incomplete\n"
     "- 0.0: completely unrelated\n\n"
     "Output only a float number, no explanation."),
    ("human", "Question: {question}\n\nDocument: {document}"),
])
from langchain_core.output_parsers import StrOutputParser

def make_score_node(llm):
    chain = RELEVANCE_SCORE_PROMPT | llm | StrOutputParser()

    def score_docs(state):
        scores = []
        for doc in state["retrieved_docs"]:
            raw = chain.invoke({
                "question": state["question"],
                "document": doc.page_content[:400],
            })
            try:
                score = float(raw.strip())
            except ValueError:
                score = 0.0  # treat unparseable LLM output as irrelevant
            scores.append(max(0.0, min(1.0, score)))
        overall = sum(scores) / len(scores) if scores else 0.0
        verdict = ("correct" if overall >= 0.7
                   else "incorrect" if overall <= 0.3
                   else "ambiguous")
        return {**state, "doc_scores": scores, "overall_score": overall,
                "retrieval_verdict": verdict}

    return score_docs
Key Node: Web Search + Refinement (web_search)
Raw search results are noisy. CRAG adds an LLM refinement step to extract signal:
REFINE_PROMPT = ChatPromptTemplate.from_messages([
("system",
"From the following web search results, extract the key information "
"most relevant to the question. Remove noise, keep core facts."),
("human", "Question: {question}\n\nSearch results:\n{search_results}\n\nExtract:"),
])
def make_web_search_node(search_tool, llm):
    refine_chain = REFINE_PROMPT | llm | StrOutputParser()

    def web_search(state):
        try:
            raw_results = search_tool.invoke(state["question"])
            refined = refine_chain.invoke({
                "question": state["question"],
                "search_results": raw_results[:2000],
            })
            web_doc = Document(page_content=refined,
                               metadata={"source": "web_search"})
            return {**state, "web_results": raw_results,
                    "refined_web_docs": [web_doc]}
        except Exception:
            # Graceful fallback when network is unavailable
            return {**state, "web_results": "", "refined_web_docs": []}

    return web_search
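Here search_tool can be any LangChain tool whose invoke(query) returns a string. A minimal sketch using the DuckDuckGo tool from langchain-community (an assumption for illustration; the repo may wire up a different backend), with llm being the same chat model used in the other nodes:

from langchain_community.tools import DuckDuckGoSearchRun

search_tool = DuckDuckGoSearchRun()  # no API key needed; .invoke(query) returns one string
web_search_node = make_web_search_node(search_tool, llm)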
Key Node: Document Assembly (assemble)
Decides which documents to use based on the verdict:
def make_assemble_node():
    def assemble(state):
        verdict = state["retrieval_verdict"]
        if verdict == "correct":
            scored = sorted(zip(state["retrieved_docs"], state["doc_scores"]),
                            key=lambda x: x[1], reverse=True)
            final = [doc for doc, s in scored if s >= 0.3] or [scored[0][0]]
        elif verdict == "incorrect":
            # Prefer web search; fall back to the best KB doc if unavailable
            final = state.get("refined_web_docs", [])
            if not final and state["retrieved_docs"]:
                scored = sorted(zip(state["retrieved_docs"], state["doc_scores"]),
                                key=lambda x: x[1], reverse=True)
                final = [scored[0][0]]
        else:  # ambiguous: merge both sources
            scored = zip(state["retrieved_docs"], state["doc_scores"])
            kb_docs = [doc for doc, s in scored if s >= 0.3]
            final = kb_docs + state.get("refined_web_docs", [])
        return {**state, "final_docs": final}
    return assemble
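A quick sanity check of the fallback path in the incorrect branch, using a dummy state (illustrative only):

assemble = make_assemble_node()
state = {
    "retrieval_verdict": "incorrect",
    "retrieved_docs": [Document(page_content="off-topic KB doc")],
    "doc_scores": [0.1],
    "refined_web_docs": [],  # simulate an unavailable web search
}
print(assemble(state)["final_docs"])  # falls back to the best-scored KB doc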
Graph Structure
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "score")
graph.add_conditional_edges(
"score",
lambda s: "web_search" if s["retrieval_verdict"] != "correct" else "assemble",
{"web_search": "web_search", "assemble": "assemble"},
)
graph.add_edge("web_search", "assemble")
graph.add_edge("assemble", "generate")
graph.add_edge("generate", END)
Experimental Results
Execution Path Details
CRAG execution paths:
Q1: retrieve → score(ambiguous, 0.62) → web_search(ok) → assemble(5docs) → generate
"What is RAG and what problem does it solve?"
Q2: retrieve → score(incorrect, 0.12) → web_search(ok) → assemble(1docs) → generate
"Which vector database should I use for enterprise apps?"
Q3: retrieve → score(ambiguous, 0.45) → web_search(ok) → assemble(4docs) → generate
"Which embedding model is recommended for Chinese text?"
Q4: retrieve → score(incorrect, 0.20) → web_search(ok) → assemble(1docs) → generate
"What chunk size is recommended for document splitting?"
Q5: retrieve → score(ambiguous, 0.38) → web_search(ok) → assemble(3docs) → generate
"What are the four core metrics in the RAGAS framework?"
Q6: retrieve → score(incorrect, 0.00) → web_search(ok) → assemble(1docs) → generate
"What is the formula for the RRF fusion algorithm?"
Q7: retrieve → score(incorrect, 0.17) → web_search(ok) → assemble(1docs) → generate
"How does HyDE query optimization work?"
Q8: retrieve → score(ambiguous, 0.33) → web_search(ok) → assemble(3docs) → generate
"How do production RAG systems implement multi-tenant isolation?"
Verdict distribution: correct=0, ambiguous=4, incorrect=4
Not a single question was scored as correct. 4 were ambiguous, 4 were incorrect. The scoring model is applying strict standards about whether knowledge base documents actually answer the question.
Q6 (RRF formula, score 0.00) and Q7 (HyDE principles, score 0.17) are particularly telling — these are concepts introduced in later articles in this series, so the knowledge base genuinely has no coverage. The scorer correctly identified this and triggered web search.
RAGAS Metrics
======================================================================
RAGAS Metrics Comparison (Always Retrieve vs CRAG)
======================================================================
Metric Always Retrieve CRAG Delta
──────────────────────────────────────────────────────────
context_recall 0.625 0.625 →+0.000
context_precision 0.444 0.875 ↑+0.431 ◀
faithfulness 0.810 0.907 ↑+0.097
answer_relevancy 0.402 0.368 ↓-0.033
======================================================================
context_precision +0.431 — the single largest improvement across all experiments in this series.
Why Such a Large Jump?
context_precision measures whether relevant documents are ranked above irrelevant ones.
Always-retrieve feeds all top-4 documents to the LLM equally, regardless of relevance. CRAG's score node assigns a precise relevance score to each document; the assemble node filters by score and sorts by quality. The document set the LLM receives is cleaner and more targeted.
More critically: for incorrect queries, CRAG completely discards the low-quality knowledge base documents and substitutes web search results that directly address the question. These results land at the top of the ranking by definition — context_precision approaches 1.0 for those queries.
The baseline's context_precision of 0.444 (below 0.5!) tells the story clearly: in many queries, relevant documents were actually ranked below irrelevant ones. CRAG's scoring and filtering mechanism inverts this completely.
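To see why filtering moves this metric so much, here is a toy computation in the spirit of RAGAS context_precision (a rank-weighted average of precision@k at each relevant position; simplified for illustration):

def toy_context_precision(relevance: list[bool]) -> float:
    # Average of precision@k taken at every rank k that holds a relevant doc
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(toy_context_precision([False, False, True, True]))  # ~0.42: relevant docs buried
print(toy_context_precision([True, True]))                # 1.0: filtered and re-sorted

Dropping the two irrelevant documents and promoting the relevant ones takes the toy score from roughly 0.42 to 1.0, which mirrors the jump in the table above.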
CRAG vs Self-RAG
| Dimension | Self-RAG | CRAG |
|---|---|---|
| Decision timing | Before retrieval (should we retrieve?) | After retrieval (are results good enough?) |
| Core problem | Avoiding unnecessary retrieval | Correcting poor retrieval quality |
| Fallback | Direct generation (no retrieval) | Web search (external knowledge) |
| Best scenario | Mixed-intent system; some queries need no retrieval | Knowledge base with limited coverage |
| Key nodes | decide → route | score → assemble |
The two are complementary, not alternatives. Self-RAG can decide whether to retrieve; CRAG can evaluate and correct what was retrieved. In a production system they can be combined.
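A hypothetical wiring of the two (node names and the needs_retrieval flag are illustrative, not from either implementation): a Self-RAG-style decide node routes first, and CRAG's score/correct loop runs only when retrieval actually happens.

combined = StateGraph(CRAGState)          # state would also need a needs_retrieval flag
combined.add_node("decide", decide_node)  # Self-RAG-style: "is retrieval needed at all?"
for name, node in [("retrieve", retrieve_node), ("score", make_score_node(llm)),
                   ("web_search", make_web_search_node(search_tool, llm)),
                   ("assemble", make_assemble_node()), ("generate", generate_node)]:
    combined.add_node(name, node)
combined.set_entry_point("decide")
combined.add_conditional_edges(
    "decide",
    lambda s: "retrieve" if s["needs_retrieval"] else "generate",
    {"retrieve": "retrieve", "generate": "generate"},
)
combined.add_edge("retrieve", "score")
combined.add_conditional_edges(
    "score",
    lambda s: "assemble" if s["retrieval_verdict"] == "correct" else "web_search",
    {"assemble": "assemble", "web_search": "web_search"},
)
combined.add_edge("web_search", "assemble")
combined.add_edge("assemble", "generate")
combined.add_edge("generate", END)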
When to Use CRAG
CRAG is a strong fit when:
- Your knowledge base has limited or uneven coverage — users regularly ask questions outside its scope
- Your knowledge base updates slowly while users need current information
- Answer quality is the top priority and you can absorb the extra cost of scoring + web search
Caveats to consider:
- Score threshold calibration: The 0.7 / 0.3 thresholds need tuning per knowledge base. Our thresholds were strict enough that zero queries were "correct"; a production deployment might use softer thresholds (see the sketch after this list)
- Web search reliability: Search results vary in quality; the LLM refinement step is important, and the system needs to degrade gracefully when the network is unavailable
- Cost increase: Scoring (4 LLM calls per query) + possible web search + refinement adds significant token overhead compared to basic retrieval
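One way to make the cutoffs tunable per knowledge base (a hypothetical refactor of the verdict logic in the score node; upper/lower are illustrative names):

def make_verdict(overall: float, upper: float = 0.7, lower: float = 0.3) -> str:
    # Map the average relevance score to a CRAG verdict with tunable cutoffs
    if overall >= upper:
        return "correct"
    if overall <= lower:
        return "incorrect"
    return "ambiguous"

print(make_verdict(0.62))                        # "ambiguous" under the defaults
print(make_verdict(0.62, upper=0.6, lower=0.2))  # "correct" with softer cutoffs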
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/15-crag
Key file:
- crag.py: full CRAG implementation with LangGraph
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/15-crag
cp .env.example .env
pip install -r requirements.txt
python crag.py
Summary
This article implemented CRAG with LangGraph. Key findings:
- Score-driven filtering is the root cause of the dramatic context_precision improvement (+0.431). Scoring documents, filtering by quality, and sorting by relevance is far more effective than blindly using top-k results.
- Web search fallback patches knowledge base coverage gaps — when the KB genuinely can't answer, CRAG fetches and refines external content rather than hallucinating.
- CRAG and Self-RAG are complementary: Self-RAG handles "should we retrieve?", CRAG handles "is what we retrieved good enough?". In production systems they can be layered.
One insight worth highlighting: the scoring and filtering mechanism itself is independently valuable. Even without connecting a web search backend, grafting CRAG's score node onto a standard RAG pipeline would significantly improve context_precision.
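As a closing sketch of that graft (hypothetical helper; the only requirements are a retriever and the score node defined earlier):

score_docs = make_score_node(llm)

def filtered_retrieve(question, retriever, min_score=0.3):
    # Standard top-k retrieval, then CRAG-style scoring, filtering, and re-sorting
    docs = retriever.invoke(question)
    state = score_docs({"question": question, "retrieved_docs": docs})
    scored = sorted(zip(docs, state["doc_scores"]), key=lambda x: x[1], reverse=True)
    return [doc for doc, s in scored if s >= min_score] or docs[:1]

Nothing else in the pipeline has to change, which is what makes the score node independently useful.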