I assumed chunking was a solved problem. Pick a text splitter, set 512 tokens, add some overlap, move on. After running structured experiments across three different data types, that assumption collapsed. The best chunker for markdown documentation actively hurt performance on code. The winner changed completely depending on what I was chunking.
TL;DR
| Data type | Winner | Headline metric |
|---|---|---|
| Markdown docs | HeadingAwareChunker | MRR 0.755 vs SlidingWindow 0.687 |
| PDFs | RecursiveChar (512 tok) | Context Recall 0.9250, RAGAS SUM 3.4249 |
| GitHub code | CodeBlockAwareChunker | RAGAS SUM 3.5680 — highest across all experiments |
RecursiveChar won on PDFs. The same chunker scored 0.5690 Context Precision on code, where roughly half the retrieved chunks were irrelevant. There is no universal best chunker. The data type decides.
What I was building
A RAG system that ingests documentation sites, PDFs, and GitHub repositories for multiple tenants, then answers developer questions with citations. Before embedding anything, I had to decide how to chunk each source type.
The standard advice is "use a recursive text splitter." Every tutorial does this. But markdown docs have headings, PDFs have paragraphs, and code has functions. A function is a complete semantic unit; split it at token 256 and you've lost the return type, the error handling, and the docstring. None of that is recoverable at query time.
So I ran experiments, changing one variable per experiment: the chunker.
The embedding model, retrieval method, reranker, LLM, and eval set stayed fixed.
RAGAS scored every pipeline on the same frozen question set.
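To make the setup concrete, here's a minimal sketch of the per-chunker evaluation loop, assuming the ragas 0.1-style `evaluate()` API and a frozen eval set stored as JSON; `retrieve` and `generate` are hypothetical stand-ins for the fixed retrieval and LLM stages, not the project's actual code.

```python
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

def run_experiment(chunker_name, retrieve, generate, eval_path="eval_set.json"):
    # The eval set is frozen before the first experiment and never regenerated.
    with open(eval_path) as f:
        eval_set = json.load(f)

    rows = []
    for item in eval_set:
        contexts = retrieve(item["question"])          # fixed retriever, fixed top-k
        answer = generate(item["question"], contexts)  # fixed LLM
        rows.append({
            "question": item["question"],
            "contexts": contexts,
            "answer": answer,
            "ground_truth": item["ground_truth"],
        })

    # Only the chunker behind retrieve() changes between runs.
    result = evaluate(
        Dataset.from_list(rows),
        metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
    )
    print(chunker_name, result)
    return result
```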
Three data types, three experiments. Here's what happened.
The full implementation, experiment notebooks, and eval sets are on GitHub.
Experiment 1: Documentation (.md / .mdx)
Corpus: FastAPI and Supabase documentation, 78 QA pairs generated by GPT-4o, frozen after generation
Chunkers tested: HeadingAwareChunker (HAC), SlidingWindow-128, RecursiveChar, SemanticBlock
Key metric: MRR (Mean Reciprocal Rank). Recall@5 tells you whether the answer is somewhere in the top 5; MRR tells you whether the right chunk comes first, not just eventually.
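For reference, here's a minimal sketch of how the two metrics are computed; `ranked_ids` and `relevant_ids` are hypothetical names for the retriever's ranked output and the gold chunk IDs from the eval set.

```python
def mrr(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per question."""
    total = 0.0
    for ranked_ids, relevant_ids in runs:
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant_ids:
                total += 1.0 / rank  # reciprocal rank of the first relevant hit
                break
    return total / len(runs)

def recall_at_k(runs, k=5):
    """Fraction of questions with at least one relevant chunk in the top k."""
    hits = sum(1 for ranked_ids, relevant_ids in runs
               if relevant_ids & set(ranked_ids[:k]))
    return hits / len(runs)
```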
| Chunker | MRR (no reranker) | Chunks produced |
|---|---|---|
| HeadingAwareChunker | 0.755 | 127 |
| SlidingWindow-128 | 0.687 | 259 |
HAC produced the same Recall@5 as SlidingWindow (~0.82) but with significantly better MRR: the right answer appeared at rank 1 more often. And HAC did it with 127 chunks versus SlidingWindow's 259. Half the chunks, better ranking, cheaper retrieval.
Why? Markdown documentation is already structured by headings. Each section covers one concept, one API endpoint, one configuration option. HAC splits exactly at those heading boundaries. SlidingWindow ignores them entirely; it cuts at token count, which means a chunk might start halfway through one concept and end halfway through the next.
The embedding model then has to encode a chunk that mixes two ideas. The resulting vector is somewhere between them, and retrieval becomes imprecise.
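For illustration, here's a minimal sketch of heading-aware splitting; this is not the project's HeadingAwareChunker, just the core idea of cutting at markdown heading boundaries instead of token counts.

```python
import re

# Matches ATX headings (#, ##, ...). A real implementation would also skip
# headings that appear inside fenced code blocks.
HEADING = re.compile(r"^#{1,6}\s+.+$", re.MULTILINE)

def split_by_headings(markdown: str) -> list[str]:
    """Return one chunk per heading-delimited section."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts:
        return [markdown.strip()]
    # Any text before the first heading becomes its own chunk.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```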
Winner: HeadingAwareChunker.
Experiment 2: PDFs
Corpus: 5 technical PDFs (FastAPI concepts, Kubernetes architecture, React patterns, Stripe API reference, AWS overview), along with 40 QA pairs
Chunkers tested: SlidingWindow-128, SemanticBlock, RecursiveChar (512 tokens, 50 overlap). HeadingAwareChunker was not included here: pymupdf4llm extracts PDFs to Markdown, but the heading hierarchy in PDFs is inconsistent across documents, and font-size-based heading detection is fragile enough that HAC's boundaries would be unreliable. The experiment focused on chunkers that work on paragraph-level structure, which is what the extraction reliably produces.
| Chunker | Context Recall | RAGAS SUM |
|---|---|---|
| RecursiveChar | 0.9250 | 3.4249 |
| SlidingWindow-128 | 0.8750 | 3.3691 |
| SemanticBlock | 0.8167 | 3.2627 |
RecursiveChar won by a clear margin. Context Recall 0.9250 versus SlidingWindow's 0.8750.
The reason is specific to how I extracted the PDFs. I used pymupdf4llm, which converts PDFs to Markdown. The output is clean paragraphs with heading markers. RecursiveChar's default split points (double newlines, then single newlines) align naturally with those paragraph boundaries. It didn't need to classify blocks or detect headings. The structure was already there; RC just respected it.
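A sketch of that PDF path, assuming pymupdf4llm for extraction and the langchain-text-splitters package for the recursive splitter; note that this splitter counts characters by default, so mirroring the 512-token setting exactly would need a token-based length function. The file name is hypothetical.

```python
import pymupdf4llm
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract the PDF to markdown: clean paragraphs plus heading markers.
md_text = pymupdf4llm.to_markdown("stripe_api_reference.pdf")

# Recursive splitting tries the separators in order: paragraph breaks first,
# then single newlines, then words, so paragraph boundaries survive intact.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(md_text)
```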
SemanticBlock failed on the Stripe API PDF. That document's navigation sidebar produced 12-token noise chunks, fragment after fragment of menu items. Those wasted retrieval slots on every single query.
Winner: RecursiveChar.
Note what just happened: HAC won on docs, RC won on PDFs. Two different data types, two different winners, and one experiment still to go.
Experiment 3: GitHub code
Corpus: the encode/httpx repository, 90 files (60 Python, 29 Markdown, 1 text), with 50 QA pairs focused on function behavior, parameters, and return values
Chunkers tested: CodeBlockAwareChunker (CBAC), RecursiveChar, SlidingWindow-128
| Chunker | Ctx Precision | Ctx Recall | RAGAS SUM |
|---|---|---|---|
| CodeBlockAwareChunker | 0.7812 | 0.9700 | 3.5680 |
| SlidingWindow-128 | 0.8278 | 0.9150 | 3.4957 |
| RecursiveChar | 0.5690 | 0.9400 | 3.2856 |
RecursiveChar scored 0.5690 on Context Precision, which means roughly half of the retrieved chunks were irrelevant to the question. The same chunker that won on PDFs failed on code.
The failure mode is straightforward. Python code is full of blank lines: between a function's docstring and its body, between logical sections inside a method, between a guard clause and the main logic. RecursiveChar splits at blank lines. So it routinely bundled two or three unrelated functions into a single chunk, averaging 457 tokens. When someone asks "what does Client.send() return," the retrieved chunk contains send() plus get() plus the __init__ method. Everything but a focused answer.
CBAC doesn't use blank lines. For Python files it uses the ast module: it finds the exact byte offsets of every function and class definition in the syntax tree, then extracts each one as a separate chunk. Zero false splits. The average chunk was 120 tokens, one complete function.
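Here's a minimal sketch of that idea for Python files (not the project's exact CodeBlockAwareChunker): parse the module, then emit one chunk per top-level function or class.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """One chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text of the definition,
            # so the docstring, body, and return statements stay together.
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks
```

A fuller version would also attach decorators and handle nested definitions; this is just the boundary-detection core.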
SlidingWindow-128 had the best Context Precision (0.8278); small windows avoid the bundling problem. But it split functions mid-body. A function's return value might land in the next window. That killed recall: 0.9150 versus CBAC's 0.9700.
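For contrast, a token sliding window looks roughly like this, assuming the tiktoken tokenizer; the stride value is illustrative, not the project's exact overlap setting. Nothing in it knows where a function starts or ends, which is exactly why recall drops.

```python
import tiktoken

def sliding_window_chunks(text: str, window: int = 128, stride: int = 96) -> list[str]:
    """Fixed-size overlapping token windows, oblivious to code structure."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), stride):
        piece = tokens[start:start + window]
        if piece:
            chunks.append(enc.decode(piece))
        if start + window >= len(tokens):
            break
    return chunks
```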
CBAC with a full reranker pipeline achieved RAGAS SUM 3.7079, the highest score across all experiments in this project; the PDF best was 3.4843.
Winner: CodeBlockAwareChunker.
Why the results differ and why they shouldn't surprise you
Each experiment picked a different chunker, but every result points at the same question: what is the natural semantic unit of this data?
For markdown documentation, it's the section under a heading. That's a discrete concept, authored that way intentionally.
For PDFs extracted to Markdown, it's the paragraph. The extraction tool already produces those boundaries. The chunker just has to respect them.
For code, it's the function or class. A function is the smallest unit of behavior that makes sense alone. Split it and the chunk becomes meaningless without the surrounding context.
Text splitters, recursive or sliding window, don't know any of this. They operate on character counts, token counts, or blank lines. None of those correspond to semantic boundaries in code. That's the root cause of RecursiveChar's 0.5690 Context Precision. It wasn't a hyperparameter problem. It was a conceptual mismatch.
There's also a second effect worth naming: chunk count matters. HAC's 127 chunks versus SlidingWindow's 259 on the same corpus is not a coincidence. Fewer chunks mean fewer candidates for noise to enter the retrieval pool. The embedding space is less diluted, and rank 1 is cleaner.
What I learned
- The optimal chunker is determined by the data type, not by chunk size or overlap settings
- RecursiveChar's blank-line heuristic is a real liability for code; 0.5690 Context Precision proves it
- Smaller average chunks (120 tokens) outperformed larger ones (457 tokens) on code by a significant margin; chunk size is a symptom, not a cause
- Visual inspection of actual chunks before running RAGAS catches structural bugs that aggregate scores smooth over; I caught CBAC producing 8 KB chunks on Go files before the experiment ran
- Freezing the eval set before the first experiment is non-negotiable, because regenerating it mid-experiment would invalidate every comparison
The practical takeaway
There is no universal best chunker
For markdown documentation: split at heading boundaries
For PDFs: convert to Markdown first, then split at paragraph boundaries
For code: use an AST parser
A generic 512-token splitter will technically work on all three. It will not be optimal on any of them. And on code specifically, the degradation is not marginal; it's a near-halving of retrieval precision.
Pick the chunker that matches the semantic structure of the data, not the one that's easiest to configure.
The harder version of this problem is mixed content: a PDF with embedded code blocks, or a GitHub repo where half the files are Python and half are Markdown. Each file type still needs its own chunking strategy, which means the chunker has to detect content type at the file level and route accordingly (see the sketch below). That's what the connector layer in this project handles, but it's a separate problem worth its own post.
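As a sketch of that routing layer (extension-based detection is the simplest version; the chunker arguments are hypothetical and stand in for whatever per-type chunkers you use, like the sketches above):

```python
from pathlib import Path
from typing import Callable

ChunkFn = Callable[[str], list[str]]

def route_chunker(path: str,
                  markdown_fn: ChunkFn,
                  code_fn: ChunkFn,
                  default_fn: ChunkFn) -> ChunkFn:
    """Pick a chunking strategy from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".md", ".mdx"}:
        return markdown_fn   # heading-aware splitting for docs
    if suffix == ".py":
        return code_fn       # AST-aware splitting for code
    return default_fn        # recursive fallback for everything else
```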
I'm building a production RAG system that ingests multiple source types with per-source-type chunking strategies. Future posts cover the reranker experiments, eval methodology, and the CI pipeline I built around RAGAS scores.

Top comments (15)
This is one of the most practical RAG posts I've read in a while. The fact that RecursiveChar went from winner on PDFs to near-last on code (0.5690 Context Precision) really drives home the point — there's no universal chunker. We hit the same wall building our RAG pipeline: markdown docs need semantic splits, code needs AST-aware chunking. The insight about freezing eval sets before starting experiments is underrated advice. Looking forward to the reranker post!
Thanks, glad it landed that way. The RecursiveChar result on code surprised me too, because blank-line splitting feels reasonable until you look at real Python files. Blank lines live inside functions just as much as between them: between docstring and body, between guard clauses and main logic. The chunker can't tell the difference, so it bundles unrelated functions together. The 0.5690 precision number is just that bundling showing up at eval time.
Your point about per-content-type strategies is exactly right: one chunker can't win across all three. The frozen eval set lesson is one I wish I'd seen written down before I started; I lost an early round of experiments to eval drift before locking it down.
Reranker post is next: CBAC + full reranker hit RAGAS SUM 3.7079, the highest across every experiment.
The blank-line problem is such a good illustration of chunkers making structural assumptions that only hold in textbooks. Real codebases are noisier. The 0.5690 number is almost predictable once you know the cause.
The frozen eval set discipline is underrated — I've seen the same issue in financial data pipelines where the "ground truth" set quietly drifted as upstream schemas changed. The lesson transfers directly.
Really looking forward to the reranker post. RAGAS SUM 3.7079 is a meaningful jump — curious whether CBAC was doing most of the lifting or whether the full reranker pass was necessary. Did you run ablations separating the two contributions?
Yes, I did run ablations: CBAC alone hit 3.5680, and the full reranker pipeline pushed it to 3.7079. So CBAC was doing most of the heavy lifting on precision; the reranker added on top mostly by cleaning up ranking order. Both contributed, but they solve different things.
CBAC fixes what goes into the pool; the reranker fixes the order you surface them in. I'll break down the exact numbers in the reranker post so you can see where each stage is adding value.
Reranker post is live: dev.to/ayanarshad02/i-increased-re...
The precision/recall tradeoff strikes again. Counterintuitively, giving the LLM more context to work with often makes things worse — extra candidates introduce noise that dilutes the signal even when a reranker is involved.
Hit the same pattern with financial filings at 13F Insight: jumping from top-5 to top-20 for amendment lookups improved chunk recall but the generated summaries got messier. The fix that worked was a two-stage approach — broader retrieval (top-20) to maximize recall, reranker collapses back to top-5 before generation. The reranker acts as a filter, not just a sorter.
Curious: did you experiment with dynamic top-k based on reranker score distribution (e.g., cut at score drop-off threshold) rather than a fixed ceiling? That would let the context window shrink naturally when the relevant content is tightly clustered.
Isn't code best chunked using an AST? Identify functions, methods, and classes, summarise the affordances, then use that as the vector?
Yes, that's exactly what CodeBlockAwareChunker does here. It uses Python's ast module to find the exact boundaries of every function and class, then extracts each one as a separate chunk. So the vector represents one complete function, not a random token window that cuts mid-body. The summarise-then-embed approach is interesting too: embedding a generated summary of what the function does rather than the raw code. That could help with natural-language queries, but I haven't tested it yet; worth an experiment.
Excellent experiment design. The frozen eval set discipline and the single-variable-per-experiment rigour are what make the results actually comparable. That discipline is rarer than it should be in RAG benchmarks.
The CodeBlockAwareChunker result on code confirms something I have run into empirically: the blank-line heuristic in RecursiveChar is a real liability on real codebases, not just a theoretical one. Real Python files have blank lines inside functions everywhere, between docstring and body, between a guard clause and main logic, between two logical sections in a long method. The chunker cannot distinguish those from function boundaries. Your 0.5690 Context Precision number is just that bundling showing up at eval time.
One dimension worth adding to the semantic unit question: the optimal chunk boundary also depends on what the retrieval query is trying to answer, not just on the data type.
For a developer Q&A system asking "what does Client.send() return," function-level chunks are the right unit. But if the downstream analysis needs to answer "does this function validate input before passing data to an external API," you often need the full function including its error handling path and return statements — which means a chunk that splits at line 256 mid-function actively breaks the reasoning chain, regardless of precision.
The domain of the question changes what completeness means for a retrieved chunk. A function chunk that is complete for a Q&A query may be incomplete for a security analysis query that needs the surrounding context to evaluate whether a guard clause actually executes before the sensitive operation.
On your "harder version", the mixed content problem in polyglot repos: tree-sitter is worth evaluating as a unified approach. It supports 100+ languages with a consistent API, which means you can implement AST-aware chunking once and route by language at the file level rather than maintaining separate per-language parsers. The routing logic is the same problem either way, but the implementation surface is much smaller.
Looking forward to the reranker post. The CBAC ablation you shared, 3.5680 without reranker, 3.7079 with, suggests the chunker is doing most of the precision work and the reranker is cleaning up ranking order. That decomposition is useful data.
Thanks @intentguard_ole, the frozen eval set discipline came from a painful early lesson. I let the eval set drift in an early round and the numbers became meaningless. Once I locked it, everything became comparable.
Your point about query-dependent completeness is something I hadn't framed that clearly, but it's exactly right. Function-level chunks work well for "what does this return" queries, but a security analysis asking "does this validate input before hitting the external API" needs the guard clause, the main logic, and the return path together. A chunk that's complete for one query type is incomplete for another. That's a dimension I want to explore: whether the chunking strategy should be aware of the downstream query domain, not just the data type.
On tree-sitter, that's a really useful suggestion. Right now CodeBlockAwareChunker uses Python's ast module, which obviously only covers Python files. Tree-sitter with language detection at the file level would give AST-aware chunking across the whole repo with one implementation. That's the cleaner path for polyglot repos, and I'll be testing it next.
On the CBAC ablation, yes, that decomposition is exactly what the reranker post will break down. The chunker is doing the heavy lifting on precision, the reranker is mostly cleaning up ordering. Both matter but they're solving different problems at different stages.
Reranker article is live: dev.to/ayanarshad02/i-increased-re...
0.5690 Context Precision on code with RecursiveChar is honestly worse than random for some queries. Half the retrieved chunks being irrelevant means every other search pulls garbage. Thanks for running the numbers - now I can point to this when someone says just use a 512 token splitter for everything.
Yeah, 0.5690 is bad: roughly 43% of retrieved chunks are irrelevant, which means you're paying token cost to feed garbage into the LLM on nearly half your queries. The 512-token default gets away with it on docs and PDFs, which is why nobody catches it until they actually test on code. Glad the numbers give you something concrete to point to!
I was chunking documents the same way you were, and later realized it was the biggest mistake of my life. So I decided to write up a blog post about it.
why-chunking-is-the-biggest-mistak...
Haha, yeah, it's one of those things that looks like a solved problem until you actually measure it. Would love to read your post; always good to have more real experiment data out there on this.