I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.

Md Ayan Arshad on May 04, 2026

I assumed chunking was a solved problem. Pick a text splitter, set 512 tokens, add some overlap, move on. After running structured experiments acro...
 
Vic Chen

This is one of the most practical RAG posts I've read in a while. The fact that RecursiveChar went from winner on PDFs to near-last on code (0.5690 Context Precision) really drives home the point — there's no universal chunker. We hit the same wall building our RAG pipeline: markdown docs need semantic splits, code needs AST-aware chunking. The insight about freezing eval sets before starting experiments is underrated advice. Looking forward to the reranker post!

Md Ayan Arshad

Thanks, glad it landed that way. The RecursiveChar result on code surprised me too, because blank-line splitting feels reasonable until you look at real Python files. Blank lines live inside functions just as much as between them: between a docstring and the body, between guard clauses and the main logic. The chunker can't tell the difference, so it bundles unrelated functions together. The 0.5690 precision number is just that bundling showing up at eval time.
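
To make that concrete, here's a toy version of what a blank-line splitter sees (plain Python for illustration, not the actual RecursiveChar implementation):

```python
# Toy illustration: split a small Python file on blank lines ("\n\n"),
# the same heuristic RecursiveChar-style splitters lean on first.
source = '''\
def send(self, payload):
    """Send payload to the API."""

    if not payload:
        return None

    return self._post(payload)

def close(self):
    self._session.close()
'''

# Three blank lines, but only the last one is an actual function boundary.
# A blank-line splitter can't tell them apart, so pieces of send() and
# close() get grouped or cut arbitrarily once chunk_size kicks in.
for i, piece in enumerate(source.split("\n\n")):
    print(f"--- piece {i} ---\n{piece}")
```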

Your point about per-content-type strategies is exactly right: one chunker can't win across all three. The frozen eval set lesson is one I wish I'd seen written down before I started; I lost an early round of experiments to eval drift before locking it down.

The reranker post is next. CBAC + full reranker hit RAGAS SUM 3.7079, the highest across every experiment.

Vic Chen

The blank-line problem is such a good illustration of chunkers making structural assumptions that only hold in textbooks. Real codebases are noisier. The 0.5690 number is almost predictable once you know the cause.

The frozen eval set discipline is underrated — I've seen the same issue in financial data pipelines where the "ground truth" set quietly drifted as upstream schemas changed. The lesson transfers directly.

Really looking forward to the reranker post. RAGAS SUM 3.7079 is a meaningful jump — curious whether CBAC was doing most of the lifting or whether the full reranker pass was necessary. Did you run ablations separating the two contributions?

Md Ayan Arshad

Yes, I did run ablations: CBAC alone hit 3.5680, and the full reranker pipeline pushed it to 3.7079. So CBAC was doing most of the heavy lifting on precision; the reranker added on top mostly by cleaning up ranking order. Both contributed, but they solve different things.

CBAC fixes what goes into the pool; the reranker fixes the order you surface them in. I'll break down the exact numbers in the reranker post so you can see where each stage is adding value.

Md Ayan Arshad

Reranker post is live: dev.to/ayanarshad02/i-increased-re...

Vic Chen

The precision/recall tradeoff strikes again. Counterintuitively, giving the LLM more context to work with often makes things worse — extra candidates introduce noise that dilutes the signal even when a reranker is involved.

Hit the same pattern with financial filings at 13F Insight: jumping from top-5 to top-20 for amendment lookups improved chunk recall but the generated summaries got messier. The fix that worked was a two-stage approach — broader retrieval (top-20) to maximize recall, reranker collapses back to top-5 before generation. The reranker acts as a filter, not just a sorter.

Curious: did you experiment with dynamic top-k based on reranker score distribution (e.g., cut at score drop-off threshold) rather than a fixed ceiling? That would let the context window shrink naturally when the relevant content is tightly clustered.
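
Something like this rough sketch is what I mean (hypothetical, not your pipeline; assumes the reranked candidates are already sorted by score descending, with scores roughly in the 0-1 range):

```python
def cut_at_dropoff(reranked, max_k=20, min_k=3, gap_ratio=0.5):
    """Keep reranked (chunk, score) pairs until the score falls off a cliff.

    Cuts at the first adjacent pair where the lower score drops below
    gap_ratio * the score above it; always keeps at least min_k and
    never more than max_k.
    """
    kept = list(reranked[:min_k])
    for prev, cur in zip(reranked[min_k - 1:], reranked[min_k:]):
        if len(kept) >= max_k or cur[1] < prev[1] * gap_ratio:
            break
        kept.append(cur)
    return kept

# e.g. scores [0.92, 0.90, 0.87, 0.41, 0.39] -> keeps three chunks, because
# 0.41 < 0.87 * 0.5; with [0.92, 0.90, 0.87, 0.80, 0.39] it keeps four.
```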

Mike Talbot ⭐

Isn't code best chunked using an AST? Identify functions, methods, classes, summarise the affordances, then use that as the vector?

Md Ayan Arshad

Yes, that's exactly what CodeBlockAwareChunker does here. It uses Python's ast module to find the exact boundaries of every function and class, then extracts each one as a separate chunk. So the vector represents one complete function, not a random token window that cuts mid-body. The summarise-then-embed approach is interesting too: embedding a generated summary of what the function does rather than the raw code. That could help with natural-language queries, but I haven't tested it yet; worth an experiment.
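
The core of the extraction step looks roughly like this (simplified sketch, not the exact CodeBlockAwareChunker code; a real version would also pull in decorator lines and handle module-level statements):

```python
import ast

def extract_code_blocks(source: str):
    """Return each top-level function and class as its own chunk,
    using the parsed AST for exact boundaries (needs Python 3.8+ for end_lineno)."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno / end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```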

Olebeng

Excellent experiment design. The frozen eval set discipline and the single-variable-per-experiment rigour are what make the results actually comparable. That discipline is rarer than it should be in RAG benchmarks.

The CodeBlockAwareChunker result on code confirms something I have run into empirically: the blank-line heuristic in RecursiveChar is a real liability on real codebases, not just a theoretical one. Real Python files have blank lines inside functions everywhere, between docstring and body, between a guard clause and main logic, between two logical sections in a long method. The chunker cannot distinguish those from function boundaries. Your 0.5690 Context Precision number is just that bundling showing up at eval time.

One dimension worth adding to the semantic unit question: the optimal chunk boundary also depends on what the retrieval query is trying to answer, not just on the data type.

For a developer Q&A system asking "what does Client.send() return," function-level chunks are the right unit. But if the downstream analysis needs to answer "does this function validate input before passing data to an external API," you often need the full function including its error handling path and return statements — which means a chunk that splits at line 256 mid-function actively breaks the reasoning chain, regardless of precision.

The domain of the question changes what completeness means for a retrieved chunk. A function chunk that is complete for a Q&A query may be incomplete for a security analysis query that needs the surrounding context to evaluate whether a guard clause actually executes before the sensitive operation.

On your "harder version", the mixed content problem in polyglot repos: tree-sitter is worth evaluating as a unified approach. It supports 100+ languages with a consistent API, which means you can implement AST-aware chunking once and route by language at the file level rather than maintaining separate per-language parsers. The routing logic is the same problem either way, but the implementation surface is much smaller.

Looking forward to the reranker post. The CBAC ablation you shared, 3.5680 without reranker, 3.7079 with, suggests the chunker is doing most of the precision work and the reranker is cleaning up ranking order. That decomposition is useful data.

Md Ayan Arshad

Thanks @intentguard_ole, the frozen eval set discipline came from a painful early lesson. I let the eval set drift in an early round and the numbers became meaningless. Once I locked it, everything became comparable.

Your point about query-dependent completeness is something I hadn't framed that clearly, but it's exactly right. Function-level chunks work well for "what does this return" queries, but a security analysis asking "does this validate input before hitting the external API" needs the guard clause, the main logic, and the return path together. A chunk that's complete for one query type is incomplete for another. That's a dimension I want to explore: whether the chunking strategy should be aware of the downstream query domain, not just the data type.

On tree-sitter, that's a really useful suggestion. Right now CodeBlockAwareChunker uses Python's ast module, which obviously only covers Python files. Tree-sitter with language detection at the file level would give AST-aware chunking across the whole repo with one implementation. That's the cleaner path for polyglot repos, and I'll be testing it next.
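
Rough shape of what I'm planning to try, untested, and assuming the tree_sitter_languages helper package (node type names differ per grammar, so the set below is only indicative):

```python
from pathlib import Path
from tree_sitter_languages import get_parser  # bundles prebuilt grammars

EXT_TO_LANG = {".py": "python", ".js": "javascript", ".go": "go"}
TOP_LEVEL_TYPES = {
    "function_definition", "class_definition",    # python grammar
    "function_declaration", "method_definition",  # js / go grammars
}

def chunk_code_file(path: str):
    lang = EXT_TO_LANG.get(Path(path).suffix)
    if lang is None:
        return []  # fall through to the plain-text chunker instead
    source = Path(path).read_bytes()
    tree = get_parser(lang).parse(source)
    return [
        source[node.start_byte:node.end_byte].decode("utf-8", errors="replace")
        for node in tree.root_node.children
        if node.type in TOP_LEVEL_TYPES
    ]
```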

On the CBAC ablation, yes, that decomposition is exactly what the reranker post will break down. The chunker is doing the heavy lifting on precision; the reranker is mostly cleaning up ordering. Both matter, but they're solving different problems at different stages.

Md Ayan Arshad

Reranker article is live: dev.to/ayanarshad02/i-increased-re...

Daniel Visovsky

0.5690 Context Precision on code with RecursiveChar is honestly worse than random for some queries. Half the retrieved chunks being irrelevant means every other search pulls garbage. Thanks for running the numbers - now I can point to this when someone says just use a 512 token splitter for everything.

Md Ayan Arshad

Yeah, 0.5690 is bad: about 43% of retrieved chunks are irrelevant, which means you're paying token cost to feed garbage into the LLM on nearly half your queries. The 512-token default gets away with it on docs and PDFs, which is why nobody catches it until they actually test on code. Glad the numbers give you something concrete to point to!!

Ranjan Dailata

I was chunking documents the same way you describe and later realized it was the biggest mistake of my life. Hence, I decided to write up a blog post about it.

why-chunking-is-the-biggest-mistak...

Md Ayan Arshad

Haha yeah, it's one of those things that looks like a solved problem until you actually measure it. Would love to read your post; always good to have more real experiment data out there on this.