Chunking for RAG


Chunking is the process of breaking long documents into smaller, self-contained sections before they are converted into embeddings and stored in a vector database. For Retrieval-Augmented Generation (RAG), chunking is the preparatory step that largely determines answer quality and retrieval hit rates. Poor chunking leads to hallucinations or incomplete answers, while good chunking forms the foundation of a robust knowledge base for phonebots and chatbots, regardless of whether the content comes from FAQs, manuals, or contract documents.

 

Why Chunking Matters

An LLM answers a question based on the context provided in the prompt. In RAG, this context is constructed dynamically from the document sections retrieved as relevant. If the sections are too long, they consume context window space unnecessarily and carry irrelevant information. If they are too short, the semantic context is missing. Good chunking strikes a balance: each chunk is complete in substance yet small enough to be technically efficient.

 

Common Chunking Strategies

  • Fixed-Size Chunking: Text is divided into chunks of a fixed size, often with overlap. Easy to implement, but blind to meaning, so a boundary can fall in the middle of a sentence or topic.
  • Semantic Chunking: Boundaries are placed at semantic breaks, such as paragraphs, chapter headings, or changes in topic.
  • Hierarchical Chunking: Documents are broken down into multiple levels (broad section chunks and finer sub-chunks) that are linked to one another for context.
  • Format-Aware Chunking: The structure of tables, lists, and Markdown is taken into account when setting boundaries.
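
As a rough illustration of the first two strategies, the following Python sketch shows fixed-size chunking with overlap and a simple heading-based split for Markdown. The function names `fixed_size_chunks` and `heading_chunks` are hypothetical, and whitespace-separated words stand in for tokens, which is an assumption. A production system would count tokens with the tokenizer of its embedding model.

```python
import re


def fixed_size_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words.

    Words are only a rough proxy for tokens (assumption for this sketch).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def heading_chunks(markdown: str) -> list[str]:
    """Split a Markdown document before each heading so a chunk keeps its heading."""
    # Zero-width split: break before every line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]
```

The fixed-size variant ignores meaning entirely, while the heading-based variant follows the document's own structure. In practice the two are often combined, for example by splitting at headings first and applying the fixed-size window only to sections that are still too long.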

 

Chunking, Reranking, and Knowledge AI

Chunking is just the first step. It is followed by embedding, vector search, and often a re-ranking step that sorts the top results by relevance a second time. Only the combination of these steps produces an efficient knowledge AI that enables voicebots and chatbots to give factually accurate responses.
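
The sequence of embedding, vector search, and re-ranking can be sketched in plain Python as follows. Everything here is a toy stand-in chosen for illustration: `embed` is a bag-of-words counter rather than a real embedding model, `vector_search` scans a list instead of querying a vector database, and `rerank` uses simple query-term coverage where a real system would typically use a dedicated re-ranking model. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def vector_search(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the query (linear scan, no index)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: list[str]) -> list[str]:
    """Toy re-ranker: sort candidates by the share of query terms they contain."""
    q_terms = set(embed(query))

    def coverage(chunk: str) -> float:
        return len(q_terms & set(embed(chunk))) / len(q_terms) if q_terms else 0.0

    return sorted(candidates, key=coverage, reverse=True)


chunks = [
    "Refunds are issued within 14 days of cancellation.",
    "Our office is open Monday to Friday.",
    "To cancel a contract, send written notice; refunds follow.",
]
query = "how do I cancel and get refunds"
context = rerank(query, vector_search(query, chunks, top_k=2))
```

The point of the two-stage design is that the first stage is cheap and broad, while the second stage can afford a more careful comparison because it only sees a handful of candidates.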

 

Practical Tips for Stable Chunks

In practice, a balanced mix works best. Experience shows that Markdown-optimized content with clear headings, organized into section chunks of a few hundred tokens with moderate overlap, provides the best balance between precision and completeness. Tables should be treated as atomic units, while legal texts benefit from chunking by paragraph. Iterative tuning is important, accompanied by hard evaluation metrics such as hit rate, NDCG, and response quality.
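
Two of the metrics mentioned above, hit rate and NDCG, can be computed with a few lines of Python. This is a minimal sketch under stated assumptions: chunks are identified by string IDs, relevance labels come from a hand-built evaluation set, and the function names `hit_rate` and `ndcg_at_k` are hypothetical. The NDCG variant shown uses the common linear-gain form with a log2 position discount.

```python
import math


def hit_rate(retrieved: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant chunk ID among the top k results."""
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(retrieved) if retrieved else 0.0


def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int = 5) -> float:
    """NDCG@k for one query.

    `gains` maps chunk ID to graded relevance; IDs missing from it count as 0.
    """
    dcg = sum(
        gains.get(cid, 0.0) / math.log2(rank + 2)
        for rank, cid in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Running both metrics on the same fixed set of test questions before and after each change to chunk size or overlap makes the iterative tuning comparable from one round to the next.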

 

 

Frequently Asked Questions (FAQ)

There is no one-size-fits-all chunk size. A good starting point is chunks of a few hundred tokens with some overlap. From there, iterating based on actual search quality is crucial.

With poor chunking, responses lose accuracy, RAG hits become unreliable, and the risk of hallucinations increases noticeably.

Each chunk is converted into an embedding and stored in a vector database. The quality of the chunk therefore directly determines the informative value of the embeddings.
