Chunking Strategies

Chunking is the process of segmenting text into smaller, manageable portions based on length, structure or semantic meaning. It lets vector search match precise passages rather than entire documents. Understanding the different chunking methods helps improve retrieval accuracy and model performance in Retrieval-Augmented Generation (RAG) pipelines.

Need for Chunking

LLM Token Limitations: Long documents may exceed a model's token limit, so processing them directly is inefficient or requires techniques like truncation or sliding windows.
Improved Retrieval Accuracy: Smaller segments allow retrieval pipelines to match context more precisely.
Better Performance: Chunking reduces computation overhead and speeds up embedding searches.
Context Preservation: Keeps relevant text together, reducing hallucinations and incorrect reasoning.
Efficient Knowledge Access: Enables document querying without loading entire files into memory.

1. Fixed-Size Chunking: Splits text into equal-sized segments based on characters or tokens.

Simple and easy to implement
Works well for plain text
May break sentences or context
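
A minimal sketch of fixed-size chunking by characters, written in plain Python rather than with any particular library. The function name `fixed_size_chunks` and the sample sentence are illustrative assumptions. It also shows the drawback listed above: the cut falls wherever the character count says, even mid-word.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Machine learning is a branch of artificial intelligence."
pieces = fixed_size_chunks(sample, chunk_size=20)
for piece in pieces:
    print(repr(piece))
```

Joining the pieces back together reproduces the original text, since nothing is dropped or repeated.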

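2. Recursive Character Splitter: Splits text using an ordered list of separators (for example paragraphs, then lines, then words), falling back to the next separator only when a piece is still too large.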

Maintains sentence flow
Avoids abrupt splits
Produces more readable chunks
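
The fallback idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, the separator list is an assumed default, and overlap is left out to keep the logic visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in (p for p in text.split(sep) if p):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


long_paragraph = "Para two is a bit longer and will need to be split on spaces."
doc = "Para one is short.\n\n" + long_paragraph
chunks = recursive_split(doc, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraph survives intact, while the long one is broken at word boundaries instead of mid-word.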

3. Token-Based Chunking: Splits text by token count rather than character count, so each chunk fits within the model's token limit.

Prevents token overflow in LLMs
Aligns with model constraints
Reduces truncation issues
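
A rough sketch of the idea follows. As a loudly stated assumption, whitespace-separated words stand in for model tokens here; a real pipeline would count tokens with the model's own tokenizer, and the counts would differ. The function name `token_chunks` is illustrative.

```python
def token_chunks(text: str, max_tokens: int) -> list[str]:
    """Group text into chunks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # ASSUMPTION: a whitespace split approximates tokenization for illustration only.
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


sample = "Supervised learning uses labeled data to train predictive models."
chunks = token_chunks(sample, max_tokens=4)
for chunk in chunks:
    print(repr(chunk))
```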

4. Sentence or Semantic Chunking: Groups text based on meaning or sentence boundaries.

Preserves semantic context
Ideal for descriptive content
Improves retrieval quality
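
The sentence-boundary half of this strategy can be sketched with the standard library alone. The regular expression below is a naive assumption about where sentences end, and `sentence_chunks` is a hypothetical helper name. Semantic chunking proper would additionally compare sentence embeddings, which is not shown here.

```python
import re


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars characters."""
    # Naive boundary rule: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Machine learning is a branch of artificial intelligence focused on "
    "building systems that learn from data. These systems improve their "
    "performance over time without being explicitly programmed. There are "
    "many applications of machine learning, such as image classification, "
    "speech recognition, recommendation systems, and autonomous driving."
)
chunks = sentence_chunks(sample, max_chars=200)
for chunk in chunks:
    print(len(chunk), repr(chunk[:40]))
```

A sentence longer than `max_chars` is kept whole in this sketch, so no chunk ever ends mid-sentence.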

5. Document-Based Chunking: Breaks structured documents into logical sections.

Works well for PDFs and web pages
Maintains document hierarchy
Useful for large datasets
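
One way to picture document-based chunking is to split on the document's own headings. The sketch below assumes Markdown-style headings that start with `#`, which is an illustrative choice; PDFs and web pages would need their own parsers. The function name `split_by_headings` is hypothetical.

```python
def split_by_headings(document: str) -> list[dict[str, str]]:
    """Return one section per heading, keeping the heading alongside its text."""
    sections = []
    heading, body = "", []
    for line in document.splitlines():
        if line.startswith("#"):
            # A new heading closes the section collected so far.
            sections.append({"heading": heading, "text": "\n".join(body).strip()})
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    sections.append({"heading": heading, "text": "\n".join(body).strip()})
    # Drop empty placeholders such as a blank preamble before the first heading.
    return [s for s in sections if s["heading"] or s["text"]]


doc = (
    "# Intro\n"
    "Chunking splits text.\n"
    "\n"
    "# Overlap\n"
    "Overlap repeats text.\n"
    "It keeps context."
)
sections = split_by_headings(doc)
for section in sections:
    print(section["heading"], "->", repr(section["text"]))
```

Keeping the heading with each section preserves a piece of the document hierarchy that can be stored as metadata next to the chunk.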

Chunk Overlap

Chunk overlap refers to the technique of including a small portion of text from the end of one chunk at the beginning of the next chunk. This helps maintain continuity between chunks and prevents important information from being lost when text is split. It is especially useful when sentences or ideas span across multiple chunks.

Maintains Context Flow: Overlapping small portions of text ensures that important information crossing chunk boundaries is preserved.
Reduces Context Loss: When a sentence spans two chunks, overlaps prevent missing meaning during retrieval.
Improves Answer Accuracy: Retrieval models gain continuity, leading to clearer and more complete responses.
Better Semantic Understanding: Overlaps enhance embeddings by preserving transitional phrases and linked ideas.
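
The mechanics of overlap can be sketched as a sliding window whose stride is the chunk size minus the overlap. This is a character-based illustration under assumed names (`chunks_with_overlap`), using the alphabet as sample text so the repeated characters are easy to spot.

```python
def chunks_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, repeating `overlap` characters each step."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunks_with_overlap(text, chunk_size=10, overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last three characters of the previous one, so content near a boundary appears in both neighbours.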

Selecting Chunk Sizes

Choosing the right chunk size depends on the type of document and the use case. If chunks are too large, the model may receive unnecessary data. If chunks are too small, they may lose essential meaning. Some commonly suggested chunk sizes when working with LangChain are:

300–500 Tokens: Useful for most general documents where moderate context is needed.
600–900 Tokens: Ideal for technical guides and manuals requiring deeper reference context.
100–200 Tokens: Effective for short chats, logs or small knowledge fragments.
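
To get a feel for how chunk size affects the number of chunks to embed and store, the count can be estimated with simple arithmetic. The sketch assumes a plain sliding window with a fixed overlap, so real splitters that respect separators may produce a different count. The function name, the 10,000-token document and the overlap values are hypothetical.

```python
def estimate_chunk_count(total_tokens: int, chunk_size: int, overlap: int = 0) -> int:
    """Estimate how many sliding-window chunks cover a document of total_tokens tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    # k chunks cover chunk_size + (k - 1) * step tokens; solve for the smallest such k.
    return (total_tokens - overlap + step - 1) // step


# Hypothetical 10,000-token document at three chunk sizes taken from the ranges above.
for size, overlap in [(150, 20), (400, 50), (800, 100)]:
    print(size, overlap, estimate_chunk_count(10_000, size, overlap))
```

Smaller chunks multiply the number of vectors to store and search, which is one side of the trade-off described above.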

Implementation of Chunking Strategies

Step 1: Install Required Libraries

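Install LangChain for the chunking utilities. The token-based example below also needs tiktoken, and the semantic example needs langchain-experimental plus access to an embedding model.

Python

!pip install langchain langchain-experimental tiktoken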

Step 2: Load the Document

Read the input text file into a single string.

Python

with open("sample_doc.txt", "r", encoding="utf-8") as f:
    text = f.read()

1. Fixed-Size Chunking

Python

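# Note: in newer LangChain releases this class is imported from langchain_text_splitters.
from langchain.text_splitter import CharacterTextSplitter

# CharacterTextSplitter first splits on a separator ("\n\n" by default) and then
# merges the pieces into chunks of up to chunk_size characters. A single paragraph
# longer than chunk_size is kept whole, so a chunk can exceed 300 characters.
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_text(text)

print("Fixed Size Chunks:", len(chunks))  # number of chunks produced
print(chunks[0])  # first chunk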

Output:

Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.

2. Recursive Character Chunking

Python

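from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries the separators "\n\n", "\n", " " and "" in order, moving to the next one
# only when a piece is still longer than chunk_size (measured in characters).
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
chunks = splitter.split_text(text)

print("Recursive Chunks:", len(chunks))  # number of chunks produced
print(chunks[0])  # first chunk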

Output:

Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.

3. Token-Based Chunking

Python

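from langchain.text_splitter import TokenTextSplitter

# chunk_size and chunk_overlap are measured in tokens, not characters.
# TokenTextSplitter relies on the tiktoken package for tokenization.
splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=32)
chunks = splitter.split_text(text)

print("Token-Based Chunks:", len(chunks))  # number of chunks produced
print(chunks[0])  # first chunk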

Output:

Token-Based Chunks: 1

Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.

Supervised learning uses labeled data to train predictive models. It is commonly used for tasks like spam detection and sentiment analysis. Unsupervised learning, on the other hand, discovers hidden patterns in unlabeled data, such as customer clustering or anomaly detection.

Reinforcement learning involves agents making decisions by interacting with an environment. They receive rewards or penalties based on their actions and learn optimal behaviors through continuous feedback. This approach is widely used in robotics, game playing, and resource optimization.

Although machine learning is powerful, it also comes with challenges such as data bias, overfitting, model interpretability, and computational complexity. It is important to choose the right algorithms, preprocess data correctly, and validate models properly to ensure reliable results.

Because the entire sample document is shorter than 256 tokens, the splitter returns it as a single chunk.

4. Sentence / Semantic Chunking

Python

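from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

# OpenAIEmbeddings calls the OpenAI API and expects the OPENAI_API_KEY environment variable.
embeddings = OpenAIEmbeddings()
# SemanticChunker starts a new chunk where the embeddings of neighbouring sentences diverge.
splitter = SemanticChunker(embeddings)

chunks = splitter.split_text(text)

print("Total Chunks Created:", len(chunks))
print(chunks[0])  # first chunk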

Output:

Total Chunks Created: 4
Machine learning is a branch of artificial intelligence focused on building systems that learn from data. These systems improve their performance over time without being explicitly programmed. There are many applications of machine learning, such as image classification, speech recognition, recommendation systems, and autonomous driving.

Note: Semantic chunking depends on an embedding model. The OpenAIEmbeddings class used here calls an external API and needs an API key, so the example cannot be run without one.

5. Document-Based Chunking

Python

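from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the file as a list of Document objects (text plus metadata such as the source path).
loader = TextLoader("sample_doc.txt")
documents = loader.load()

# split_documents returns Document chunks and copies each document's metadata to its chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

print("Document Chunks:", len(chunks))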

Output:

Document Chunks: 4

Applications

Question Answering: Chunking ensures that only the most relevant text segments are passed to the model, resulting in accurate and context-aware answers.
Document Summarization: Long reports and research papers can be divided into sections, allowing LLMs to condense information more effectively.
Semantic Search: By chunking text into context-rich pieces, search engines retrieve more precise and meaningful results rather than broad document matches.
Chatbots: Segmented knowledge bases provide chatbots with localized context, improving reply quality and reducing hallucinations.
Knowledge Graphs: Chunked text can be transformed into nodes and edges, enabling reasoning across distributed concepts and relationships.
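
The question-answering case can be illustrated with a toy retrieval step that picks the chunk most relevant to a question. As a clearly labeled simplification, relevance here is a count of shared words; a real pipeline would compare embedding vectors instead. The helper names and the sample chunks are illustrative.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    query = words(question)
    # ASSUMPTION: word overlap stands in for embedding similarity.
    ranked = sorted(chunks, key=lambda c: len(query & words(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Supervised learning uses labeled data to train predictive models.",
    "Reinforcement learning involves agents making decisions by interacting with an environment.",
    "Machine learning comes with challenges such as data bias and overfitting.",
]
best = top_chunks("How do agents learn in reinforcement learning?", chunks, k=1)
print(best[0])
```

Only the selected chunk would then be passed to the model as context, rather than the whole document.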
