Chunking Strategies That Actually Work
Fixed-size vs semantic chunking, overlap windows, and when to use each one.
Chunking is the process of breaking a single long document into smaller segments before embedding or indexing. If chunks are too small, they lack the surrounding context the LLM needs to synthesize an answer. If they are too large, they dilute the semantic vector embedding with irrelevant noise and consume too many tokens in the prompt.
Imagine cutting a two-hour movie into short clips for a trailer. Cut each clip too short (2 seconds) and none of them makes sense on its own, because all context is lost. Cut each clip too long (20 minutes) and you might as well have shown the whole movie: the clip buries the one moment that mattered inside a pile of irrelevant footage. Good trailer editors also overlap their cuts slightly, so a clip doesn't start mid-sentence and lose the audience. Chunking a document works the same way. The chunk size is how long each clip is. The overlap is the few extra seconds of footage editors leave on both ends so no cut lands mid-thought.
Retrieval-Augmented Generation gives a language model access to knowledge it was never trained on. Instead of trusting the model's frozen memory, the system retrieves relevant text at query time and hands it to the model as evidence. This only works if the source documents are split into good chunks first. Cut a chunk too small and it loses the surrounding context a reader would need to make sense of it. Cut it too large and the one useful sentence gets buried inside a wall of unrelated text, diluting the embedding and wasting tokens. A small overlap between consecutive chunks acts like a safety net, making sure an idea that spans a chunk boundary is not sliced clean in half. Getting chunk size and overlap right is one of the highest-leverage tuning decisions in a RAG pipeline, often mattering more than which vector database you pick.
Advanced chunking techniques include layout-aware splitting (splitting at natural markdown headers or paragraph boundaries), semantic splitting (evaluating cosine distance between the embeddings of adjacent sentences to find topic boundaries), and parent-child retrieval (indexing tiny chunks for precise search, but returning a larger parent window to the LLM).
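To make parent-child retrieval concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than part of any particular library: the names `ChildChunk`, `build_parent_child_index`, and `retrieve_parent` are hypothetical, the sizes are arbitrary, and a toy word-overlap score stands in for the vector search a real pipeline would use.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str       # small span that would be embedded and searched
    parent_id: int  # index of the larger window it was cut from


def build_parent_child_index(
    text: str, parent_size: int = 1200, child_size: int = 300
) -> tuple[list[str], list[ChildChunk]]:
    """Cut text into parent windows, then cut each parent into child chunks."""
    parents = [text[i : i + parent_size] for i in range(0, len(text), parent_size)]
    children = []
    for parent_id, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append(ChildChunk(parent[j : j + child_size], parent_id))
    return parents, children


def retrieve_parent(query: str, parents: list[str], children: list[ChildChunk]) -> str:
    """Find the best-matching child, then return the parent window it came from."""
    query_words = set(query.lower().split())

    def score(child: ChildChunk) -> int:
        # Toy relevance (assumption): number of query words found in the child.
        return len(query_words & set(child.text.lower().split()))

    best = max(children, key=score)
    return parents[best.parent_id]


# Example: the match is found in a 50-character child,
# but the 200-character parent window is what gets returned.
doc = ("alpha " * 50) + "the refund window is thirty days " + ("omega " * 50)
parents, children = build_parent_child_index(doc, parent_size=200, child_size=50)
context = retrieve_parent("refund window", parents, children)
```

The design point is the indirection through `parent_id`: search runs over the small chunks, where a match is precise, while the model receives the wider window that surrounds the match.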
* Semantic Vector Bot: Uses a chunk size of `1500` characters with a `150` character overlap. Larger chunks preserve context for semantic embedding models.
* Smart File Cabinet (Enterprise Search): Uses a chunk size of `900` characters with a `100` character overlap. Smaller chunks are highly effective for hybrid search, as they keep BM25 keyword matching dense and targeted.
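def chunk_text(text: str, size: int = 900, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-level chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    chunks, start = [], 0
    step = size - overlap  # how far the window advances each iteration
    while start < len(text):
        chunks.append(text[start : start + size])
        if start + size >= len(text):
            break  # the window reached the end; avoid a redundant tail chunk
        start += step
    # Trim whitespace and drop chunks that were whitespace only.
    return [c.strip() for c in chunks if c.strip()]

Choosing the right strategy is an iterative process that depends on the structure of your documents. Standard character-level splitting with overlap is a solid baseline, but structured production documents benefit from splitting along their markdown structure, such as headers and paragraph boundaries.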
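As a rough illustration of markdown-aware splitting, the sketch below cuts a document at its `#`-style headers so that each section becomes its own chunk. The function name `split_by_headers` and the returned dictionary shape are assumptions made for this example. It also ignores edge cases a real parser would handle, such as `#` characters inside fenced code blocks.

```python
import re


def split_by_headers(markdown: str) -> list[dict]:
    """Split a markdown string into sections, one per '#'-style header."""
    sections = []
    current = {"title": "", "body": []}

    def has_content(section: dict) -> bool:
        # A section is worth keeping if it has a title or any non-blank line.
        return bool(section["title"]) or any(l.strip() for l in section["body"])

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            # A new header closes the section collected so far.
            if has_content(current):
                sections.append(current)
            current = {"title": match.group(2).strip(), "body": []}
        else:
            current["body"].append(line)
    if has_content(current):
        sections.append(current)

    return [
        {"title": s["title"], "text": "\n".join(s["body"]).strip()}
        for s in sections
    ]


sample = "# Setup\nInstall the tool.\n\n## Options\nUse --fast for speed.\n"
sections = split_by_headers(sample)
# sections[0] == {"title": "Setup", "text": "Install the tool."}
# sections[1] == {"title": "Options", "text": "Use --fast for speed."}
```

Sections that are still too long can then be passed through a character-level splitter, so the header boundaries take priority and fixed-size windows act only as a fallback.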