Chunking Strategies for RAG: From Basic to Enriched

When building a Retrieval-Augmented Generation (RAG) system, one of the most consequential, and often underestimated, decisions you'll make is how you split your documents. Chunking is the process of dividing a corpus of text into smaller pieces that can be embedded, indexed, and retrieved efficiently.

A poor chunking strategy produces fragments without enough context to be useful, or chunks so large they dilute the semantic signal. A good one preserves meaning, respects natural boundaries, and, in the most sophisticated variants, enriches each piece with extra context that would otherwise be lost.

This post walks through the full spectrum: from naïve fixed-size splitting to LLM-powered enriched chunks, with code, trade-offs, and guidance on when each approach is the right tool for the job.


1. Basic Chunking

The simplest possible strategy: split text into fixed-size windows of $n$ tokens (or characters), moving a pointer forward by exactly $n$ at each step. There is no overlap, no awareness of sentence or paragraph boundaries, and no context enrichment.

Figure: Fixed-size chunking flow

from langchain.text_splitter import CharacterTextSplitter

def basic_chunk(text: str, chunk_size: int = 512, chunk_overlap: int = 0) -> list[str]:
    """
    Splits text into fixed-size chunks with no overlap.
    chunk_size: number of characters per chunk
    chunk_overlap: 0 = no overlap (pure basic chunking)
    """
    splitter = CharacterTextSplitter(
        separator="",          # split on any character boundary
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    return splitter.split_text(text)


# Example
with open("document.txt") as f:
    raw = f.read()

chunks = basic_chunk(raw, chunk_size=512)
for i, c in enumerate(chunks[:3]):
    print(f"--- Chunk {i} ({len(c)} chars) ---")
    print(c[:120], "...\n")

Pros: trivially simple, deterministic, zero dependencies beyond a text splitter.

Cons: breaks sentences mid-word, destroys syntactic and semantic units, and a chunk at a boundary between two topics will embed poorly, diluting retrieval quality.

The formula for the number of chunks produced is straightforward:

$$N_{chunks} = \left\lceil \frac{L}{n} \right\rceil$$

where $L$ is the document length in tokens/chars and $n$ is the chunk size. No rocket science, but also no grace.
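The count is easy to sanity-check in a few lines. The sketch below is a dependency-free stand-in for the splitter shown earlier (the helper name `fixed_size_chunks` is illustrative, not a LangChain function); it cuts a 1300-character string into 512-character windows and compares the result with the formula.

```python
import math


def fixed_size_chunks(text: str, n: int) -> list[str]:
    """Cut text into consecutive windows of n characters (the last may be shorter)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(0, len(text), n)]


text = "x" * 1300                      # L = 1300 characters
chunks = fixed_size_chunks(text, n=512)

print(len(chunks))                     # 3
print(math.ceil(len(text) / 512))      # 3, matches the formula
print([len(c) for c in chunks])        # [512, 512, 276]
```

Only the last chunk is shorter than $n$, which is why the ceiling appears in the formula.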

When basic chunking makes sense

  • Rapid prototyping or baseline benchmarks where you just need something running.
  • Highly homogeneous corpora (e.g., log lines, structured records) where sentence boundaries are irrelevant.
  • Very small knowledge bases where retrieval quality is less critical.
  • As a stepping stone: measure its baseline recall@k before investing in fancier strategies.

2. Chunking with Overlap

The first and cheapest fix for basic chunking's context-loss problem is to introduce an overlap window: instead of advancing by $n$ tokens after each chunk, advance by $n - k$, keeping the last $k$ tokens of the previous chunk at the start of the next.

The intuition is that a sentence or concept that falls near a chunk boundary will be fully present in at least one of the two adjacent chunks, improving the odds that the retriever finds it.

Figure: Overlap chunking with a sliding window

from langchain.text_splitter import RecursiveCharacterTextSplitter

def overlap_chunk(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 128,
) -> list[str]:
    """
    RecursiveCharacterTextSplitter tries to split on paragraphs, then
    sentences, then words, falling back to characters only if needed.
    chunk_overlap: how many chars to repeat from the previous chunk.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)


chunks = overlap_chunk(raw, chunk_size=512, chunk_overlap=128)
print(f"Total chunks: {len(chunks)}")
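# Rough overlap check: longest suffix of chunk 0 that is also a prefix of chunk 1.
# The splitter only overlaps on separator boundaries, so this may be shorter than chunk_overlap.
a, b = chunks[0], chunks[1]
shared = next((m for m in range(min(len(a), len(b)), 0, -1) if a.endswith(b[:m])), 0)
print(f"Overlapping characters between chunk 0 and 1: {shared}")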

With overlap $k$ and chunk size $n$, the effective stride is $s = n - k$, and the total number of chunks becomes:

$$N_{chunks} = \left\lceil \frac{L - n}{s} \right\rceil + 1 = \left\lceil \frac{L - k}{n - k} \right\rceil$$

A 25% overlap (e.g., $k=128$ on $n=512$) is a common starting point. Going beyond 50% overlap yields diminishing returns and bloats your index significantly.
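To make the stride arithmetic concrete, here is a minimal pure-Python sliding window. It is a sketch, not what `RecursiveCharacterTextSplitter` does internally: it ignores separators and cuts strictly by position. The helper name `sliding_window_chunks` is an assumption for illustration, and the chunk-count formula is checked for the case $L \ge n$.

```python
import math


def sliding_window_chunks(text: str, n: int, k: int) -> list[str]:
    """Windows of n characters, each starting n - k characters after the previous one."""
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < n")
    stride = n - k
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + n])
        if start + n >= len(text):     # this window reached the end of the text
            break
        start += stride
    return chunks


L, n, k = 2000, 512, 128
text = "".join(chr(97 + i % 26) for i in range(L))
chunks = sliding_window_chunks(text, n, k)

expected = math.ceil((L - k) / (n - k))
print(len(chunks), expected)               # 5 5
print(chunks[0][-k:] == chunks[1][:k])     # True: the last k chars are repeated
```

With $n=512$ and $k=128$ the stride is 384, so a 2000-character text needs five windows instead of the four that non-overlapping chunking would produce.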

Basic vs Overlap: quick comparison

| Property | Basic (no overlap) | With overlap |
| --- | --- | --- |
| Context continuity | ❌ Hard breaks | ✅ Bridged by overlap window |
| Index size | ✅ Minimal | ⚠️ Grows by factor n/(n-k) |
| Duplicate content | ✅ None | ⚠️ k tokens repeated |
| Implementation effort | ✅ Trivial | ✅ Trivial |
| Retrieval quality | ⚠️ Baseline | ✅ Noticeably better at boundaries |

When overlap chunking makes sense

  • Any production RAG system that needs more than a prototype-quality baseline.
  • Long-form prose documents (reports, articles, books) where ideas span several paragraphs.
  • When you cannot afford the compute cost of semantic or LLM-based chunking but need some resilience at boundaries.
  • As the default fallback when more sophisticated strategies fail or are too slow.

3. Semantic Chunking

Both of the above approaches are structure-agnostic: they split text based on position, not meaning. Semantic chunking changes the frame entirely: instead of asking "have I reached n tokens?", it asks "does this sentence belong to the same topic as the previous one?"

There are two main flavours worth distinguishing: document-level semantic splitting using metadata (exploiting existing document structure) and embedding-based splitting using cosine distance (purely data-driven). Let's cover both.

3a. Semantic Chunking by Document Structure (Metadata)

Many documents already carry structural metadata: Markdown headings, HTML tags, PDF section headers, RST directives. Rather than ignoring this signal, we can use it directly as natural chunk boundaries.

LangChain's MarkdownHeaderTextSplitter is the canonical implementation for Markdown corpora. It walks the document and emits one chunk per logical section, attaching the heading hierarchy as metadata.

from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# ── Step 1: split on structural boundaries ──────────────────────────────────
headers_to_split_on = [
    ("#",  "H1"),
    ("##", "H2"),
    ("###","H3"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,          # keep heading text inside the chunk
)
# raw_markdown is assumed to hold the Markdown source as a single string
md_chunks = md_splitter.split_text(raw_markdown)

# ── Step 2: further split sections that are still too large ─────────────────
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=128,
)
final_chunks = char_splitter.split_documents(md_chunks)

# ── Step 3: inspect metadata ─────────────────────────────────────────────────
for doc in final_chunks[:3]:
    print("METADATA:", doc.metadata)   # {"H1": "Introduction", "H2": "Background"}
    print("CONTENT: ", doc.page_content[:120])
    print()

Each resulting Document object carries its heading path in metadata, which you can later use for metadata filtering in the vector store, e.g., retrieve only chunks under H1 = "Methodology". This can markedly improve precision for structured corpora like technical documentation, academic papers, or legal contracts.
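To see what "heading path as metadata" means without any library, the following is a simplified sketch of the idea. It is not LangChain's implementation: the function `split_by_headings` and the dict layout are assumptions for illustration, and it does not handle `#` characters inside code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one {"metadata": ..., "content": ...} record per heading-delimited section."""
    sections = []
    path = {}      # current heading path, e.g. {"H1": "Paper", "H2": "Methodology"}
    lines = []     # body lines collected for the current section

    def flush():
        body = "\n".join(lines).strip()
        if body:
            sections.append({"metadata": dict(path), "content": body})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading invalidates this level and every deeper one
            for stale in range(level, 4):
                path.pop(f"H{stale}", None)
            path[f"H{level}"] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return sections


sample = """# Paper
Abstract text.
## Methodology
We sampled 100 documents.
### Metrics
We report recall@k.
## Results
Recall improved.
"""

sections = split_by_headings(sample)
# Metadata filter: keep only sections that sit under the "Methodology" heading
methodology = [s for s in sections if s["metadata"].get("H2") == "Methodology"]
for s in methodology:
    print(s["metadata"], "->", s["content"])
```

The filter keeps both the "Methodology" section and its "Metrics" subsection, because the subsection inherits the H2 entry from its parent. A vector store's metadata filter applies the same kind of predicate before or after the similarity search.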

For HTML documents the same principle applies via HTMLHeaderTextSplitter; for PDFs that expose a table-of-contents tree, you can achieve the same by parsing the bookmark structure before chunking.

3b. Semantic Chunking via Cosine Distance (Embedding-Based)

When documents have no reliable structural metadata (or you want a purely data-driven split), you can let the embedding space decide where one topic ends and another begins.

The algorithm, popularised by Greg Kamradt and available in LangChain as SemanticChunker (LangChain docs), works as follows:

1. Split the text into sentences (or small fixed windows as atomic units).

2. Embed each sentence.

3. Compute the cosine distance $d_i = 1 - \cos(\vec{e}_i, \vec{e}_{i+1})$ between every consecutive pair.

4. Find breakpoints where $d_i$ exceeds a threshold $\tau$; these are the semantic boundaries.

5. Merge sentences between breakpoints into a single chunk.
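The five steps can be sketched end to end in plain Python. To stay self-contained, the sketch swaps the embedding model for a toy bag-of-words vector (`toy_embed` is a placeholder, not a real embedding), uses a naive regex sentence splitter, and takes a fixed threshold `tau`; all three are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Placeholder embedding: a bag-of-words count vector (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(text: str, tau: float) -> list[str]:
    # Step 1: naive sentence split on ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Step 2: embed each sentence
    vectors = [toy_embed(s) for s in sentences]
    # Step 3: distance between each consecutive pair
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # Steps 4-5: start a new chunk wherever the distance exceeds tau
    chunks, current = [], sentences[:1]
    for sentence, d in zip(sentences[1:], distances):
        if d > tau:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The cat sat on the mat. The cat chased the mouse. "
    "Stock markets fell sharply today. Markets fell again after the report."
)
for chunk in semantic_chunks(text, tau=0.8):
    print(chunk)
```

The two cat sentences share vocabulary and stay together, while the jump to the markets sentences has no word overlap (distance 1.0) and triggers a split. With a real embedding model the distances reflect meaning rather than shared words, but the control flow is the same.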

Figure: Cosine-distance semantic chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Any embedding model works, swap for HuggingFaceEmbeddings, VoyageAI, etc.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# ── Common breakpoint strategies ─────────────────────────────────────────────
# "percentile"         → split at distances above the p-th percentile
# "standard_deviation" → split when distance > mean + z*std
# "interquartile"      → split when distance exceeds an IQR-based outlier threshold

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,   # split at the top 15% of distance jumps
)

# raw_text is assumed to hold the document text as a single string
docs = chunker.create_documents([raw_text])

for i, doc in enumerate(docs[:4]):
    print(f"── Chunk {i} ({len(doc.page_content)} chars) ──")
    print(doc.page_content[:200])
    print()

The cosine distance between two consecutive embedding vectors $\vec{e}_i$ and $\vec{e}_{i+1}$ is:

$$d_i = 1 - \frac{\vec{e}_i \cdot \vec{e}_{i+1}}{\|\vec{e}_i\| \, \|\vec{e}_{i+1}\|}$$

For unit-normalised vectors the denominator is 1, so the distance reduces to one minus the dot product. A value close to $0$ means consecutive sentences are topically coherent; a spike toward $1$ signals a topic shift and becomes a candidate boundary. The percentile strategy tends to be the most robust in practice because it adapts the threshold to the distribution of distances in each document rather than using a hard-coded global value.
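The following sketch shows why a percentile threshold adapts per document. It uses a hand-written `percentile` helper with linear interpolation (an assumption made to avoid a NumPy dependency, similar in spirit to what a library percentile function computes) and made-up distance values.

```python
def percentile(values: list[float], p: float) -> float:
    """p-th percentile with linear interpolation between closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoints(distances: list[float], p: float = 85) -> list[int]:
    """Indices i where distances[i] exceeds the document's own p-th percentile."""
    tau = percentile(distances, p)
    return [i for i, d in enumerate(distances) if d > tau]


# Illustrative distances for one document: two clear spikes at positions 3 and 7
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.09, 0.13, 0.60, 0.10, 0.07]
print(breakpoints(distances, p=85))                      # [3, 7]

# A second document whose distances are uniformly half as large
print(breakpoints([d * 0.5 for d in distances], p=85))   # [3, 7]
```

A fixed global threshold of, say, 0.4 would find both spikes in the first document and none in the second, even though the second has the same relative structure. The percentile rule finds the same two boundaries in both.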

Metadata-based vs cosine-distance semantic chunking

| Dimension | Metadata / structure-based | Cosine distance (embedding-based) |
| --- | --- | --- |
| Requires document structure | ✅ Yes (headings, HTML…) | ❌ No, purely data-driven |
| Embedding cost at index time | ✅ None for splitting itself | ⚠️ One embedding per sentence |
| Works on unstructured prose | ❌ Poor | ✅ Excellent |
| Chunk boundary interpretability | ✅ Human-readable (heading path) | ⚠️ Black-box, distance spike |
| Metadata for filtering | ✅ Rich (section, chapter…) | ⚠️ None by default |
| Best for | Docs, wikis, papers, contracts | Web scrapes, transcripts, books |

When semantic chunking makes sense

  • Metadata-based: Technical documentation, knowledge bases, legal or scientific PDFs with clear section hierarchy. Combine with vector store metadata filters for precision retrieval.
  • Cosine-distance: Transcripts, news articles, scraped web content, or any unstructured prose where there is no reliable document skeleton.
  • Either variant is worth the extra cost when your corpus is large (millions of tokens) and retrieval precision matters more than indexing speed.
  • Avoid cosine-distance chunking when your embedding API has tight rate limits: embedding every sentence can be expensive at scale.

4. Enriched Chunks

All the strategies above focus on where to cut. Enriched chunking asks a different question: what extra information can we attach to each chunk to make it more retrievable?

The core insight is that a chunk of text, removed from its document context, often loses the information that would make a retriever recognise it as relevant. "The company's revenue grew by 3% over the previous quarter" is a poor retrieval target without knowing which company, which quarter, or even which document it belongs to.

Enrichment strategies fall into two families: deterministic (rule-based, derived from structure) and LLM-based (generated by a model).

4a. Deterministic Enrichment

Deterministic enrichment derives extra signals from the document's existing structure, with no model inference required. Common strategies include:

- Prepending document-level metadata (title, author, date, URL, section path) to every chunk so that the embedding captures this context.

- Breadcrumb prefix: a condensed path like [Report > Q2 2023 > Revenue Analysis] prepended as plain text.

- Sliding-window title injection: always include the most recently seen heading before each chunk, even if it was split away by the chunker.

- Parent-chunk reference: store a small child chunk for retrieval but return the larger parent chunk to the LLM (the "small-to-big" or "child-to-parent" pattern popularised by LlamaIndex).

from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def enrich_deterministic(
    doc_text: str,
    metadata: dict,         # e.g. {"title": "Q2 Report", "section": "Revenue", "date": "2023-06"}
    chunk_size: int = 512,
    chunk_overlap: int = 64,
) -> list[Document]:
    """
    Prepends a breadcrumb derived from metadata to every chunk so that
    the embedding encodes document provenance alongside content.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    raw_chunks = splitter.split_text(doc_text)

    # Build a human-readable breadcrumb from available metadata fields
    breadcrumb_parts = filter(None, [
        metadata.get("title"),
        metadata.get("section"),
        metadata.get("date"),
    ])
    breadcrumb = " > ".join(breadcrumb_parts)

    enriched = []
    for chunk in raw_chunks:
        enriched_text = f"[{breadcrumb}]\n\n{chunk}" if breadcrumb else chunk
        # Copy the metadata so chunks do not all share (and mutate) one dict
        enriched.append(Document(page_content=enriched_text, metadata=dict(metadata)))

    return enriched


# Usage
docs = enrich_deterministic(
    doc_text=raw_text,
    metadata={"title": "ACME Corp 10-K", "section": "Revenue", "date": "2023-06"},
)
print(docs[0].page_content[:300])

This is the fastest and cheapest enrichment approach: it requires no API calls, is fully reproducible, and is computed at indexing time, so it adds no query-time latency. The trade-off is that it can only surface information that is already explicit in the document structure. For documents that lack rich metadata or have ambiguous section titles, deterministic enrichment has limited upside.

> 💡 Tip: the small-to-big retrieval pattern pairs beautifully with deterministic enrichment. Index small (256-token) child chunks for precise retrieval, but store a pointer to a larger (1024-token) parent chunk that gets passed to the LLM. This gives you high retrieval precision and rich context for generation.
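A minimal sketch of the small-to-big pattern follows. To keep it dependency-free it makes several simplifying assumptions: children are single sentences rather than 256-token chunks, parents are groups of sentences, and relevance is a toy word-overlap score standing in for vector similarity. The names `Child`, `build_index`, and `retrieve_parent` are illustrative, not a LlamaIndex or LangChain API.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int   # index of the parent chunk this child belongs to


def build_index(sentences: list[str], sentences_per_parent: int):
    """Group sentences into parents; every sentence becomes a child pointing at its parent."""
    parents = [
        " ".join(sentences[i:i + sentences_per_parent])
        for i in range(0, len(sentences), sentences_per_parent)
    ]
    children = [Child(s, i // sentences_per_parent) for i, s in enumerate(sentences)]
    return parents, children


def retrieve_parent(query: str, parents: list[str], children: list[Child]) -> str:
    """Match against the small children, but return the larger parent for generation."""
    query_words = set(query.lower().split())
    best = max(children, key=lambda c: len(query_words & set(c.text.lower().split())))
    return parents[best.parent_id]


sentences = [
    "Revenue grew three percent in the second quarter.",
    "Growth was driven by the cloud segment.",
    "Headcount stayed flat during the same period.",
    "The board approved a new buyback programme.",
]
parents, children = build_index(sentences, sentences_per_parent=2)
print(retrieve_parent("what drove cloud growth", parents, children))
```

The query matches only the short "cloud segment" sentence, yet the text handed to the LLM also contains the revenue figure that sentence refers to.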

4b. LLM-Based Enrichment (Contextual Retrieval)

When deterministic enrichment isn't enough, typically because the document is dense, the chunks are ambiguous, or the corpus is large and heterogeneous, you can use an LLM to generate a short situating context for each chunk.

This is exactly what Anthropic describes in their Contextual Retrieval post. The idea is to prepend 50–100 tokens of LLM-generated context to each chunk before embedding it and before building the BM25 index. According to Anthropic's experiments, this reduces the top-20-chunk retrieval failure rate by 49% when combined with BM25, and by 67% when a reranker is added on top.

<document> 
{{WHOLE_DOCUMENT}} 
</document> 

Here is the chunk we want to situate within the whole document:
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 

Please give a short succinct context to situate this chunk within the overall 
document for the purposes of improving search retrieval of the chunk. 
Answer only with the succinct context and nothing else.

import anthropic
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

client = anthropic.Anthropic()

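# Full prompt from Anthropic's post, kept for reference only: generate_context()
# below sends a shortened version as two content blocks so the document part can be cached.
CONTEXT_PROMPT = """<document>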
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""


def generate_context(document: str, chunk: str) -> str:
    """
    Calls Claude to generate a 50-100 token situating context for the chunk.
    Uses prompt caching so the full document is only processed once, dramatically
    reducing cost when a document has many chunks.
    """
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # fast + cheap for batch enrichment
        max_tokens=200,
        system="You are a retrieval optimisation assistant. Be concise.",
        messages=[
            {
                "role": "user",
                "content": [
                    # Cache the whole document, only paid once per document
                    {
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": (
                            f"Here is the chunk we want to situate:\n<chunk>\n{chunk}\n</chunk>\n"
                            "Give a short succinct context to situate this chunk within the "
                            "document for improving search retrieval. Answer only with the context."
                        ),
                    },
                ],
            }
        ],
    )
    return response.content[0].text.strip()


def enrich_with_llm(
    doc_text: str,
    metadata: dict,
    chunk_size: int = 800,
    chunk_overlap: int = 100,
) -> list[Document]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    raw_chunks = splitter.split_text(doc_text)

    enriched = []
    for chunk in raw_chunks:
        context = generate_context(doc_text, chunk)
        # Prepend context to the chunk before embedding
        enriched_text = f"{context}\n\n{chunk}"
        enriched.append(Document(page_content=enriched_text, metadata=metadata))

    return enriched

Cost note from Anthropic: with prompt caching enabled, assuming 800-token chunks, 8K-token documents, and ~100 tokens of generated context per chunk, the one-time enrichment cost comes to roughly $1.02 per million document tokens, a small price for the retrieval gains it unlocks. See the Anthropic Cookbook for a production-ready implementation.
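To see where the saving comes from, it helps to count tokens rather than dollars. The sketch below is back-of-the-envelope arithmetic under stated assumptions: chunks do not overlap, the short instruction text is ignored, the first call for a document writes the cache, and every later call for that document arrives while the cache entry is still alive. The function name and dictionary keys are illustrative, and no prices are assumed.

```python
import math


def enrichment_token_volume(doc_tokens: int, chunk_tokens: int, context_tokens: int) -> dict:
    """Approximate token counts for contextualising every chunk of one document."""
    n_chunks = math.ceil(doc_tokens / chunk_tokens)
    return {
        "chunks": n_chunks,
        # Without caching, every call re-reads the whole document as regular input
        "uncached_document_input": n_chunks * doc_tokens,
        # With caching, the document is written once and read from cache afterwards
        "cache_write": doc_tokens,
        "cache_reads": (n_chunks - 1) * doc_tokens,
        # The chunk itself is always sent as regular input
        "chunk_input": n_chunks * chunk_tokens,
        "generated_output": n_chunks * context_tokens,
    }


volume = enrichment_token_volume(doc_tokens=8000, chunk_tokens=800, context_tokens=100)
for name, tokens in volume.items():
    print(f"{name:>26}: {tokens}")
```

For the 8K-token document in the note, the whole document would be sent ten times (80,000 input tokens) without caching. With caching, 72,000 of those tokens become cache reads, which is why the `cache_control` marker in `generate_context` matters for cost.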

Beyond situating context, LLMs can also generate hypothetical questions that a chunk answers. This is sometimes described as HyDE (arxiv:2212.10496) in reverse: HyDE generates a hypothetical document from the query, whereas here hypothetical queries are generated from each chunk. Embedding these questions alongside the chunk itself helps bridge the lexical gap between how users phrase queries and how information is stored.

def generate_hypothetical_questions(chunk: str, n: int = 3) -> list[str]:
    """
    Ask the LLM to generate n questions that this chunk answers.
    These questions are embedded alongside the chunk, bridging the
    query–document lexical gap.
    """
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Given this text:\n<chunk>\n{chunk}\n</chunk>\n\n"
                    f"Generate {n} concise questions that this text directly answers. "
                    "Return one question per line, no bullets or numbering."
                ),
            }
        ],
    )
    return [q.strip() for q in response.content[0].text.strip().split("\n") if q.strip()]


# Example: embed questions + chunk together
def enrich_with_questions(doc_text: str, metadata: dict) -> list[Document]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    raw_chunks = splitter.split_text(doc_text)

    enriched = []
    for chunk in raw_chunks:
        questions = generate_hypothetical_questions(chunk, n=3)
        questions_block = "\n".join(f"Q: {q}" for q in questions)
        enriched_text = f"{questions_block}\n\n{chunk}"
        enriched.append(Document(page_content=enriched_text, metadata=metadata))

    return enriched

Deterministic vs LLM enrichment

| Dimension | Deterministic | LLM-based (Contextual Retrieval) |
| --- | --- | --- |
| Cost | ✅ Free | ⚠️ ~$1/M tokens (with caching) |
| Speed (indexing) | ✅ Instant | ⚠️ One API call per chunk |
| Context quality | ⚠️ Limited to existing metadata | ✅ Rich, semantic, cross-chunk aware |
| Works on unstructured docs | ❌ Poorly | ✅ Excellent |
| Retrieval failure reduction | ⚠️ Moderate | ✅ Up to 49–67% (Anthropic) |
| Reproducibility | ✅ Fully deterministic | ⚠️ Slight non-determinism |
| Best for | Structured corpora, tight budgets | Large heterogeneous knowledge bases |

When enriched chunking makes sense

  • Deterministic: Any time you have reliable document metadata (section headers, dates, authors, document type). Zero cost, high ROI. Always do this as a baseline before considering LLM enrichment.
  • LLM-based: When retrieval quality is a primary product concern (customer-facing chatbots, legal search, medical Q&A) and you have budget for a one-time indexing cost. The 49–67% failure-rate reduction Anthropic reports (contextual embeddings plus BM25, with a reranker for the upper figure) is hard to match with any other single addition to the pipeline.
  • Hypothetical questions: Particularly valuable when users ask questions in natural language that differ structurally from how the source material is written, e.g., a regulatory corpus queried in plain English.
  • Skip LLM enrichment if your knowledge base changes frequently (re-enrichment cost) or if your chunks are already short and self-contained enough to embed well without context.

5. When to Use Each Strategy: Summary

No single chunking strategy wins in all scenarios. The right choice depends on your corpus structure, latency budget, indexing frequency, and how much retrieval quality matters to the end user.

Chunking strategies at a glance

| Strategy | Best corpus type | Indexing cost | Retrieval quality | Recommended when |
| --- | --- | --- | --- | --- |
| Basic (fixed-size) | Homogeneous / logs | ✅ Minimal | ⚠️ Low | Prototypes, baselines, structured records |
| Overlap | General prose | ✅ Minimal | ✅ Moderate | Default production starting point |
| Semantic (metadata) | Docs, wikis, papers | ✅ Low | ✅ Good + filterable | Structured corpora with clear headings |
| Semantic (cosine) | Unstructured prose | ⚠️ Medium (embed per sentence) | ✅ Good | Transcripts, scraped content, books |
| Enriched (deterministic) | Any + metadata available | ✅ Low | ✅ Good | Always as a baseline enrichment layer |
| Enriched (LLM) | Heterogeneous / dense | ⚠️ Medium (~$1/M tokens) | ✅✅ Excellent | Production systems, customer-facing RAG |

Figure: Decision tree for choosing your chunking strategy

The best RAG systems layer these strategies: start with an appropriate structural split, add overlap for resilience at boundaries, apply deterministic enrichment for free context injection, and, where quality demands it, invoke LLM-based contextualisation. Anthropic's Contextual Retrieval results show that these gains stack: combining contextual embeddings + BM25 + a reranker reduces the top-20 retrieval failure rate by 67% compared to vanilla embeddings alone.

Start simple, measure recall@k at each step, and add complexity only where the numbers justify it.
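For reference, recall@k itself takes only a few lines. The sketch below assumes you already have, for each test query, the ranked chunk ids your retriever returned and a hand-labelled set of relevant chunk ids; the ids and function names are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear among the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["c3", "c7", "c1", "c9"], {"c1", "c2"}),   # one of two relevant chunks in the top 3
    (["c4", "c5", "c6", "c8"], {"c5"}),         # the only relevant chunk is in the top 3
]
print(mean_recall_at_k(runs, k=3))   # 0.75
```

Chunk ids change whenever the chunking strategy changes, so when comparing strategies it is usually easier to label relevance by answer span or source passage and map that onto each strategy's chunks.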