The first version of my RAG pipeline was simple: split PDFs into chunks, embed the chunks, store the vectors. Query time: embed the question, find similar chunks, stuff them into the prompt.
The retrieval was mediocre. Relevant documents came back, but so did noise - partial sentences, formatting artefacts, headers without context, table fragments that meant nothing without their column labels.
The problem with raw chunks
PDFs are not structured data. They’re visual documents encoded as text with arbitrary line breaks, headers that float above their content, footnotes that interrupt paragraphs, and tables that serialise into nonsense.
When you split a PDF into 8,000-character chunks, you get fragments. A chunk might start mid-sentence from the previous page. It might contain a header for a section whose content is in the next chunk. It might have a table row separated from its column headers.
Embedding these fragments produces vectors that represent noise as much as signal. The embedding model can’t recover meaning the chunk doesn’t contain.
Summarise first, embed second
The fix was adding an LLM summarisation step between chunking and embedding.
Each chunk gets sent to Claude or GPT with a simple instruction: summarise the key information in this text, preserving the main concepts and their relationships. The summary strips formatting noise, completes partial sentences, contextualises table data, and produces clean, coherent text.
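In code, the step is a single call per chunk. A minimal sketch using the OpenAI Python SDK; the helper name and exact prompt wording here are illustrative, not the production prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SUMMARY_INSTRUCTION = (
    "Summarise the key information in this text, preserving the main "
    "concepts and their relationships."
)

def summarise_chunk(chunk_text: str) -> str:
    """Turn a raw PDF chunk into clean, coherent text worth embedding."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model; the task needs accuracy, not creativity
        messages=[
            {"role": "system", "content": SUMMARY_INSTRUCTION},
            {"role": "user", "content": chunk_text},
        ],
    )
    return response.choices[0].message.content.strip()
```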
Then I embed the summary, not the raw chunk. The original chunk is stored alongside the summary for retrieval - when a search matches, the user sees the source text, not the summary.
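Continuing that sketch, embedding and storage operate on the summary while the raw chunk rides along for display. A plain list stands in for the vector store, the metadata fields are illustrative, and text-embedding-3-small is shown as the embedding model (a local Ollama model is a drop-in swap):

```python
def embed_text(text: str) -> list[float]:
    """Embed the cleaned summary rather than the raw chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def ingest_chunk(chunk_text: str, doc_id: str, page: int, store: list) -> None:
    summary = summarise_chunk(chunk_text)
    store.append({
        "vector": embed_text(summary),  # what search matches against
        "summary": summary,             # kept for inspection
        "chunk": chunk_text,            # what the user sees on a match
        "doc_id": doc_id,
        "page": page,
    })
```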
The difference
With raw chunk embeddings, a query about “BACP supervision requirements” might return a chunk that happens to contain the words “BACP” and “supervision” but is a table of contents entry. The embedding is similar because the words match, but the content is useless.
With summarised embeddings, the same query returns chunks where the summary explicitly describes BACP supervision requirements. The embedding captures the meaning, not just the vocabulary.
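Query time is unchanged from the naive pipeline except for what gets returned: the question is embedded, ranked against the summary vectors, and the matching original chunks are surfaced. A sketch on top of the store above, with brute-force cosine similarity standing in for a proper vector index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(question: str, store: list, top_k: int = 5) -> list[dict]:
    """Rank by similarity to the summary embeddings, return the original chunks."""
    q_vec = embed_text(question)
    ranked = sorted(store, key=lambda rec: cosine(q_vec, rec["vector"]), reverse=True)
    return [{"chunk": r["chunk"], "doc_id": r["doc_id"], "page": r["page"]}
            for r in ranked[:top_k]]
```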
Cost and latency
Summarisation adds cost and time. Every chunk gets an LLM call during ingestion. For 70+ documents with hundreds of chunks, that’s a meaningful bill.
But ingestion is a one-time cost, and the improvement in search quality is permanent. For a knowledge base that is ingested once and then queried thousands of times, the summarisation cost amortises to almost nothing per query.
I used a cheaper model for summarisation (GPT-4o-mini or Claude Haiku) since the task is straightforward. The summaries don’t need to be creative - they need to be accurate and clean.
Pipeline
The full pipeline:
- Extract text from PDF
- Split into chunks (8,000 characters, overlap at paragraph boundaries)
- Summarise each chunk with an LLM
- Embed the summary (originally with OpenAI’s text-embedding-3-small, later switched to a local Ollama model)
- Store the vector, summary, and original chunk together
- Index with metadata (document ID, page number, document type)
Step 3 is the only addition to the naive pipeline. It transformed retrieval from “sometimes relevant” to “reliably relevant.”
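Stitched together, the ingestion side is a short loop. A sketch reusing summarise_chunk and ingest_chunk from above, with pypdf standing in for the extractor and a simplified paragraph-boundary chunker; document type and other metadata are left out:

```python
from pypdf import PdfReader  # any PDF text extractor will do here

def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Greedy chunker: fill up to max_chars, break at paragraph boundaries,
    carry the last paragraph forward as overlap."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = [current[-1]]  # one paragraph of overlap with the previous chunk
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def ingest_pdf(path: str, doc_id: str, store: list) -> None:
    reader = PdfReader(path)
    for page_num, page in enumerate(reader.pages, start=1):
        # simplified: chunking per page, so no chunk spans a page boundary here
        for chunk in chunk_text(page.extract_text() or ""):
            ingest_chunk(chunk, doc_id=doc_id, page=page_num, store=store)
```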