Feb 18, 2026·4 min read

Six Blockers to Production RAG: Building DocuSense

RAG · AI Engineering · LLMs · Vector Search

Most RAG demos look great until a real user asks a real question. DocuSense — a system for asking precise questions over your PDFs — was no different at the start. What follows is the honest build log: every blocker I hit, in order, and the change that actually got me past it.

Blocker 1 — "one person's experience" returned three people

My very first real test broke it. I uploaded three CVs into one session and asked for a single person's experience. The pipeline cheerfully returned bits of all three — blended together as if they were one résumé.

The cause was the chunking. I was splitting every document into fixed-size pieces, and a chunk that starts mid-section carries no signal about who or what it belongs to. "Led the backend team" is meaningless on its own — whose backend team?

I read through Anthropic's writeup on contextual retrieval and switched to context-aware chunking: before embedding a chunk, I prepend a short, model-generated note about where it sits — which document, which person, which section.

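# Ask the model for a short note situating the chunk in its source document.
context = llm.summarize(
    f"Document: {doc_title}\nCandidate: {candidate_name}\n\nChunk:\n{chunk}"
)
# Embed the note together with the chunk so the vector carries that context.
embedding = embed(f"{context}\n\n{chunk}")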

Now the embedding carries provenance (which document, which person, which section), not just the words in the chunk. The three CVs stopped bleeding into each other.
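To make the failure mode concrete, here is a minimal runnable sketch of fixed-size chunking followed by the context prefix. The `describe` callable stands in for the LLM call, and the function names, the chunk size, and the sample CV text are all illustrative assumptions rather than DocuSense's actual code.

```python
from typing import Callable, List


def fixed_size_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def contextualize(chunks: List[str], doc_title: str,
                  describe: Callable[[str, str], str]) -> List[str]:
    """Prepend a short situating note to each chunk before it is embedded."""
    return [f"{describe(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]


def stub_describe(doc_title: str, chunk: str) -> str:
    # Hypothetical stand-in for the model-generated note; a real system
    # would call an LLM here.
    return f"From the document '{doc_title}'."


# Made-up sample text, used only to show the mechanics.
cv_text = "Jane Doe. Experience: Led the backend team at Acme for four years."
raw = fixed_size_chunks(cv_text, 30)
prepared = contextualize(raw, "Jane Doe CV", stub_describe)
```

The second raw chunk starts mid-sentence and names nobody, which is exactly the ownerless fragment described above. After `contextualize`, every piece carries its document title into the embedding.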

Blocker 2 — huge documents, and the exact line is nowhere

Context-aware chunking fixed identity, but on large documents the retriever still missed. Someone would ask about an exact clause — a specific number, a code, a defined term — and dense vector search would return something semantically near but not the actual line.

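Dense embeddings are great at "what you mean" and bad at "what you typed." So I added a sparse retriever alongside the dense one (BM25, for exact-term matching) and fused the two result lists with reciprocal-rank fusion, which combines ranks rather than raw scores, so BM25 scores and vector similarities never need to be put on a common scale. That is hybrid search.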

Dense finds the meaning. Sparse finds the exact token. Real questions need both.

The exact-clause queries started landing on the exact clause.
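Reciprocal-rank fusion is small enough to sketch in full. Each retriever contributes 1 / (k + rank) for every chunk it returns, and the sums decide the final order. The chunk IDs below are hypothetical, and `k = 60` is a commonly used default, not necessarily the value DocuSense uses.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Fuse ranked lists of IDs; each list adds 1 / (k + rank) per ID."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["c7", "c2", "c9"]   # hypothetical chunk IDs from vector search
sparse_hits = ["c2", "c4", "c7"]  # hypothetical chunk IDs from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
# fused == ["c2", "c7", "c4", "c9"]
```

Chunks that both retrievers agree on (`c2`, `c7`) rise to the top, while a chunk found by only one retriever still survives lower in the list.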

Blocker 3 — highlighting the answer inside the PDF

An answer you can't point to isn't trustworthy. I wanted the precise source sentence highlighted inside the original PDF, which means I needed real bounding boxes for text — not just a page number.

I tried three parsers: PyMuPDF, Apache Tika, and Docling. PyMuPDF and Tika were fine for raw text but fiddly for reliable layout coordinates across messy real-world PDFs. Docling performed best — it gave me clean bounding boxes I could map straight back onto the rendered page, so every answer now highlights its exact source.
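Mapping a box back onto the rendered page is mostly a coordinate conversion. The sketch below assumes the parser reports boxes in PDF points with a bottom-left origin and that the page is rendered at a known DPI. Both are assumptions to check against your parser's output, since some tools report a top-left origin instead. The function and field names are illustrative, not Docling's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangle in image pixels, origin at the top-left corner."""
    left: float
    top: float
    right: float
    bottom: float


def pdf_box_to_pixels(left: float, bottom: float, right: float, top: float,
                      page_height_pt: float, dpi: float = 144.0) -> Box:
    """Convert a bottom-left-origin box in PDF points to top-left pixels."""
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    return Box(
        left=left * scale,
        top=(page_height_pt - top) * scale,        # flip the y axis
        right=right * scale,
        bottom=(page_height_pt - bottom) * scale,
    )


# A line of text on a US Letter page (792 pt tall), rendered at 144 DPI.
highlight = pdf_box_to_pixels(72.0, 700.0, 300.0, 712.0, page_height_pt=792.0)
# highlight == Box(left=144.0, top=160.0, right=600.0, bottom=184.0)
```

The y-axis flip is the step that most often goes wrong: skip it and highlights land mirrored vertically on the page.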

Blocker 4 — questions that needed the outside world

Some questions simply weren't answerable from the uploaded documents. So I wired in web search as an additional tool, letting the agent reach outside the corpus when the documents don't contain the answer — while still grounding and attributing whatever it pulls in.
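One way to picture the fallback is below. The post describes an agent that chooses the web-search tool itself. This sketch swaps that decision for a plain relevance threshold, which is a simplification for illustration. `Passage`, `gather_evidence`, `min_score`, and the stub retrievers are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Passage:
    text: str
    source: str   # file name or URL, kept so the answer can cite it
    score: float  # retriever relevance, higher is better


Retriever = Callable[[str], List[Passage]]


def gather_evidence(question: str, search_docs: Retriever,
                    search_web: Retriever, min_score: float = 0.5) -> List[Passage]:
    """Prefer passages from the uploaded documents; fall back to the web."""
    doc_hits = [p for p in search_docs(question) if p.score >= min_score]
    if doc_hits:
        return doc_hits
    # Nothing in the corpus is relevant enough, so reach outside it.
    return search_web(question)


# Hypothetical stand-ins for the real retriever and web-search tool.
def weak_docs(question: str) -> List[Passage]:
    return [Passage("Unrelated appendix text", "report.pdf", 0.1)]


def fake_web(question: str) -> List[Passage]:
    return [Passage("Public reference text", "https://example.com/page", 0.9)]


evidence = gather_evidence("What is the current base rate?", weak_docs, fake_web)
```

Because every `Passage` keeps its `source`, the answer can attribute web material just as it attributes a page in an uploaded PDF.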

Blocker 5 — running it on-prem, and showing the model think

For deployments that can't send documents to a hosted API, I connected an on-prem LLM — the GLM model — so the whole pipeline runs inside the customer's own walls. I also surfaced the model's reasoning ("thinking") in the UI, so users can see why an answer was given, not just the answer. Then I added vision models so the system could actually read images and scanned pages rather than skipping them.
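Showing the reasoning in the UI means separating it from the final answer. The sketch below assumes the model wraps its reasoning in `<think>` tags inside the raw output, which is a common convention but an assumption here. Some serving stacks return the reasoning as a separate field instead, in which case no parsing is needed. The function name and the sample output are illustrative.

```python
import re
from typing import Tuple

# Non-greedy match so several reasoning blocks are captured separately.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    reasoning = "\n".join(part.strip() for part in _THINK.findall(raw))
    answer = _THINK.sub("", raw).strip()
    return reasoning, answer


# Made-up model output, used only to show the mechanics.
raw_output = "<think>The clause is on page 3.</think>The notice period is 30 days."
reasoning, answer = split_reasoning(raw_output)
# reasoning == "The clause is on page 3."
# answer == "The notice period is 30 days."
```

Output with no tags comes back as an empty reasoning string and the full text as the answer, so the UI can simply hide the reasoning panel.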

Blocker 6 — the images inside the PDF were invisible to retrieval

Reading an image when asked is one thing; retrieving the right image for a question is another. Charts, diagrams, and figures had no text, so they never surfaced in search.

The fix: run every embedded image through a small VLM to generate a text description, then add those descriptions to the chunk index alongside the document text. Images became first-class search results: ask about a diagram and the diagram comes back.
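In index terms, an image becomes one more entry whose searchable text is its generated description. A minimal sketch follows, with a stub in place of the VLM. `IndexEntry`, `image_entries`, the stub, and the sample IDs are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    text: str   # the text that gets embedded and searched
    kind: str   # "text" or "image"
    page: int
    ref: str    # chunk or image ID handed back to the UI


def image_entries(images: Iterable[Tuple[str, int, bytes]],
                  describe_image: Callable[[bytes], str]) -> List[IndexEntry]:
    """Turn (image_id, page, image_bytes) triples into searchable entries."""
    entries = []
    for image_id, page, data in images:
        description = describe_image(data).strip()
        if description:  # skip images the model returned nothing for
            entries.append(IndexEntry(description, "image", page, image_id))
    return entries


def stub_vlm(data: bytes) -> str:
    # Hypothetical stand-in for the VLM call.
    return "Bar chart of quarterly revenue" if data else ""


index = [IndexEntry("Revenue grew in every quarter.", "text", 2, "chunk-14")]
index += image_entries([("img-3", 2, b"\x89PNG"), ("img-4", 5, b"")], stub_vlm)
```

Keeping `kind` and `ref` on each entry lets the retriever treat descriptions like any other chunk while the UI still knows to return the image itself.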


None of this was one clever trick. It was a stack of specific failures — blended CVs, missed clauses, unhighlightable answers, invisible images — and a specific fix for each. That's what "production RAG" actually means: not a magic retriever, but a pipeline that has met real documents and survived them.
