Six Blockers to Production RAG: Building DocuSense
Most RAG demos look great until a real user asks a real question. DocuSense — a system for asking precise questions over your PDFs — was no different at the start. What follows is the honest build log: every blocker I hit, in order, and the change that actually got me past it.
Blocker 1 — "one person's experience" returned three people
My very first real test broke it. I uploaded three CVs into one session and asked for a single person's experience. The pipeline cheerfully returned bits of all three — blended together as if they were one résumé.
The cause was the chunking. I was splitting every document into fixed-size pieces, and a chunk that starts mid-section carries no signal about who or what it belongs to. "Led the backend team" is meaningless on its own — whose backend team?
I read through Anthropic's writeup on contextual retrieval and switched to context-aware chunking: before embedding a chunk, I prepend a short, model-generated note about where it sits — which document, which person, which section.
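# Ask the model for a short note on where this chunk sits (document, person, section).
context = llm.summarize(
    f"Document: {doc_title}\nCandidate: {candidate_name}\n\nChunk:\n{chunk}"
)
# Embed the note together with the chunk so the vector carries ownership.
embedding = embed(f"{context}\n\n{chunk}")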
Now the embedding encodes context and ownership, not just words: each chunk carries which document and which person it came from. The three CVs stopped bleeding into each other.
Blocker 2 — huge documents, and the exact line is nowhere
Context-aware chunking fixed identity, but on large documents the retriever still missed. Someone would ask about an exact clause — a specific number, a code, a defined term — and dense vector search would return something semantically near but not the actual line.
Dense embeddings are great at "what you mean" and bad at "what you typed." So I added a sparse retriever alongside the dense one — BM25 for exact-term matching — and fused the two with reciprocal-rank fusion. Hybrid search.
Dense finds the meaning. Sparse finds the exact token. Real questions need both.
The exact-clause queries started landing on the exact clause.
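Here is a minimal sketch of the fusion step, not the DocuSense code itself. It assumes each retriever returns chunk ids ordered best-first; the ids, the function name, and the constant k=60 (a commonly used default) are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results: chunk ids ordered best-first by each retriever.
dense_hits = ["c7", "c2", "c9"]   # from the vector index
sparse_hits = ["c2", "c4", "c7"]  # from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because the fusion only looks at ranks, it never has to reconcile BM25 scores with cosine similarities, which live on different scales. A chunk that both retrievers rank highly rises to the top.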
Blocker 3 — highlighting the answer inside the PDF
An answer you can't point to isn't trustworthy. I wanted the precise source sentence highlighted inside the original PDF, which means I needed real bounding boxes for text — not just a page number.
I tried three parsers: PyMuPDF, Apache Tika, and Docling. PyMuPDF and Tika were fine for raw text, but getting reliable layout coordinates out of them across messy real-world PDFs was fiddly. Docling performed best in my tests: it gave me clean bounding boxes I could map straight back onto the rendered page, so every answer now highlights its exact source.
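Mapping a box back onto the rendered page is mostly a unit conversion plus an axis flip. The sketch below assumes the parser reports boxes in PDF points with a bottom-left origin and that the page is rendered at a known DPI. Both are assumptions to check against your parser's output, since some tools report a top-left origin instead, and the function name is hypothetical.

```python
def pdf_bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Convert a PDF-space box to pixel coordinates on a rendered page image.

    bbox is (left, bottom, right, top) in points with a bottom-left origin.
    Returns (x0, y0, x1, y1) in pixels with a top-left origin.
    """
    left, bottom, right, top = bbox
    scale = dpi / 72.0  # a PDF point is 1/72 inch
    x0 = left * scale
    x1 = right * scale
    # Flip the y axis: PDF y grows upward, image y grows downward.
    y0 = (page_height_pt - top) * scale
    y1 = (page_height_pt - bottom) * scale
    return (round(x0), round(y0), round(x1), round(y1))


# A US Letter page is 612 x 792 pt; the box values here are made up.
rect = pdf_bbox_to_pixels((72, 700, 300, 720), page_height_pt=792)
print(rect)
```

If the parser already uses a top-left origin, drop the flip and only apply the scale.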
Blocker 4 — questions that needed the outside world
Some questions simply weren't answerable from the uploaded documents. So I wired in web search as an additional tool, letting the agent reach outside the corpus when the documents don't contain the answer — while still grounding and attributing whatever it pulls in.
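One simple way to approximate this behavior is a score-threshold fallback, sketched below. It is a stand-in for illustration, not the agent tool-calling setup described above: the Passage type, the two search callables, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    source: str   # file name or URL, kept for attribution
    score: float  # retriever relevance, higher is better


def gather_evidence(
    question: str,
    search_docs: Callable[[str], List[Passage]],
    search_web: Callable[[str], List[Passage]],
    min_score: float = 0.5,
) -> List[Passage]:
    """Prefer the uploaded documents; fall back to the web when they are weak."""
    doc_hits = [p for p in search_docs(question) if p.score >= min_score]
    if doc_hits:
        return doc_hits
    # Nothing in the corpus cleared the bar, so reach outside it.
    # Every passage keeps its source so the answer can cite it.
    return search_web(question)


# Stub retrievers standing in for the real document index and web search tool.
def fake_docs(question):
    return [Passage("Led the backend team", "cv_a.pdf", 0.2)]


def fake_web(question):
    return [Passage("Example result", "https://example.com", 0.9)]


evidence = gather_evidence("What is the current base rate?", fake_docs, fake_web)
print([p.source for p in evidence])
```

Keeping the source on every passage is what lets the answer attribute web content the same way it attributes a PDF.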
Blocker 5 — running it on-prem, and showing the model think
For deployments that can't send documents to a hosted API, I connected an on-prem LLM — the GLM model — so the whole pipeline runs inside the customer's own walls. I also surfaced the model's reasoning ("thinking") in the UI, so users can see why an answer was given, not just the answer. Then I added vision models so the system could actually read images and scanned pages rather than skipping them.
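To show reasoning separately in a UI, the raw output has to be split into reasoning and answer. The sketch below assumes the model wraps its reasoning in <think>...</think> tags inside the response text. That is an assumption about the output format: some serving stacks return reasoning in a separate field, in which case no parsing is needed.

```python
import re

# Assumed format: reasoning is wrapped in <think> tags within the raw text.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str):
    """Separate reasoning from the final answer in raw model output."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return reasoning, answer


raw = "<think>The clause is on page 4.</think>The notice period is 30 days."
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```

The UI can then render the reasoning in a collapsible panel and the answer as the main response.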
Blocker 6 — the images inside the PDF were invisible to retrieval
Reading an image when asked is one thing; retrieving the right image for a question is another. Charts, diagrams, and figures had no text, so they never surfaced in search.
The fix: run every embedded image through a small VLM to generate a text description, then add those descriptions into the chunk index alongside the document text. Suddenly images were first-class search results — ask about a diagram and the diagram comes back.
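A minimal sketch of that indexing step follows. The IndexEntry type and the describe callable are hypothetical stand-ins: describe represents whatever VLM call produces a caption, and it is stubbed here so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class IndexEntry:
    text: str  # what gets embedded and searched
    kind: str  # "text" or "image"
    page: int
    ref: str   # id used to return the original chunk or image


def index_images(
    images: Iterable[Tuple[str, int, bytes]],
    describe: Callable[[bytes], str],
) -> List[IndexEntry]:
    """Turn (image_id, page, image_bytes) triples into searchable entries."""
    entries = []
    for image_id, page, image_bytes in images:
        caption = describe(image_bytes).strip()
        if not caption:
            continue  # the model produced nothing searchable for this image
        entries.append(IndexEntry(text=caption, kind="image", page=page, ref=image_id))
    return entries


# Stub standing in for the VLM call; a real one would send the image to a model.
def fake_describe(image_bytes: bytes) -> str:
    return "Bar chart of revenue by quarter" if image_bytes == b"chart" else ""


entries = index_images([("fig-1", 3, b"chart"), ("fig-2", 5, b"blank")], fake_describe)
print(entries)
```

These entries can go through the same embedding and BM25 indexing as text chunks. The kind and ref fields let the retriever hand back the image itself instead of its caption.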
None of this was one clever trick. It was a stack of specific failures — blended CVs, missed clauses, unhighlightable answers, invisible images — and a specific fix for each. That's what "production RAG" actually means: not a magic retriever, but a pipeline that has met real documents and survived them.