Feb 18, 2026·4 min read

Six Things That Broke My RAG in Production

Tags: RAG, AI Engineering, LLMs, Vector Search

Most RAG demos look great until a real user asks a real question. DocuSense — a system for asking precise questions over your PDFs — was no different at the start. What follows is the honest build log: every blocker I hit, in order, and the change that actually got me past it.

Blocker 1 — "one person's experience" returned three people

My very first real test broke it. I uploaded three CVs into one session and asked for a single person's experience. The pipeline cheerfully returned bits of all three — blended together as if they were one résumé.

The cause was the chunking. I was splitting every document into fixed-size pieces, and a chunk that starts mid-section carries no signal about who or what it belongs to. "Led the backend team" is meaningless on its own — whose backend team?

I read through Anthropic's writeup on contextual retrieval and switched to context-aware chunking: before embedding a chunk, I prepend a short, model-generated note about where it sits — which document, which person, which section.

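# `llm` and `embed` stand in for whatever LLM and embedding clients you use.
# Ask the model for a short note situating the chunk: document, person, section.
context = llm.summarize(
    f"Document: {doc_title}\nCandidate: {candidate_name}\n\nChunk:\n{chunk}"
)
# Embed the note together with the chunk, so the vector carries ownership.
embedding = embed(f"{context}\n\n{chunk}")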

Now the embedding encodes context and ownership, not just words. The three CVs stopped bleeding into each other.

Blocker 2 — huge documents, and the exact line is nowhere

Context-aware chunking fixed identity, but on large documents the retriever still missed. Someone would ask about an exact clause — a specific number, a code, a defined term — and dense vector search would return something semantically near but not the actual line.

Dense embeddings are great at "what you mean" and bad at "what you typed." So I added a sparse retriever alongside the dense one — BM25 for exact-term matching — and fused the two with reciprocal-rank fusion. Hybrid search.

Dense finds the meaning. Sparse finds the exact token. Real questions need both.

The exact-clause queries started landing on the exact clause.
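
For anyone who hasn't seen it, reciprocal-rank fusion is small enough to sketch in a few lines. It only looks at each chunk's position in each ranked list, so the dense and BM25 scores never have to be on the same scale. This is a minimal illustration, not the DocuSense code: the chunk IDs are made up, and k=60 is the constant commonly used in the literature, not necessarily the value used here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1 for the top hit.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top-3 results from each retriever (illustrative IDs only).
dense_hits = ["c7", "c2", "c9"]
sparse_hits = ["c2", "c4", "c7"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

A chunk that both retrievers rank highly rises above one that only a single retriever found, which is the behavior the exact-clause queries needed.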

Blocker 3 — highlighting the answer inside the PDF

An answer you can't point to isn't trustworthy. I wanted the precise source sentence highlighted inside the original PDF, which means I needed real bounding boxes for text — not just a page number.

I tried three parsers: PyMuPDF, Apache Tika, and Docling. PyMuPDF and Tika were fine for raw text but fiddly for reliable layout coordinates across messy real-world PDFs. Docling performed best — it gave me clean bounding boxes I could map straight back onto the rendered page, so every answer now highlights its exact source.
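
The "map back onto the rendered page" step is mostly a coordinate conversion. Here is a rough sketch under two assumptions that the parser output should be checked against: the box is in PDF points with the origin at the bottom-left of the page, and the page is rendered to an image with a top-left origin. `PdfBox` and `to_pixel_rect` are names invented for this example, not part of any parser's API.

```python
from typing import NamedTuple, Tuple


class PdfBox(NamedTuple):
    """A text box in PDF points (1/72 inch), origin at the page's bottom-left."""
    left: float
    bottom: float
    right: float
    top: float


def to_pixel_rect(box: PdfBox, page_height_pt: float, dpi: int = 144) -> Tuple[int, int, int, int]:
    """Convert a bottom-left-origin PDF box to a top-left-origin pixel rectangle.

    Returns (x0, y0, x1, y1) for a page rendered at `dpi`.
    """
    scale = dpi / 72.0  # pixels per PDF point
    x0 = box.left * scale
    x1 = box.right * scale
    # Flip the y axis: the top of the box is closest to the top of the image.
    y0 = (page_height_pt - box.top) * scale
    y1 = (page_height_pt - box.bottom) * scale
    return (round(x0), round(y0), round(x1), round(y1))


# A line of text one inch from the left edge of a US Letter page (792 pt tall).
rect = to_pixel_rect(PdfBox(left=72, bottom=700, right=300, top=712), page_height_pt=792)
print(rect)
```

If the parser already reports a top-left origin, the y flip has to be dropped; getting that wrong puts the highlight at the mirrored position on the page.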

Blocker 4 — questions that needed the outside world

Some questions simply weren't answerable from the uploaded documents. So I wired in web search as an additional tool, letting the agent reach outside the corpus when the documents don't contain the answer — while still grounding and attributing whatever it pulls in.
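
The routing idea can be sketched without any agent framework. In the real system the agent decides when to call the tool; the version below swaps that decision for a plain relevance threshold so the logic is visible. `Passage`, `gather_evidence`, the two search callables, and the 0.5 cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    source: str   # file name or URL, kept so the answer can cite it
    score: float  # retriever relevance, higher is better


def gather_evidence(
    question: str,
    search_docs: Callable[[str], List[Passage]],
    search_web: Callable[[str], List[Passage]],
    min_score: float = 0.5,
) -> List[Passage]:
    """Prefer the uploaded documents; fall back to the web only if they come up empty."""
    doc_hits = [p for p in search_docs(question) if p.score >= min_score]
    if doc_hits:
        return doc_hits
    # Nothing in the corpus cleared the bar, so reach outside it.
    return search_web(question)


def format_citations(passages: List[Passage]) -> str:
    """List each distinct source once, in first-seen order."""
    seen: List[str] = []
    for p in passages:
        if p.source not in seen:
            seen.append(p.source)
    return "\n".join(f"[{i}] {src}" for i, src in enumerate(seen, start=1))
```

Whatever path the evidence takes, every passage carries its `source`, so web results get attributed the same way document results do.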

Blocker 5 — running it on-prem, and showing the model think

For deployments that can't send documents to a hosted API, I connected an on-prem LLM — the GLM model — so the whole pipeline runs inside the customer's own walls. I also surfaced the model's reasoning ("thinking") in the UI, so users can see why an answer was given, not just the answer. Then I added vision models so the system could actually read images and scanned pages rather than skipping them.
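
Surfacing the reasoning mostly comes down to separating it from the final answer before rendering. The sketch below assumes the model wraps its reasoning in `<think>...</think>` tags inside the completion text. That is an assumption about the output format, not something stated above, and some serving stacks return the reasoning in a separate response field instead, in which case no parsing is needed.

```python
import re
from typing import Tuple

# Assumed format: reasoning appears inside <think>...</think> in the raw text.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a raw model completion."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "<think>Only the second CV mentions backend work.</think>\nFive years, all at one company."
)
print(reasoning)
print(answer)
```

The UI can then show `reasoning` in a collapsible panel and `answer` as the main response.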

Blocker 6 — the images inside the PDF were invisible to retrieval

Reading an image when asked is one thing; retrieving the right image for a question is another. Charts, diagrams, and figures had no text, so they never surfaced in search.

The fix: run every embedded image through a small VLM to generate a text description, then add those descriptions into the chunk index alongside the document text. Suddenly images were first-class search results — ask about a diagram and the diagram comes back.
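
In code, the change amounts to turning each image into one more text entry in the index. A minimal sketch: `describe_image` is a placeholder for whatever VLM call is used, and `IndexEntry` and `build_entries` are invented names, not the DocuSense schema.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class IndexEntry:
    text: str            # what gets embedded and keyword-indexed
    kind: str            # "text" or "image"
    page: int
    image_ref: str = ""  # pointer back to the image, empty for text chunks


def build_entries(
    text_chunks: Iterable[Tuple[int, str]],
    images: Iterable[Tuple[int, str, bytes]],
    describe_image: Callable[[bytes], str],
) -> List[IndexEntry]:
    """Merge text chunks and VLM-generated image descriptions into one index."""
    entries = [IndexEntry(text=text, kind="text", page=page) for page, text in text_chunks]
    for page, ref, image_bytes in images:
        # The description is what retrieval matches; the ref is what the user sees.
        caption = describe_image(image_bytes)
        entries.append(IndexEntry(text=caption, kind="image", page=page, image_ref=ref))
    return entries
```

Because image entries share the same `text` field as ordinary chunks, they go through the same dense and sparse retrievers with no special casing; only the rendering step needs to check `kind`.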

None of this was one clever trick. It was a stack of specific failures — blended CVs, missed clauses, unhighlightable answers, invisible images — and a specific fix for each. That's what "production RAG" actually means: not a magic retriever, but a pipeline that has met real documents and survived them.
