Navigating AI Frontier – Stop the Funeral: RAG Is Very Much Alive

By Shammy Narayanan

In the previous edition, we tore into the mechanics of RAG and uncovered why AI sometimes gives outrageously wrong advice—like recommending glue for pizza toppings. We learned that these blunders aren’t “AI hallucinations” but symptoms of bad RAG design: wrong chunk sizes, poor overlaps, and mismatched retrieval strategies. We also saw how building a Golden Dataset and measuring metrics like MRR, nDCG, Recall@K, and Precision@K help evaluate whether a RAG system is working. And just when things got interesting, we left off with a provocative question: with context windows exploding, is RAG becoming obsolete?

With context windows expanding, the obvious question pops up: Do we still need RAG?
The short, unromantic answer: Absolutely yes. Bigger context windows don’t magically replace retrieval—they only make it more expensive and inefficient to avoid it.

First, shoving all documents into the system prompt inflates token usage. This means every single query, no matter how trivial, carries the weight (and cost) of your entire knowledge base. Ask “What is the refund policy?” and you pay for the whole 200-page document again. Ask it five times in a conversation, and you pay for it five times. This is token bleeding in slow motion.

Second, even with open-source LLMs, bypassing RAG hurts performance. Loading thousands of pages into a prompt forces the model to “think across” an unnecessarily massive input. It’s the computational equivalent of asking someone to read an entire encyclopedia before answering, “Where is the nearest ATM?” More text doesn’t mean more accuracy; it often means more noise.

For example:

  • If your policy handbook is 500 pages, feeding all 500 into the context window for every query is wasteful. RAG retrieves just the 2–3 pages where the answer actually lives.
  • Even a 1M-token context window becomes meaningless if the model must scan through 100,000 irrelevant tokens to find one relevant paragraph.

In short, without RAG, your model is searching for a needle in a haystack, except now you keep paying to rebuild the entire haystack every time. The back-of-the-envelope sketch below shows how quickly that adds up.
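
To make the cost argument concrete, here is a quick calculation in Python. The per-page and per-chunk token counts are illustrative assumptions, not measured figures:

```python
# Back-of-the-envelope comparison: full-document prompting vs. retrieval.
# All numbers below are illustrative assumptions, not measurements.

PAGE_TOKENS = 500          # assumed tokens per page of a policy document
DOC_PAGES = 200            # the 200-page refund-policy example above
QUERIES = 5                # the same question asked five times in a conversation
RETRIEVED_CHUNKS = 3       # chunks returned per query with RAG
CHUNK_TOKENS = 500         # assumed tokens per retrieved chunk

full_context = DOC_PAGES * PAGE_TOKENS * QUERIES       # 500,000 input tokens
with_rag = RETRIEVED_CHUNKS * CHUNK_TOKENS * QUERIES   # 7,500 input tokens

print(f"Stuff-everything prompting: {full_context:,} input tokens")
print(f"Retrieval (top-{RETRIEVED_CHUNKS} chunks): {with_rag:,} input tokens")
print(f"Token reduction: {full_context / with_rag:.0f}x")
```

Under these assumptions, retrieval cuts input tokens by roughly 67x for the same five questions.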

So now that we’ve established RAG is very much alive, the real question is: How do we make it smarter, faster, and more reliable? Improving RAG efficiency isn’t about tweaking one magic parameter; it’s a combination of design choices that compound over time. Here are some best practices that actually move the needle, each with a simple, real-world example:

  • Experiment with Chunking Strategies (including Semantic Chunking)
    Different documents demand different chunk sizes. Narrative-heavy content needs larger chunks; FAQ-style content needs smaller ones.

Example: A 300-page HR policy manual contains sections like “Bonuses,” “Leaves,” and “Compliance.” Semantic chunking groups each concept meaningfully instead of arbitrarily slicing every 500 words.
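
One common way to implement semantic chunking is to split wherever the embedding similarity between adjacent sentences drops. The sketch below assumes a generic embed function (any sentence-embedding model) and an illustrative 0.6 threshold to tune per corpus:

```python
# Minimal semantic-chunking sketch: start a new chunk wherever adjacent
# sentences drift apart in embedding space. `embed()` is a placeholder for
# any sentence-embedding model; the threshold is an assumption to tune.
from typing import Callable, List
import numpy as np

def semantic_chunks(sentences: List[str],
                    embed: Callable[[List[str]], np.ndarray],
                    threshold: float = 0.6) -> List[str]:
    vectors = embed(sentences)                         # one vector per sentence
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = (unit[:-1] * unit[1:]).sum(axis=1)          # cosine sim of neighbours

    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:                            # topic shift -> close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Applied to the HR manual above, “Bonuses,” “Leaves,” and “Compliance” would each tend to stay in their own chunks instead of being cut mid-topic every 500 words.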

  • Experiment with Document Converters (PDF, Word, Images → Prefer Markdown)
    RAG fails when your raw documents are messy. Converting PDFs, tables, and images to clean Markdown makes embeddings more accurate.

Example: A PDF with misaligned columns (“₹250 Adult Ticket” broken into three lines) becomes perfectly machine-readable when converted to Markdown.
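
A minimal conversion sketch, assuming the pymupdf4llm library and its to_markdown helper (one of several PDF-to-Markdown converters); any equivalent converter in your stack would do:

```python
# Convert a PDF to Markdown before chunking and embedding.
# The filename is hypothetical; pymupdf4llm is one converter option.
import pymupdf4llm

markdown_text = pymupdf4llm.to_markdown("ticket_prices.pdf")

# The broken "₹250 Adult Ticket" rows from the example above should come out
# as a clean Markdown table the embedder can handle, e.g.:
#
# | Ticket Type | Price |
# |-------------|-------|
# | Adult       | ₹250  |

with open("ticket_prices.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)
```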

  • Improve Prompts (Post-Retrieval Prompting)
    Once relevant chunks are fetched, the LLM still needs guidance. Clear instructions reduce hallucination and set tone.

Example: After retrieval, a prompt like:
“Use only the provided context. If information is missing, say ‘Not available in documents.’” prevents the model from guessing.
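
A minimal sketch of how that guard-rail instruction can be stitched together with the retrieved chunks; call_llm stands in for whatever chat-completion client you use:

```python
# Post-retrieval prompt assembly: the retrieved chunks become the only
# allowed context, and the instruction sets the fallback behaviour.

def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(chunks)
    return (
        "Use only the provided context. If information is missing, "
        "say 'Not available in documents.'\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# answer = call_llm(build_grounded_prompt("What is the refund policy?", retrieved_chunks))
```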

  • Document Pre-processing (Rewrite Before Storing)
    Use an LLM to rewrite raw inputs into cleaner, consistent formats—especially tables, lists, and messy PDFs.

Example: Ticket prices scattered across paragraphs are converted into a uniform table before embedding, improving search recall dramatically.
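
A small pre-processing sketch, assuming a generic call_llm helper, that asks the model to normalize messy text into a Markdown table before it is chunked and embedded:

```python
# Rewrite raw source text into a consistent format *before* indexing.
# `call_llm` is a placeholder for your chat-completion client.

REWRITE_PROMPT = """Rewrite the following text as a clean Markdown table.
Keep every fact; do not add or infer values. If a value is missing, leave the cell blank.

Text:
{raw_text}
"""

def normalize_for_indexing(raw_text: str, call_llm) -> str:
    cleaned = call_llm(REWRITE_PROMPT.format(raw_text=raw_text))
    return cleaned   # store, chunk, and embed this instead of the raw paragraph
```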

  • Query Rewriting (Before Retrieval)
    Users type vague or incomplete queries. Let an LLM rewrite them into precise search queries.

Example: A user asks, “What’s the leave rule?” Query rewriting turns it into “Summarize the eligibility, duration, and approval process for Earned Leave,” leading to better retrieval.
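
A query-rewriting sketch along the same lines; the prompt wording and the call_llm helper are assumptions:

```python
# Turn a vague user question into a precise retrieval query before search.
# `call_llm` is a placeholder for your chat-completion client.

REWRITE_QUERY_PROMPT = """You rewrite user questions into precise retrieval queries
for an HR policy knowledge base. Return only the rewritten query.

User question: {question}
Rewritten query:"""

def rewrite_query(question: str, call_llm) -> str:
    return call_llm(REWRITE_QUERY_PROMPT.format(question=question)).strip()

# rewrite_query("What's the leave rule?", call_llm)
# -> e.g. "Summarize the eligibility, duration, and approval process for Earned Leave."
```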

  • Query Expansion (Generate Multiple Search Variants)
    Instead of one query, generate 3–5 semantically related variants to capture wider coverage.

Example: For “health insurance coverage,” expand into “medical reimbursement,” “outpatient benefits,” and “hospitalization policy,” increasing the chance of finding the right chunk.
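
A sketch of query expansion with de-duplicated retrieval; call_llm and retrieve are placeholders for your LLM client and vector-store search:

```python
# Fan one query out into several related variants, retrieve for each,
# and merge the results without duplicates.

EXPAND_PROMPT = """List 4 short search queries that mean roughly the same as:
"{query}"
Return one query per line, no numbering."""

def expanded_retrieve(query: str, call_llm, retrieve, k: int = 5) -> list[str]:
    raw = call_llm(EXPAND_PROMPT.format(query=query))
    variants = [query] + [v.strip() for v in raw.splitlines() if v.strip()]

    seen, merged = set(), []
    for variant in variants:            # e.g. "medical reimbursement", "hospitalization policy"
        for chunk in retrieve(variant, k=k):
            if chunk not in seen:       # deduplicate chunks found by multiple variants
                seen.add(chunk)
                merged.append(chunk)
    return merged
```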

  • Re-ranking (LLM Re-ranks Retrieved Chunks)
    After retrieval, let the LLM reorder the results based on true relevance.

Example: RAG pulls 10 chunks for “visa documents”; the LLM re-ranks them so the chunk listing passport + bank statements comes first.
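
A sketch of LLM-based re-ranking; the prompt format and call_llm helper are assumptions, and a dedicated cross-encoder re-ranker could replace the LLM call:

```python
# Ask an LLM to reorder the retrieved chunks by true relevance to the question.

RERANK_PROMPT = """Question: {question}

Below are numbered passages. Return the passage numbers in order of relevance,
most relevant first, as a comma-separated list (e.g. 3,1,2).

{passages}
"""

def rerank(question: str, chunks: list[str], call_llm) -> list[str]:
    passages = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    reply = call_llm(RERANK_PROMPT.format(question=question, passages=passages))
    order = [int(n) - 1 for n in reply.replace(" ", "").split(",") if n.isdigit()]
    ranked = [chunks[i] for i in order if 0 <= i < len(chunks)]
    return ranked or chunks        # fall back to the original order if parsing fails
```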

  • Hierarchical Retrieval (Useful for Large, Nested Documents)
    For deeply layered documents (policy → sub-policy → clause), use hierarchical search: first find the section, then the subsection.

Example: In a 200-page “Employee Handbook,” retrieve the “Compensation” section first, then drill down into “Variable Pay Policies.”
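
A two-stage retrieval sketch, assuming separate indexes for section summaries and full-text chunks (retrieve_sections and retrieve_chunks are hypothetical helpers):

```python
# Hierarchical retrieval: find the section first, then search only inside it.

def hierarchical_search(query: str, retrieve_sections, retrieve_chunks, k: int = 3) -> list[str]:
    # Stage 1: match against section-level summaries, e.g. "Compensation".
    top_section = retrieve_sections(query, k=1)[0]

    # Stage 2: drill into that section's chunks, e.g. "Variable Pay Policies".
    return retrieve_chunks(query, section=top_section, k=k)
```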

  • Agentic RAG (Tools + Memory + Reasoning)
    Use RAG with reasoning loops: the model retrieves, thinks again, calls a tool, re-checks memory, and refines the answer.

Example: An internal helpdesk bot retrieves “Laptop Replacement Policy,” cross-checks inventory from a tool, and responds with both the rule and the availability.
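
A stripped-down agentic loop for that helpdesk scenario, assuming hypothetical retrieve, call_llm, and check_inventory helpers; real agent frameworks add memory and multi-step planning on top:

```python
# Agentic RAG sketch: retrieve, reason about whether a tool is needed,
# optionally call it, then answer from context plus tool output.

def helpdesk_agent(question: str, call_llm, retrieve, check_inventory) -> str:
    context = "\n".join(retrieve(question, k=3))   # e.g. pulls "Laptop Replacement Policy"

    decision = call_llm(
        "Given this policy context, does answering require checking laptop "
        f"inventory? Reply YES or NO.\n\nContext:\n{context}\n\nQuestion: {question}"
    )

    stock_note = ""
    if decision.strip().upper().startswith("YES"):
        stock_note = f"\nCurrent stock: {check_inventory('laptop')}"   # tool call

    return call_llm(
        "Answer using only the context and stock information below.\n\n"
        f"Context:\n{context}{stock_note}\n\nQuestion: {question}"
    )
```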

At the end of this three-part journey, one truth stands tall: RAG isn’t a temporary patch—it’s a foundational layer of modern AI systems. Larger context windows may change how we design RAG, but they do not replace why we need it. Retrieval isn’t just about shrinking tokens; it’s about enforcing precision, grounding reasoning, respecting data structures, and building systems that scale without bleeding cost, compute, or sanity. The future of AI won’t be model-only or retrieval-only; it will be hybrid, where smart retrieval meets smart reasoning. If we want AI that doesn’t just answer but answers correctly, consistently, and responsibly, then RAG isn’t dying. It’s evolving. And the teams that understand this nuance will build the next generation of truly intelligent systems. Stay curious, keep experimenting, and let retrieval be your competitive edge.

Shammy Narayanan is the Vice President of Platform, Data, and AI at Welldoc. Holding 11 cloud certifications, he combines deep technical expertise with a strong passion for artificial intelligence. With over two decades of experience, he focuses on helping organizations navigate the evolving AI landscape. He can be reached at shammy45@gmail.com.