In the previous edition, we peeled back the glossy layer of RAG and exposed why AI sometimes gives hilariously dangerous advice like suggesting glue on a pizza. We learned that this wasn’t “AI hallucination” but the side effect of poor RAG implementation: wrong chunk sizes, sloppy overlaps, and mismatched retrieval strategies. We also explored how different document types from FAQs to mystery novels require different RAG configurations, and why blindly relying on larger context windows won’t magically rescue flawed design. In short, RAG isn’t the villain; bad implementation is.
Link to previous article – https://www.architectureandgovernance.com/applications-technology/navigating-ai-frontier-before-you-blame-ai-hallucination-check-your-rag/
So how do we measure the efficiency of a RAG system?
It’s not easy, but it is absolutely doable. The starting point is building a Golden Dataset: a list of test cases containing questions, correct answers, and the specific chunks where those answers live. This is generally stored in JSONL (JSON Lines) format and becomes the ground truth for evaluating your RAG.
Sample Golden Dataset:
{"id": "q1", "query": "What is the price of an adult ticket?", "expected_answer": "The adult ticket price is ₹250.", "relevant_chunks": ["chunk_12", "chunk_13"]}
{"id": "q2", "query": "Does the insurance policy include dental coverage?", "expected_answer": "Yes, dental coverage is included under outpatient benefits.", "relevant_chunks": ["chunk_07"]}
{"id": "q3", "query": "What documents are required for a visa application?", "expected_answer": "Passport, photographs, bank statements, and an invitation letter are required.", "relevant_chunks": ["chunk_22", "chunk_23", "chunk_24"]}
{"id": "q4", "query": "Explain the refund policy for cancelled tickets.", "expected_answer": "A full refund is provided if the ticket is cancelled 24 hours before the event.", "relevant_chunks": ["chunk_31", "chunk_32"]}
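For illustration, here is a minimal Python sketch of how such a file could be loaded for evaluation (the file name golden_dataset.jsonl is an assumption, not a standard):

import json

def load_golden_dataset(path):
    """Read a JSONL file where each line is one test case."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                cases.append(json.loads(line))
    return cases

golden = load_golden_dataset("golden_dataset.jsonl")
print(f"Loaded {len(golden)} test cases")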
Once the dataset is ready, we evaluate the RAG using these metrics:
- MRR – Mean Reciprocal Rank: How early did the correct answer appear?
- nDCG – Normalized Discounted Cumulative Gain: How good is the ranking quality?
- Recall@K: Out of all the right answers, how many were returned in the top K?
- Precision@K: Out of the top K results, how many were correct?
Example 1: MRR
You ask: “What is Apple’s CEO name?”
Search returns:
Apple iPhone news
Apple stock price
Tim Cook bio ← correct response is here
The correct answer is at rank 3, so MRR penalizes you: the reciprocal rank for this query is 1/3.
If it had appeared at rank 1, the reciprocal rank would be 1, the maximum. MRR is simply this value averaged across all the queries in your golden dataset.
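As a rough Python sketch (the document IDs below are placeholders), the reciprocal rank of a single query, and the mean across queries, can be computed like this:

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result; 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(per_query_results):
    """Average the reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in per_query_results) / len(per_query_results)

# Tim Cook example: the correct result sits at rank 3, so RR = 1/3
print(reciprocal_rank(["iphone_news", "stock_price", "tim_cook_bio"], {"tim_cook_bio"}))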
Example 2: nDCG
If a user searches for “iPhone”, the system should rank:
iPhone 15 Pro (very relevant)
iPhone 14 (relevant)
iPhone case (less relevant)
Samsung cover (not relevant)
nDCG checks the quality of ranking across the entire list, not just whether the right answer exists somewhere.
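A minimal sketch of the standard nDCG calculation, with assumed graded relevance scores (3 = very relevant down to 0 = not relevant) for the iPhone example:

import math

def dcg(relevances):
    """Discounted Cumulative Gain: each score is discounted by log2 of its position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG of the actual ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1, 0]))  # 1.0: the list is already in the ideal order
print(ndcg([0, 1, 2, 3]))  # ~0.61: the most relevant result was ranked last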
Example 3: Recall@K and Precision@K
Imagine you have a box with:
10 total chocolates
4 of them are your favourite flavor (relevant items)
The rest are flavors you don’t like (irrelevant items)
Your task:
Pick 5 chocolates from the box (this is like top K retrieval).
Your picks: {Fav, NotFav, Fav, NotFav, NotFav}
Recall@5 = 2/4 = 50% -> Did you find all the good chocolates?
Precision@5 = 2/5 = 40% -> Did you avoid the bad chocolates when picking?
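The same chocolate-box numbers, expressed as a small Python sketch (the item names are placeholders):

def recall_at_k(retrieved, relevant, k):
    """Out of all relevant items, how many showed up in the top K?"""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Out of the top K results, how many were relevant?"""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

picks = ["fav_1", "other_1", "fav_2", "other_2", "other_3"]   # the 5 chocolates you picked
favourites = {"fav_1", "fav_2", "fav_3", "fav_4"}             # the 4 relevant ones in the box
print(recall_at_k(picks, favourites, 5))     # 0.5  -> 50%
print(precision_at_k(picks, favourites, 5))  # 0.4  -> 40%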
What next? Optimizing RAG efficiency.
Once you establish baseline scores, the next step is improving RAG performance. You can fine-tune key parameters such as Chunk Size, Overlap Size, Embedding Model, and Number of Chunks Retrieved.
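One practical pattern, sketched below with hypothetical helpers build_index() and evaluate() (your retrieval stack will have its own equivalents), is to replay the golden dataset against a grid of configurations and keep the one with the best scores:

from itertools import product

# Hypothetical helpers, not part of any specific library:
# build_index(chunk_size, overlap) re-chunks and embeds the corpus,
# evaluate(index, golden, top_k) runs every golden-dataset query and
# returns the metrics discussed above, e.g. {"mrr": ..., "recall_at_k": ...}.
chunk_sizes = [256, 512, 1024]   # tokens per chunk
overlaps = [0, 50, 100]          # overlapping tokens between chunks
top_ks = [3, 5, 10]              # number of chunks retrieved

best_config, best_scores = None, None
for chunk_size, overlap, top_k in product(chunk_sizes, overlaps, top_ks):
    index = build_index(chunk_size=chunk_size, overlap=overlap)
    scores = evaluate(index, golden, top_k=top_k)
    if best_scores is None or scores["recall_at_k"] > best_scores["recall_at_k"]:
        best_config, best_scores = (chunk_size, overlap, top_k), scores

print("Best configuration:", best_config, best_scores)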
But real efficiency comes from looking at the entire workflow holistically, including pre-processing, data cleaning, semantic chunking, and query rewriting. All of these influence retrieval quality more than people realise.
But wait, should we even bother? With LLM context windows expanding every few months, many ask:
“Is RAG dead? Can’t we just shove all documents into the prompt?”
A tempting fantasy, but the architectural reality is far more nuanced. We’ll decode that debate in the final edition of this series. Till then… stay tuned.
Shammy Narayanan is the Vice President of Platform, Data, and AI at Welldoc. Holding 11 cloud certifications, he combines deep technical expertise with a strong passion for artificial intelligence. With over two decades of experience, he focuses on helping organizations navigate the evolving AI landscape. He can be reached at shammy45@gmail.com
