Most teams ship RAG systems with no eval beyond "the demo answers look good." Six months in, a customer reports that the assistant confidently cited a policy that doesn't exist, and there's no signal in the team's metrics that anything was wrong. The fix is to measure three orthogonal dimensions: did the retriever find the right context, did the LLM stay faithful to the retrieved context, and did the final answer actually address the user's question. After shipping 15+ production RAG systems at NKKTech, these are the three metrics that catch regressions and the targets we hold our deployments to.
Why RAG Eval Is Harder Than LLM Eval
Pure-LLM eval has one stage to score (the model output). RAG has at least three: retrieval, augmentation, generation. A bad final answer can be caused by retrieval failure (wrong documents pulled), augmentation failure (right documents but bad chunking or reranking), or generation failure (right context but the LLM ignored or contradicted it). End-to-end eval can't tell you which. The three metrics below split the pipeline so a failure can be localized: precision scores what actually reaches the context window, which covers retrieval and the chunking and reranking in front of it, while faithfulness and relevance score what the LLM did with that context. When a regression lands you can diagnose in minutes instead of days.
Metric 1: Retrieval Precision
Precision@k = fraction of the top-k retrieved chunks that are actually relevant to the query, judged against a human-curated gold set. For each eval case, you specify which chunks (by document ID and span) should appear in the top-k. The retriever returns its top-k, and the score is the number of returned chunks that are in the gold set, divided by k. Note that a query with fewer than k gold chunks can never score 1.0, so a query with only three relevant chunks tops out at 0.6 for k=5; keep that in mind when reading the targets. Target: 0.7–0.85 at k=5 for B2B knowledge-base workloads; below 0.6 indicates a chunking or embedding-model problem. Implementation: a curated gold set of 50–100 query→relevant-chunks mappings, refreshed quarterly. Critical detail: precision must be measured with the same metadata filters production uses; benchmark precision without filters routinely overstates real-world precision by 20–30%.
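The calculation itself is small enough to sketch. The snippet below is a minimal illustration, not our production harness: the gold-set layout, the (document ID, span) tuples used as chunk IDs, and the function names are all assumptions made for the example. The `retrieved` lists are assumed to come from the production retriever with its metadata filters switched on.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[Hashable], relevant: Set[Hashable], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk IDs that appear in the gold set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved)[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    # Divide by k rather than len(top_k) so a retriever that returns
    # fewer than k chunks is not rewarded for it.
    return hits / k


def mean_precision_at_k(
    cases: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]], k: int = 5
) -> float:
    """Average precision@k over (retrieved, relevant) pairs from the gold set."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in cases]
    if not scores:
        raise ValueError("gold set is empty")
    return sum(scores) / len(scores)


# Hypothetical eval cases: chunk IDs are (document ID, span) tuples.
cases = [
    (
        [("refunds", "0-400"), ("faq", "0-400"), ("refunds", "400-800"),
         ("pricing", "0-400"), ("refunds", "800-1200")],
        {("refunds", "0-400"), ("refunds", "400-800"), ("refunds", "800-1200")},
    ),
    (
        [("sla", "0-400"), ("sla", "400-800"), ("sla", "800-1200"),
         ("sla", "1200-1600"), ("faq", "400-800")],
        {("sla", "0-400"), ("sla", "400-800"), ("sla", "800-1200"), ("sla", "1200-1600")},
    ),
]

print(f"mean precision@5 = {mean_precision_at_k(cases, k=5):.2f}")
```

The first case scores 3/5 and the second 4/5, which also shows the ceiling effect: a query with three gold chunks cannot go above 0.6 at k=5.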
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.
Metric 2: Faithfulness (No Hallucination)
Faithfulness = fraction of factual claims in the answer that are directly supported by the retrieved context. Measured with LLM-as-judge: pass the retrieved chunks + the generated answer to a separate scoring LLM (Claude or GPT-4o) with a prompt like "For each factual claim in the answer, is it directly supported by the provided context? Return a list with verdict + supporting snippet." Score is the fraction of supported claims. Target: 0.92+ for compliance-sensitive workloads (legal, medical, regulatory); 0.85+ for general B2B knowledge bases; below 0.8 means the generation model is hallucinating and you need either a stricter prompt, a smaller-context configuration (force the model to stay grounded), or a more capable model. This is the metric that catches the "confidently cited a policy that doesn't exist" failure mode.
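A rough sketch of the scoring side follows. It deliberately leaves out the actual judge call: `judge` is a hypothetical callable you would back with your scoring LLM, and `substring_judge` is only a toy stand-in so the example runs offline. The `ClaimVerdict` shape and the choice to score an answer with no factual claims as 1.0 are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool
    snippet: Optional[str] = None  # supporting text quoted from the context


# A judge takes (retrieved chunks, generated answer) and returns one verdict per claim.
Judge = Callable[[List[str], str], List[ClaimVerdict]]


def faithfulness(chunks: List[str], answer: str, judge: Judge) -> float:
    """Fraction of factual claims in the answer that the judge marks as supported."""
    verdicts = judge(chunks, answer)
    if not verdicts:
        # Assumption: an answer with no factual claims is treated as fully faithful.
        return 1.0
    return sum(1 for v in verdicts if v.supported) / len(verdicts)


def substring_judge(chunks: List[str], answer: str) -> List[ClaimVerdict]:
    """Toy stand-in for the LLM judge: a sentence counts as supported only
    if it appears verbatim in the retrieved context."""
    context = " ".join(chunks).lower()
    verdicts = []
    for sentence in answer.split("."):
        claim = sentence.strip()
        if not claim:
            continue
        supported = claim.lower() in context
        verdicts.append(ClaimVerdict(claim, supported, claim if supported else None))
    return verdicts


chunks = ["Refunds are issued within 14 days. Store credit never expires."]
answer = "Refunds are issued within 14 days. Refunds require manager approval."
print(faithfulness(chunks, answer, substring_judge))  # one of two claims is supported
```

Keeping the judge behind a plain callable means the same aggregation code works whether the verdicts come from Claude, GPT-4o, or a human reviewer spot-checking the judge.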
Metric 3: Answer Relevance
Relevance = how well the answer actually addresses the user's question, regardless of whether the retrieval was correct. A RAG system can have great retrieval and faithful generation but still produce useless answers if the LLM goes off on a tangent or misses the actual intent. Measured with LLM-as-judge: pass the original query + the generated answer to a scoring LLM with a rubric ("Does the answer address the user's specific question? Is it appropriately scoped — not too short, not padded?"). Score is 1–5; we typically aim for averages above 4.2. Low relevance + high faithfulness usually means a prompt-engineering problem — the system prompt is letting the LLM ramble or hedge instead of answering. Low relevance + low precision usually means a retrieval problem masquerading as a generation problem.
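The two diagnostic rules above can be written down as a small triage function. This is a sketch under stated assumptions: the cutoffs reuse numbers from this article (0.6 for precision, 0.8 for faithfulness, 4.2 for relevance), but treating them as the boundaries for "low" and "high" is our simplification for the example, as is checking precision before faithfulness. The last branch covers a combination the rules above do not address.

```python
from statistics import mean
from typing import Sequence


def average_relevance(judge_scores: Sequence[int]) -> float:
    """Mean of per-answer 1-5 relevance scores from the judge."""
    if any(score < 1 or score > 5 for score in judge_scores):
        raise ValueError("relevance scores must be between 1 and 5")
    return mean(judge_scores)


def diagnose(
    precision: float,
    faithfulness: float,
    relevance_avg: float,
    *,
    precision_floor: float = 0.6,      # assumed cutoff for "low precision"
    faithfulness_floor: float = 0.8,   # assumed cutoff for "high faithfulness"
    relevance_target: float = 4.2,
) -> str:
    """Map the three metrics to a first place to look when relevance is low."""
    if relevance_avg >= relevance_target:
        return "relevance on target"
    if precision < precision_floor:
        # Low relevance + low precision: retrieval problem showing up downstream.
        return "retrieval"
    if faithfulness >= faithfulness_floor:
        # Low relevance + high faithfulness: grounded but rambling or hedging.
        return "prompt"
    # Not covered by the two rules in the text; inspect faithfulness first.
    return "generation"


scores = [5, 4, 4, 5, 3]  # hypothetical judge output for five eval cases
print(f"average relevance = {average_relevance(scores):.1f}")
print(diagnose(precision=0.78, faithfulness=0.91, relevance_avg=3.6))
```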
Combining the Three Into a Release Gate
We use a composite RAG-Score = 0.4 × Precision@5 + 0.4 × Faithfulness + 0.2 × Relevance. Relevance is scored 1–5 while the other two terms run 0–1, so it has to be rescaled to 0–1 before it enters the sum (for example, (score − 1) / 4); left raw, a 4.2 average would contribute 0.84, more than the other two terms can contribute combined. Production deploys are blocked if RAG-Score drops more than 3% from the previous release on the full eval set. We also track each metric independently so we can diagnose where a regression came from. The composite score becomes the single number that goes in the PR description, daily-standup dashboard, and quarterly business review. For the broader RAG architectural picture (chunking, reranking, embedding-model selection, hybrid search), see our RAG Implementation Playbook for 2026.
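As a sketch, the gate reduces to a few lines. Two things here are assumptions rather than part of the formula above: the 1–5 relevance average is rescaled to 0–1 with (score − 1) / 4 so it shares a scale with the other two terms, and "drops more than 3%" is read as a relative drop against the previous release. Function names are illustrative.

```python
def rag_score(precision_at_5: float, faithfulness: float, relevance_1_to_5: float) -> float:
    """Composite score: 0.4 * precision + 0.4 * faithfulness + 0.2 * relevance."""
    # Assumption: map the 1-5 relevance average onto 0-1 before weighting.
    relevance = (relevance_1_to_5 - 1) / 4
    return 0.4 * precision_at_5 + 0.4 * faithfulness + 0.2 * relevance


def blocks_release(current: float, previous: float, max_drop: float = 0.03) -> bool:
    """True if the composite fell by more than max_drop relative to the previous release."""
    if previous <= 0:
        return False  # no meaningful baseline to compare against
    return (previous - current) / previous > max_drop


previous = rag_score(precision_at_5=0.80, faithfulness=0.90, relevance_1_to_5=4.2)
candidate = rag_score(precision_at_5=0.72, faithfulness=0.90, relevance_1_to_5=4.2)

print(f"previous={previous:.3f} candidate={candidate:.3f}")
print("block deploy" if blocks_release(candidate, previous) else "ok to deploy")
```

In this example only precision moved, from 0.80 to 0.72, which pulls the composite from 0.840 to 0.808, a relative drop of about 3.8%, so the deploy is blocked. The per-metric numbers then show the regression is on the retrieval side.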

10+ years building AI systems for Toyota, Sony, and Rakuten in Japan. Founded NKKTech in 2018 with a senior-only engineering model.
Want to build this with NKKTech?
Need help setting up RAG eval on a system you've already shipped? Book a free 30-minute review — we'll evaluate your current setup and suggest a minimum viable eval suite for your top use case.