Evaluating RAG Systems: Retrieval Metrics, LLM Metrics and RAGAS
Building a RAG pipeline is only half the job. The harder question is: how good is it? Without a rigorous evaluation framework, you are flying blind, unable to tell whether a prompt change improved the system, whether a new embedding model helps, or whether your chunking strategy is losing relevant context.
RAG evaluation splits naturally into two distinct layers:
- Retrieval metrics: did the retrieval stage find the right documents?
- Generation metrics: given the retrieved context, did the LLM produce a correct, faithful, and complete answer?
RAGAS (Retrieval-Augmented Generation Assessment, arXiv:2309.15217) is a widely used open-source framework for evaluating both layers end to end. It uses an LLM as judge internally for the metrics that require semantic understanding, and provides a Python API built around a standardised EvaluationDataset.
1. Retrieval Metrics
Retrieval metrics evaluate whether the chunks surfaced by your retrieval stage are the right ones. They operate over a set of retrieved documents $R$ and a set of ground-truth relevant documents $G$.
These metrics assume you have labelled data: a set of questions, each annotated with the document chunks that are actually relevant. RAGAS can help generate this dataset automatically (see Section 3).
1a. Context Precision and Context Recall
The two most fundamental retrieval metrics are precision and recall, adapted for the RAG context window:
Context Precision measures what fraction of the retrieved chunks are actually relevant to answering the question. A high precision means the LLM is not distracted by irrelevant noise:
$$\text{Context Precision} = \frac{|R \cap G|}{|R|}$$
Context Recall measures what fraction of the ground-truth relevant chunks were actually retrieved. A high recall means the retriever is not missing key information the LLM needs:
$$\text{Context Recall} = \frac{|R \cap G|}{|G|}$$
The classic precision-recall trade-off applies here too: retrieving more chunks increases recall but dilutes precision. Tuning the top-$k$ parameter is essentially navigating this trade-off.
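The two set-based definitions can be sketched in a few lines of Python. This is a minimal illustration that assumes every chunk carries a stable ID; the function name and the chunk IDs are hypothetical, and it is not the LLM-judged computation that RAGAS performs.

```python
def context_precision_recall(retrieved, relevant):
    """Set-based precision and recall over chunk IDs (R = retrieved, G = relevant)."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # |R ∩ G|
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical IDs: 4 chunks retrieved, 3 labelled relevant, 2 in common.
precision, recall = context_precision_recall(
    retrieved=["c1", "c2", "c7", "c9"],
    relevant=["c1", "c2", "c3"],
)
print(precision)         # 0.5
print(round(recall, 3))  # 0.667
```

Retrieving a fifth, irrelevant chunk would lower precision to 0.4 and leave recall unchanged, which is the trade-off that tuning top-$k$ controls.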
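In RAGAS, both metrics are computed with an LLM as judge rather than by exact chunk-ID matching (which requires perfect labelling): for Context Precision the judge decides whether each retrieved chunk is relevant to the question, and for Context Recall it checks whether the claims in the ground-truth answer are covered by the retrieved context. This makes the scores robust to paraphrase and partial overlap. Note that the RAGAS implementation of Context Precision is also rank-aware (it averages precision@k over the positions of the relevant chunks), so its values can differ from the plain set ratio above.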
1b. MRR and nDCG: Ranking Quality
Precision and recall treat all retrieved documents equally. In practice, the order matters enormously: the first chunk in the context window is more likely to be read carefully by the LLM than the last one. Ranking-aware metrics capture this.
Mean Reciprocal Rank (MRR) measures how high the first relevant document appears in the ranked results. For a set of queries $Q$:
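$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
where $\text{rank}_i$ is the position of the first relevant document for query $i$ (a query with no relevant document in the results contributes 0). MRR = 1.0 means the first result is always relevant; MRR = 0.5 is what you get if the first relevant result always sits at position 2, although a mix of rank-1 hits and complete misses can produce the same average.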
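A minimal sketch of the MRR computation, assuming each query comes with a ranked list of document IDs and a set of relevant IDs (all names and IDs here are hypothetical). A query whose relevant documents never appear contributes a reciprocal rank of 0.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant document per query."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_results, relevant_sets):
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / position
                break  # only the first relevant document counts
    return total / len(ranked_results)


ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d5"}, {"d42"}]  # third query: nothing relevant retrieved
mrr = mean_reciprocal_rank(ranked, relevant)
print(mrr)  # (1 + 1/2 + 0) / 3 = 0.5
```

The example also shows why MRR = 0.5 does not by itself imply "rank 2 on average": here it comes from one hit at rank 1, one at rank 2, and one miss.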
Normalised Discounted Cumulative Gain (nDCG) generalises MRR to multiple relevant documents and supports graded relevance. It compares the ranking produced by your retriever against the ideal ranking:
$$\text{nDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}, \quad \text{where} \quad \text{DCG@k} = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i+1)}$$
nDCG is the standard metric on the BEIR and MTEB retrieval benchmarks: the numbers you see on leaderboards for embedding models are typically nDCG@10 scores.
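The nDCG formula can be sketched as follows. This is a simplified illustration: it assumes `relevances` holds the graded relevance of each returned document in retriever order, and it derives the ideal ranking by sorting that same list, whereas a full benchmark implementation would build the ideal ranking from all labelled documents for the query. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (1-indexed ranks)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG of the actual ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of the top-3 results: the second-best document is ranked last.
score = ndcg_at_k([3, 0, 2], k=3)
print(round(score, 3))  # 0.939
```

Swapping the last two results to get `[3, 2, 0]` yields the ideal ordering and a score of 1.0.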
1c. Hit Rate
Hit Rate@k is the simplest retrieval metric: for what fraction of queries does at least one relevant document appear in the top-$k$ results?
$$\text{Hit Rate@k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}[G_i \cap R_i^{(k)} \neq \emptyset]$$
It is a binary, all-or-nothing metric: it does not care whether there are one or ten relevant documents in the top-$k$, only that at least one is there. This makes it easy to interpret and quick to compute. Hit Rate@5 = 0.87 means 87% of queries had at least one relevant chunk in the top-5.
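A short sketch of Hit Rate@k under the same assumptions as the other retrieval examples (ranked lists of hypothetical document IDs plus a set of relevant IDs per query):

```python
def hit_rate_at_k(ranked_results, relevant_sets, k):
    """Fraction of queries with at least one relevant document in the top-k."""
    if not ranked_results:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_results, relevant_sets)
        if set(ranked[:k]) & set(relevant)  # non-empty intersection counts as a hit
    )
    return hits / len(ranked_results)


ranked = [
    ["d1", "d2", "d3"],
    ["d4", "d5", "d6"],
    ["d7", "d8", "d9"],
    ["d10", "d11", "d12"],
]
relevant = [{"d1"}, {"d6"}, {"d8", "d9"}, {"d99"}]

print(hit_rate_at_k(ranked, relevant, k=2))  # 0.5
print(hit_rate_at_k(ranked, relevant, k=3))  # 0.75
```

The third query has two relevant documents in its top-3 but still counts as a single hit, which is exactly the all-or-nothing behaviour described above.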
Retrieval metrics at a glance
| Metric | What it measures | Best for |
|---|---|---|
| Context Precision | Fraction of retrieved chunks that are relevant | Detecting noise in context window |
| Context Recall | Fraction of relevant chunks that were retrieved | Detecting missing context |
| Hit Rate@k | Any relevant doc in top-k? | Quick pass/fail signal per query |
| MRR | Rank of first relevant document | Single-answer retrieval quality |
| nDCG@k | Full ranked-list quality with discount | Embedding model benchmarking (BEIR/MTEB) |
2. Generation Metrics (LLM Evaluation)
Generation metrics evaluate the quality of the LLM's answer given the retrieved context. They split into two philosophies: reference-based (compare to a ground-truth answer) and reference-free (evaluate the answer against the context alone, without needing labelled answers).
RAGAS implements both families and uses an LLM judge for all semantic assessments.
2a. Faithfulness: Does the answer hallucinate?
Faithfulness is a critical generation metric for RAG. It measures whether every claim in the generated answer is supported by the retrieved context; that is, the answer does not introduce facts that are absent from the chunks.
RAGAS computes it by:
1. Using an LLM to decompose the generated answer into atomic statements (e.g., "The Eiffel Tower is 330 metres tall", "It was built in 1889").
2. For each statement, asking the judge LLM: "Is this statement supported by the provided context?"
3. Aggregating:
$$\text{Faithfulness} = \frac{\text{statements supported by context}}{\text{total statements in answer}}$$
A faithfulness score of 1.0 means every claim in the answer is backed by the context. As a rule of thumb, a score below 0.8 deserves investigation: the LLM is probably asserting things that go beyond the retrieved chunks. This metric is reference-free: you do not need a ground-truth answer to compute it.
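The aggregation step can be sketched as below. The statement decomposition and the support check are done by an LLM in RAGAS; here they are replaced by a hard-coded statement list and a naive substring check (`toy_judge`), both of which are stand-ins introduced purely for illustration.

```python
def faithfulness_score(statements, context, is_supported):
    """Share of atomic statements that the judge marks as supported by the context."""
    if not statements:
        return 0.0  # convention chosen for this sketch
    supported = sum(1 for statement in statements if is_supported(statement, context))
    return supported / len(statements)


def toy_judge(statement, context):
    """Stand-in for the judge LLM: naive case-insensitive substring match."""
    return statement.lower() in context.lower()


context = "The Eiffel Tower is 330 metres tall. It was built in 1889."
statements = [
    "The Eiffel Tower is 330 metres tall",
    "It was built in 1889",
    "It is located in Lyon",  # not in the context: counts against faithfulness
]
score = faithfulness_score(statements, context, toy_judge)
print(round(score, 2))  # 0.67
```

Passing the judge in as a callable keeps the arithmetic separate from the judging, so the substring check could be swapped for a real LLM call without touching the aggregation.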
2b. Answer Relevancy: Does the answer address the question?
Answer Relevancy checks whether the generated answer is actually on-topic for the user's question, penalising responses that are technically correct but evasive, incomplete, or off-target.
The RAGAS approach is clever: it asks the LLM to generate $n$ hypothetical questions that the given answer would address, then measures the average cosine similarity between those synthetic questions and the original question:
$$\text{Answer Relevancy} = \frac{1}{n} \sum_{i=1}^{n} \cos(\vec{q}_i^{\text{gen}}, \vec{q}^{\text{original}})$$
If the generated answer closely addresses the original question, the reverse-engineered questions will resemble the original. If the answer drifts off-topic, the synthetic questions will diverge. This metric is also reference-free: no ground-truth answer is needed.
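The averaging step looks like this in code. The sketch assumes the synthetic questions have already been generated and embedded; the two-dimensional vectors are toy values standing in for real embedding vectors, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy(generated_question_vecs, original_question_vec):
    """Mean cosine similarity between each synthetic question and the original."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(q, original_question_vec) for q in generated_question_vecs]
    return sum(sims) / len(sims)


original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]  # one on-topic question, one unrelated
relevancy = answer_relevancy(generated, original)
print(relevancy)  # 0.5
```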
2c. Context Utilisation
Context Utilisation measures whether the LLM actually used the retrieved context to generate its answer, as opposed to relying purely on parametric memory.
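Conceptually, it is computed by identifying the sentences in the retrieved context that were actually used to support the answer, and dividing by the total context size:
$$\text{Context Utilisation} = \frac{\text{sentences from context used in answer}}{\text{total sentences in context}}$$
Treat this ratio as a conceptual definition: recent RAGAS releases expose context utilisation as a reference-free variant of Context Precision, so library scores may not match it exactly.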
A low score here means one of two things: either the retrieved chunks were not relevant enough to be useful (retrieval problem), or the LLM ignored the context and answered from memory (generation problem). Cross-referencing with Faithfulness helps disambiguate.
2d. Answer Correctness: Reference-based accuracy
All the metrics above are reference-free: they evaluate the answer against the context, not against a ground-truth answer. Answer Correctness brings in the ground truth and measures how factually accurate the generated answer is relative to the expected answer.
RAGAS computes it as a weighted combination of:
- Semantic similarity between generated and reference answers (embedding cosine similarity).
- Factual overlap based on claim-level F1: how many factual claims from the reference are present in the generated answer, and vice versa.
$$\text{Answer Correctness} = w_1 \cdot \text{F1}_{\text{factual}} + w_2 \cdot \cos(\vec{a}_{\text{gen}}, \vec{a}_{\text{ref}})$$
This is the metric closest to traditional NLP evaluation (like ROUGE or BERTScore), but it is far more semantically aware. It requires a labelled dataset with ground-truth answers, which is where RAGAS's testset generation becomes critical.
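The weighted combination can be sketched as follows, assuming the claim-level counts (true positives, false positives, false negatives) and the embedding similarity have already been produced by the judge LLM and the embedding model. The weights 0.75 and 0.25 are illustrative values chosen for this example; check the configuration of the RAGAS version you use.

```python
def claim_f1(tp, fp, fn):
    """F1 over claims: tp = shared claims, fp = only in answer, fn = only in reference."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(tp, fp, fn, semantic_similarity, w_factual=0.75, w_semantic=0.25):
    """Weighted sum of factual F1 and semantic similarity (weights are assumptions)."""
    return w_factual * claim_f1(tp, fp, fn) + w_semantic * semantic_similarity


# 2 claims shared, 1 unsupported claim in the answer, 1 reference claim missing.
score = answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.9)
print(round(score, 3))  # 0.725
```

Note how the score stays well below the 0.9 embedding similarity: the factual term penalises both the extra claim and the missing one, which pure similarity tends to smooth over.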
Generation metrics at a glance
| Metric | What it measures | Reference-free? | Hallucination signal? | Best for |
|---|---|---|---|---|
| Faithfulness | Are all answer claims supported by context? | Yes | Primary signal | Detecting hallucinations |
| Answer Relevancy | Does the answer address the question? | Yes | Indirect | Detecting evasive/off-topic answers |
| Context Utilisation | How much of the context was actually used? | Yes | Indirect | Diagnosing retrieval–generation mismatch |
| Answer Correctness | Is the answer factually correct vs ground truth? | Needs reference | Yes | End-to-end accuracy with labelled data |
3. Building an Evaluation Dataset with RAGAS
The main bottleneck in RAG evaluation is the evaluation dataset: you need (question, ground-truth answer, relevant chunks) triples. Labelling these manually is expensive and slow. RAGAS includes a synthetic testset generator that automates this using your own document corpus.
[Figure: RAGAS synthetic testset generation pipeline]
3a. Question Types
RAGAS generates three types of questions to stress-test different parts of your pipeline:
- Simple questions: answered directly from a single chunk. Tests whether basic retrieval and short-form generation work correctly.
- Multi-hop questions: require synthesising information across two or more chunks. Stress-tests both the retriever (must surface multiple relevant chunks) and the LLM (must integrate disparate pieces of context).
- Abstract/reasoning questions: require inference beyond what is literally stated in the text. Tests higher-order reasoning and the LLM's ability to use context as evidence rather than just parroting it.
```python
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# DirectoryLoader lives in langchain_community in current LangChain releases
from langchain_community.document_loaders import DirectoryLoader

# ── Load your documents ──────────────────────────────────────────────────────
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# ── Configure the LLM and embeddings for generation ──────────────────────────
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# ── Create the generator ─────────────────────────────────────────────────────
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_emb)

# ── Generate the testset ─────────────────────────────────────────────────────
# query_distribution controls the mix of question types (weights sum to 1.0)
testset = generator.generate_with_langchain_docs(
    documents,
    testset_size=50,
    query_distribution=[
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
    ],
)

# ── Inspect the dataset ──────────────────────────────────────────────────────
df = testset.to_pandas()
print(df.columns.tolist())
# e.g. ['user_input', 'reference', 'reference_contexts', 'synthesizer_name']
print(df.head(3))
```
3b. The EvaluationDataset Schema
Every row in the generated testset (and in any manually constructed evaluation dataset) follows a standard schema that RAGAS metrics expect:
- user_input: the question the user asked.
- retrieved_contexts: the list of chunks returned by your retriever at evaluation time (not the ground-truth chunks).
- response: the answer generated by your LLM given the retrieved contexts.
- reference: the ground-truth answer (required for reference-based metrics such as Answer Correctness, and used by the LLM-judged Context Recall described in Section 1a).
- reference_contexts: the ground-truth relevant chunks (needed when Context Recall is computed by matching chunks directly rather than with an LLM judge).
The key insight is that retrieved_contexts and response are filled in at evaluation time by running your actual RAG pipeline. The generator only produces user_input, reference, and reference_contexts, that is, the ground-truth side. This separation means you can test the same dataset against multiple pipeline configurations.
EvaluationDataset fields and which metrics use them
| Field | Source | Used by |
|---|---|---|
| user_input | Generated or manual | All metrics |
| retrieved_contexts | Your RAG pipeline (at eval time) | Faithfulness, Context Precision, Context Recall, Context Utilisation |
| response | Your RAG pipeline (at eval time) | Faithfulness, Answer Relevancy, Answer Correctness |
| reference | Generated or manual ground truth | Answer Correctness, Context Recall (LLM-judged) |
| reference_contexts | Generated or manual ground truth | Context Recall (chunk-matching variant) |
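Filling in the pipeline-side fields can be sketched as a small loop over the ground-truth rows. Everything below is a stand-in for illustration: `build_eval_rows`, `toy_retrieve`, `toy_generate`, and the two-sentence corpus are hypothetical, and in practice the two callables would wrap your real retriever and LLM.

```python
def build_eval_rows(testset_rows, retrieve, generate, top_k=3):
    """Add retrieved_contexts and response to each ground-truth row."""
    rows = []
    for row in testset_rows:
        question = row["user_input"]
        contexts = retrieve(question, top_k)
        response = generate(question, contexts)
        # Keep the ground-truth fields and add the pipeline outputs next to them.
        rows.append({**row, "retrieved_contexts": contexts, "response": response})
    return rows


CORPUS = [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital of Germany.",
]


def toy_retrieve(question, top_k):
    """Stand-in retriever: rank corpus sentences by word overlap with the question."""
    words = set(question.lower().replace("?", "").split())
    ranked = sorted(
        CORPUS,
        key=lambda chunk: -len(words & set(chunk.lower().rstrip(".").split())),
    )
    return ranked[:top_k]


def toy_generate(question, contexts):
    """Stand-in generator: echo the top-ranked chunk."""
    return contexts[0] if contexts else "I don't know."


testset_rows = [{"user_input": "What is the capital of France?", "reference": "Paris"}]
rows = build_eval_rows(testset_rows, toy_retrieve, toy_generate, top_k=1)
print(rows[0]["retrieved_contexts"])
print(rows[0]["response"])
```

Because the ground-truth rows are never mutated, the same `testset_rows` can be replayed against a second pipeline configuration simply by passing different `retrieve` and `generate` callables.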
4. Running a Full RAGAS Evaluation
Once you have a dataset and have run your RAG pipeline to fill in retrieved_contexts and response, evaluating is a single function call.
```python
from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# ── Build your evaluation dataset ────────────────────────────────────────────
# In practice, run your RAG pipeline over testset rows to fill retrieved_contexts + response
samples = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris",
        "reference_contexts": ["Paris is the capital and largest city of France."],
    },
    # ... more rows
]
dataset = EvaluationDataset.from_list(samples)

# ── Configure the judge LLM ───────────────────────────────────────────────────
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
judge_emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# ── Run evaluation ────────────────────────────────────────────────────────────
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    ],
    llm=judge_llm,
    embeddings=judge_emb,
)
print(results)
# Example output (illustrative values):
# {'faithfulness': 0.94, 'answer_relevancy': 0.88,
#  'context_precision': 0.91, 'context_recall': 0.86,
#  'answer_correctness': 0.79}

# Per-sample breakdown
df = results.to_pandas()
print(df[["user_input", "faithfulness", "answer_relevancy", "context_precision"]])
```
5. Interpreting Results: What to Fix First
RAGAS scores alone don't tell you what to fix. The value comes from reading them together as a diagnostic matrix:
Diagnostic patterns, what metric combinations reveal
| Pattern | Likely root cause | What to try |
|---|---|---|
| Low Context Recall, high Faithfulness | Retriever misses relevant chunks | Better embeddings, hybrid search, larger top-k, enriched chunks |
| High Context Recall, low Faithfulness | LLM hallucinates despite good context | Stronger system prompt, better base model, reduce context noise |
| Low Context Precision, low Faithfulness | Irrelevant chunks distract the LLM | Reranker, smaller top-k, semantic chunking |
| High all retrieval metrics, low Answer Relevancy | LLM answers off-topic or evasively | Prompt engineering, instruction tuning |
| Low Answer Correctness only | Right idea but factual errors in generation | Better LLM, reference-aware prompting |
| Low Context Utilisation | LLM ignores retrieved context entirely | Stronger retrieval instruction in system prompt, check context formatting |
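A few rows of the matrix can be encoded as a small rule function, which is handy for annotating evaluation reports automatically. This is only a sketch: the `low` and `high` thresholds are arbitrary assumptions, the function name is hypothetical, and only four of the patterns are covered.

```python
def diagnose(scores, low=0.7, high=0.85):
    """Map a dict of aggregate scores to likely root causes (illustrative thresholds)."""
    recall = scores["context_recall"]
    precision = scores["context_precision"]
    faithful = scores["faithfulness"]
    relevancy = scores["answer_relevancy"]

    findings = []
    if recall < low and faithful >= high:
        findings.append("Retriever misses relevant chunks")
    if recall >= high and faithful < low:
        findings.append("LLM hallucinates despite good context")
    if precision < low and faithful < low:
        findings.append("Irrelevant chunks distract the LLM")
    if recall >= high and precision >= high and relevancy < low:
        findings.append("LLM answers off-topic or evasively")
    return findings


scores = {
    "context_recall": 0.55,
    "context_precision": 0.90,
    "faithfulness": 0.93,
    "answer_relevancy": 0.88,
}
print(diagnose(scores))  # ['Retriever misses relevant chunks']
```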
A mature RAG system is one that is continuously evaluated, not just at launch. RAGAS's synthetic testset generation makes it feasible to maintain a living evaluation suite that grows with your corpus. Run it on every significant change to your chunking strategy, embedding model, retriever configuration, or prompt, and treat a regression in Faithfulness or Context Recall as a breaking change, the same way you would treat a failing unit test.
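Treating a regression like a failing test can be as simple as comparing the current aggregate scores with a stored baseline. The sketch below is one possible gate: the function name, the guarded metrics, the tolerance of 0.02, and the score values are all assumptions made for this example.

```python
def find_regressions(baseline, current, guarded=("faithfulness", "context_recall"), tolerance=0.02):
    """Return guarded metrics whose score dropped by more than `tolerance`."""
    return {
        metric: (baseline[metric], current[metric])
        for metric in guarded
        if current[metric] < baseline[metric] - tolerance
    }


baseline = {"faithfulness": 0.94, "context_recall": 0.86}
current = {"faithfulness": 0.95, "context_recall": 0.80}

regressions = find_regressions(baseline, current)
print(regressions)  # {'context_recall': (0.86, 0.8)}

# In CI, a non-empty result would fail the build, for example:
# assert not regressions, f"RAG quality regression: {regressions}"
```

The tolerance absorbs run-to-run noise from the judge LLM; how large it needs to be depends on the size of the evaluation set and is worth measuring by running the same configuration several times.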
Interesting Papers & Resources
- RAGAS: Automated Evaluation of Retrieval Augmented Generation (Es et al., 2023)
- ARES: An Automated Evaluation Framework for RAG Systems (Saad-Falcon et al., 2023)
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of IR Models (Thakur et al., 2021)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
- RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge (Liu et al., 2023)
- TruLens – Alternative RAG evaluation framework (TruEra)
- DeepEval – Unit-testing framework for LLM outputs