Evaluating RAG Systems: Retrieval Metrics, LLM Metrics and RAGAS

Building a RAG pipeline is only half the job. The harder question is: how good is it? Without a rigorous evaluation framework, you are flying blind, unable to tell whether a prompt change improved the system, whether a new embedding model helps, or whether your chunking strategy is losing relevant context.

RAG evaluation splits naturally into two distinct layers:

- Retrieval metrics: did the retrieval stage find the right documents?

- Generation metrics: given the retrieved context, did the LLM produce a correct, faithful, and complete answer?

RAGAS (Retrieval-Augmented Generation Assessment, arxiv:2309.15217) is a widely used open-source framework for evaluating both layers end-to-end. It uses LLM-as-a-judge internally for the metrics that require semantic understanding, and provides a clean Python API over a standardised EvaluationDataset.


1. Retrieval Metrics

Retrieval metrics evaluate whether the chunks surfaced by your retrieval stage are the right ones. They operate over a set of retrieved documents $R$ and a set of ground-truth relevant documents $G$.

These metrics assume you have labelled data: a set of questions, each annotated with the document chunks that are actually relevant. RAGAS can help generate this dataset automatically (see Section 3).

1a. Context Precision and Context Recall

The two most fundamental retrieval metrics are precision and recall, adapted for the RAG context window:

Context Precision measures what fraction of the retrieved chunks are actually relevant to answering the question. A high precision means the LLM is not distracted by irrelevant noise:

$$\text{Context Precision} = \frac{|R \cap G|}{|R|}$$

Context Recall measures what fraction of the ground-truth relevant chunks were actually retrieved. A high recall means the retriever is not missing key information the LLM needs:

$$\text{Context Recall} = \frac{|R \cap G|}{|G|}$$

The classic precision-recall trade-off applies here too: retrieving more chunks increases recall but dilutes precision. Tuning the top-$k$ parameter is essentially navigating this trade-off.
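
To make the trade-off concrete, here is a minimal sketch of the set-based definitions above. It assumes chunks are identified by string IDs; the IDs and the helper name `context_precision_recall` are made up for illustration and are not part of RAGAS.

```python
def context_precision_recall(retrieved, relevant):
    """Set-based precision and recall over chunk IDs (R = retrieved, G = relevant)."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    overlap = len(retrieved_set & relevant_set)
    precision = overlap / len(retrieved_set) if retrieved_set else 0.0
    recall = overlap / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical ground truth and a ranked retriever output
relevant = ["c1", "c2", "c3"]
ranked = ["c1", "c2", "c7", "c3", "c9", "c4"]

# Growing top-k: recall climbs to 1.0 while precision falls
for k in (2, 4, 6):
    p, r = context_precision_recall(ranked[:k], relevant)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

With this toy ranking, going from k=2 to k=6 lifts recall from 0.67 to 1.00 while precision drops from 1.00 to 0.50.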

In RAGAS, both metrics are computed with LLM-as-judge: rather than exact chunk ID matching (which requires perfect labelling), the judge LLM determines whether each retrieved chunk is relevant to the question and whether the ground-truth answer's claims are covered by the retrieved context. This makes them robust to paraphrase and partial overlaps.

1b. MRR and nDCG: Ranking Quality

Precision and recall treat all retrieved documents equally. In practice, the order matters enormously: the first chunk in the context window is more likely to be read carefully by the LLM than the last one. Ranking-aware metrics capture this.

Mean Reciprocal Rank (MRR) measures how high the first relevant document appears in the ranked results. For a set of queries $Q$:

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

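where $\text{rank}_i$ is the position of the first relevant document for query $i$ (a query with no relevant result contributes 0). MRR = 1.0 means the first result is always relevant; MRR = 0.5 is what you get if the first relevant result always sits at position 2, although the same average can also come from a mix of better and worse ranks.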
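
A small sketch of the MRR formula, assuming each query comes with a ranked list of document IDs and a set of relevant IDs. The function name and the sample data are illustrative only.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document per query.

    Queries with no relevant document in their ranked list contribute 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Three hypothetical queries: hit at rank 1, hit at rank 2, no hit
ranked_lists = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant_sets = [{"a"}, {"e"}, {"z"}]

mrr = mean_reciprocal_rank(ranked_lists, relevant_sets)
print(f"MRR = {mrr:.2f}")  # (1 + 1/2 + 0) / 3
```

Here the three queries average to 0.50 even though none of them behaves like "the typical query", which is why it pays to look at the per-query ranks as well.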

Normalised Discounted Cumulative Gain (nDCG) generalises MRR to multiple relevant documents and supports graded relevance. It compares the ranking produced by your retriever against the ideal ranking:

$$\text{nDCG@k} = \frac{\text{DCG@k}}{\text{IDCG@k}}, \quad \text{where} \quad \text{DCG@k} = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i+1)}$$

nDCG is the standard metric on the BEIR and MTEB retrieval benchmarks: the retrieval scores you see on leaderboards for embedding models are typically nDCG@10.
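
The DCG and nDCG definitions translate almost directly into code. The sketch below takes the graded relevance of each retrieved document in rank order. One simplifying assumption: the ideal ranking is built by re-sorting the same retrieved list, whereas benchmark implementations derive IDCG from all judged documents for the query.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the best possible ordering.

    Assumption: the ideal ordering is the same list sorted by relevance.
    """
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0


# Hypothetical graded relevance (0 = irrelevant, 3 = highly relevant) by rank
retrieved_relevances = [3, 0, 2]
print(f"DCG@3  = {dcg_at_k(retrieved_relevances, 3):.3f}")
print(f"nDCG@3 = {ndcg_at_k(retrieved_relevances, 3):.3f}")
```

A perfectly ordered list scores 1.0; pushing the grade-2 document down to rank 3, as here, costs a few points even though the same documents were retrieved.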

1c. Hit Rate

Hit Rate@k is the simplest retrieval metric: for what fraction of queries does at least one relevant document appear in the top-$k$ results?

$$\text{Hit Rate@k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}[G_i \cap R_i^{(k)} \neq \emptyset]$$

It is a binary, all-or-nothing metric: it does not care whether there are one or ten relevant documents in the top-$k$, only that at least one is there. This makes it easy to interpret and quick to compute. Hit Rate@5 = 0.87 means 87% of queries had at least one relevant chunk in the top-5.
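
As a quick illustration of the indicator sum, the following sketch counts the queries whose top-k overlaps the ground truth. The function name and the toy data are assumptions for the example.

```python
def hit_rate_at_k(ranked_lists, relevant_sets, k):
    """Fraction of queries with at least one relevant document in the top k."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(ranked_lists)


# Three hypothetical queries: hit at rank 1, hit at rank 2, never a hit
ranked_lists = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant_sets = [{"a"}, {"e"}, {"z"}]

for k in (1, 2, 3):
    print(f"Hit Rate@{k} = {hit_rate_at_k(ranked_lists, relevant_sets, k):.2f}")
```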

Retrieval metrics at a glance

| Metric | What it measures | Best for |
| --- | --- | --- |
| Context Precision | Fraction of retrieved chunks that are relevant | Detecting noise in context window |
| Context Recall | Fraction of relevant chunks that were retrieved | Detecting missing context |
| Hit Rate@k | Any relevant doc in top-k? | Quick pass/fail signal per query |
| MRR | Rank of first relevant document | Single-answer retrieval quality |
| nDCG@k | Full ranked-list quality with discount | Embedding model benchmarking (BEIR/MTEB) |

2. Generation Metrics (LLM Evaluation)

Generation metrics evaluate the quality of the LLM's answer given the retrieved context. They split into two philosophies: reference-based (compare to a ground-truth answer) and reference-free (evaluate the answer against the context alone, without needing labelled answers).

RAGAS implements both families and uses an LLM judge for all semantic assessments.

2a. Faithfulness: Does the answer hallucinate?

Faithfulness is a critical generation metric for RAG. It measures whether every claim in the generated answer is supported by the retrieved context; in other words, whether the answer avoids introducing facts that are not present in the chunks.

RAGAS computes it by:

1. Using an LLM to decompose the generated answer into atomic statements (e.g., "The Eiffel Tower is 330 metres tall", "It was built in 1889").

2. For each statement, asking the judge LLM: "Is this statement supported by the provided context?"

3. Aggregating:

$$\text{Faithfulness} = \frac{\text{statements supported by context}}{\text{total statements in answer}}$$

A faithfulness score of 1.0 means every claim in the answer has a citation-equivalent in the context. As a rule of thumb, a score below 0.8 should be investigated: the LLM is likely hallucinating beyond the retrieved chunks. This metric is reference-free: you do not need a ground-truth answer to compute it.
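
The aggregation step can be sketched without any LLM by plugging in a stand-in judge. Everything here is illustrative: `faithfulness_score` and `toy_judge` are hypothetical names, and the substring check only mimics the yes/no verdict that RAGAS obtains from a judge LLM.

```python
def faithfulness_score(statements, context, is_supported):
    """Share of atomic statements that the judge marks as supported."""
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)


def toy_judge(statement, context):
    # Stand-in for the judge LLM: a case-insensitive substring match.
    # A real judge would reason about paraphrase and entailment.
    return statement.lower() in context.lower()


context = "The Eiffel Tower is 330 metres tall. It was completed in 1889."
statements = [
    "The Eiffel Tower is 330 metres tall",  # supported
    "It was completed in 1889",             # supported
    "It is painted green",                  # not in the context
]

score = faithfulness_score(statements, context, toy_judge)
print(f"Faithfulness = {score:.2f}")
```

Two of the three statements are supported, so the score is 0.67 and the unsupported claim is exactly the one you would want to inspect.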

2b. Answer Relevancy: Does the answer address the question?

Answer Relevancy checks whether the generated answer is actually on-topic for the user's question, penalising responses that are technically correct but evasive, incomplete, or off-target.

The RAGAS approach is clever: it asks the LLM to generate $n$ hypothetical questions that the given answer would address, then measures the average cosine similarity between those synthetic questions and the original question:

$$\text{Answer Relevancy} = \frac{1}{n} \sum_{i=1}^{n} \cos(\vec{q}_i^{\text{gen}}, \vec{q}^{\text{original}})$$

If the generated answer closely addresses the original question, the reverse-engineered questions will resemble the original. If the answer drifts off-topic, the synthetic questions will diverge. This metric is also reference-free: no ground-truth answer is needed.
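
The averaging step looks like this in code. The two-dimensional vectors are hand-made stand-ins for real question embeddings, and the function names are assumptions for the example; in RAGAS the vectors come from the configured embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy(generated_question_vecs, original_vec):
    """Mean cosine similarity between synthetic and original question vectors."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(q, original_vec) for q in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: on-topic questions point the same way as the original
original = [1.0, 0.0]
on_topic = [[1.0, 0.0], [2.0, 0.0]]
off_topic = [[0.0, 1.0], [0.0, 3.0]]

print(f"on-topic:  {answer_relevancy(on_topic, original):.2f}")
print(f"off-topic: {answer_relevancy(off_topic, original):.2f}")
```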

2c. Context Utilisation

Context Utilisation measures whether the LLM actually used the retrieved context to generate its answer, as opposed to relying purely on parametric memory.

RAGAS computes it by extracting the sentences from the retrieved context that were actually used to support the answer, and dividing by the total context size:

$$\text{Context Utilisation} = \frac{\text{sentences from context used in answer}}{\text{total sentences in context}}$$

A low score here means one of two things: either the retrieved chunks were not relevant enough to be useful (retrieval problem), or the LLM ignored the context and answered from memory (generation problem). Cross-referencing with Faithfulness helps disambiguate.

2d. Answer Correctness: Reference-based accuracy

All the metrics above are reference-free: they evaluate the answer against the context, not against a ground-truth answer. Answer Correctness brings in the ground truth and measures how factually accurate the generated answer is relative to the expected answer.

RAGAS computes it as a weighted combination of:

- Semantic similarity between generated and reference answers (embedding cosine similarity).

- Factual overlap based on claim-level F1: how many factual claims from the reference are present in the generated answer, and vice versa.

$$\text{Answer Correctness} = w_1 \cdot F1_{\text{factual}} + w_2 \cdot \cos(\vec{a}_{\text{gen}}, \vec{a}_{\text{ref}})$$

This is the metric closest to traditional NLP evaluation (like ROUGE or BERTScore), but it is far more semantically aware. It requires a labelled dataset with ground-truth answers, which is where RAGAS's testset generation becomes critical.
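
A worked example of the weighted combination, under stated assumptions: claims have already been classified as true positives (in both answers), false positives (only in the generated answer) and false negatives (only in the reference), and the 0.75/0.25 weights are an assumed default that you should check against your RAGAS version.

```python
def claim_f1(tp, fp, fn):
    """Claim-level F1 from true-positive, false-positive and false-negative counts."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness(tp, fp, fn, semantic_similarity,
                       w_factual=0.75, w_semantic=0.25):
    """Weighted sum of factual F1 and embedding similarity.

    The default weights are an assumption for this sketch.
    """
    return w_factual * claim_f1(tp, fp, fn) + w_semantic * semantic_similarity


# Hypothetical sample: 2 shared claims, 1 extra claim, 1 missing claim,
# and an embedding cosine similarity of 0.9 between the two answers.
score = answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.9)
print(f"Answer Correctness = {score:.3f}")  # 0.75 * (2/3) + 0.25 * 0.9
```

The factual term dominates with these weights, so an answer that sounds similar to the reference but gets claims wrong is still penalised.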

Generation metrics at a glance

| Metric | What it measures | Reference-free? | Hallucination signal? | Best for |
| --- | --- | --- | --- | --- |
| Faithfulness | Are all answer claims supported by context? | Yes | Primary signal | Detecting hallucinations |
| Answer Relevancy | Does the answer address the question? | Yes | Indirect | Detecting evasive/off-topic answers |
| Context Utilisation | How much of the context was actually used? | Yes | Indirect | Diagnosing retrieval–generation mismatch |
| Answer Correctness | Is the answer factually correct vs ground truth? | No (needs reference) | Yes | End-to-end accuracy with labelled data |

3. Building an Evaluation Dataset with RAGAS

The main bottleneck in RAG evaluation is the evaluation dataset: you need (question, ground-truth answer, relevant chunks) triples. Labelling these manually is expensive and slow. RAGAS includes a synthetic testset generator that automates this using your own document corpus.

[Figure: RAGAS synthetic testset generation pipeline]

3a. Question Types

RAGAS generates three types of questions to stress-test different parts of your pipeline:

- Simple questions: answered directly from a single chunk. Tests whether basic retrieval and short-form generation work correctly.

- Multi-hop questions: require synthesising information across two or more chunks. Stress-tests both the retriever (must surface multiple relevant chunks) and the LLM (must integrate disparate pieces of context).

- Abstract/reasoning questions: require inference beyond what is literally stated in the text. Tests higher-order reasoning and the LLM's ability to use context as evidence rather than just parroting it.

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader  # loaders live in langchain_community in recent LangChain releases

# ── Load your documents ──────────────────────────────────────────────────────
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# ── Configure the LLM and embeddings for generation ──────────────────────────
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# ── Create the generator ─────────────────────────────────────────────────────
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_emb)

# ── Generate the testset ─────────────────────────────────────────────────────
# Distributions control the mix of question types
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

testset = generator.generate_with_langchain_docs(
    documents,
    testset_size=50,
    query_distribution=[
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
    ],
)

# ── Inspect the dataset ──────────────────────────────────────────────────────
df = testset.to_pandas()
print(df.columns.tolist())
# ['user_input', 'reference', 'reference_contexts', 'synthesizer_name']
print(df.head(3))

3b. The EvaluationDataset Schema

Every row in the generated testset (and in any manually constructed evaluation dataset) follows a standard schema that RAGAS metrics expect:

- user_input: the question the user asked.

- retrieved_contexts: the list of chunks returned by your retriever at evaluation time (not the ground-truth chunks).

- response: the answer generated by your LLM given the retrieved contexts.

- reference: the ground-truth answer (required only for reference-based metrics like Answer Correctness).

- reference_contexts: the ground-truth relevant chunks (required for Context Recall when it is computed by matching retrieved chunks against labelled chunks; the LLM-judged variant described in Section 1a checks the retrieved context against reference instead).

The key insight is that retrieved_contexts and response are filled in at evaluation time by running your actual RAG pipeline. The generator only produces the ground-truth side: user_input, reference, and reference_contexts. This separation means you can test the same dataset against multiple pipeline configurations.

EvaluationDataset fields and which metrics use them

| Field | Source | Used by |
| --- | --- | --- |
| user_input | Generated or manual | All metrics |
| retrieved_contexts | Your RAG pipeline (at eval time) | Faithfulness, Context Precision, Context Utilisation |
| response | Your RAG pipeline (at eval time) | Faithfulness, Answer Relevancy, Answer Correctness |
| reference | Generated or manual ground truth | Answer Correctness |
| reference_contexts | Generated or manual ground truth | Context Recall |
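
Filling in the pipeline-side fields is a plain loop over the ground-truth rows. The sketch below assumes your pipeline exposes two callables, here named `retrieve` and `generate`; both names and the toy implementations are hypothetical placeholders for your own retriever and LLM call.

```python
def build_eval_rows(testset_rows, retrieve, generate):
    """Attach retrieved_contexts and response to each ground-truth row.

    `retrieve(question)` must return a list of chunk strings and
    `generate(question, contexts)` must return the answer string.
    The input rows are copied, not modified.
    """
    rows = []
    for row in testset_rows:
        contexts = retrieve(row["user_input"])
        rows.append({
            **row,
            "retrieved_contexts": contexts,
            "response": generate(row["user_input"], contexts),
        })
    return rows


# Toy stand-ins for a real retriever and LLM call
def toy_retrieve(question):
    return ["Paris is the capital and largest city of France."]


def toy_generate(question, contexts):
    return "The capital of France is Paris."


testset_rows = [{
    "user_input": "What is the capital of France?",
    "reference": "Paris",
    "reference_contexts": ["Paris is the capital and largest city of France."],
}]

rows = build_eval_rows(testset_rows, toy_retrieve, toy_generate)
print(sorted(rows[0].keys()))
```

Because the ground-truth rows are left untouched, the same testset can be replayed against a second pipeline configuration simply by passing different callables.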

4. Running a Full RAGAS Evaluation

Once you have a dataset and have run your RAG pipeline to fill in retrieved_contexts and response, evaluating is a single function call.

from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# ── Build your evaluation dataset ────────────────────────────────────────────
# In practice, run your RAG pipeline over testset rows to fill retrieved_contexts + response
samples = [
    {
        "user_input": "What is the capital of France?",
        "retrieved_contexts": ["Paris is the capital and largest city of France."],
        "response": "The capital of France is Paris.",
        "reference": "Paris",
        "reference_contexts": ["Paris is the capital and largest city of France."],
    },
    # ... more rows
]

dataset = EvaluationDataset.from_list(samples)

# ── Configure the judge LLM ───────────────────────────────────────────────────
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
judge_emb  = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# ── Run evaluation ────────────────────────────────────────────────────────────
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    ],
    llm=judge_llm,
    embeddings=judge_emb,
)

print(results)
# Example output (illustrative values): {'faithfulness': 0.94, 'answer_relevancy': 0.88,
#          'context_precision': 0.91, 'context_recall': 0.86,
#          'answer_correctness': 0.79}

# Per-sample breakdown
df = results.to_pandas()
print(df[["user_input", "faithfulness", "answer_relevancy", "context_precision"]])

5. Interpreting Results: What to Fix First

RAGAS scores alone don't tell you what to fix. The value comes from reading them together as a diagnostic matrix:

Diagnostic patterns: what metric combinations reveal

| Pattern | Likely root cause | What to try |
| --- | --- | --- |
| Low Context Recall, high Faithfulness | Retriever misses relevant chunks | Better embeddings, hybrid search, larger top-k, enriched chunks |
| High Context Recall, low Faithfulness | LLM hallucinates despite good context | Stronger system prompt, better base model, reduce context noise |
| Low Context Precision, low Faithfulness | Irrelevant chunks distract the LLM | Reranker, smaller top-k, semantic chunking |
| High all retrieval metrics, low Answer Relevancy | LLM answers off-topic or evasively | Prompt engineering, instruction tuning |
| Low Answer Correctness only | Right idea but factual errors in generation | Better LLM, reference-aware prompting |
| Low Context Utilisation | LLM ignores retrieved context entirely | Stronger retrieval instruction in system prompt, check context formatting |
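
The first four patterns can be encoded as simple rules over a score dictionary. This is a sketch, not part of RAGAS: the `diagnose` helper is hypothetical, and the single 0.8 cut-off between "low" and "high" is an assumption you would tune per metric.

```python
def diagnose(scores, threshold=0.8):
    """Map metric combinations to likely root causes.

    The 0.8 threshold is an illustrative assumption, not a RAGAS default.
    Metrics missing from `scores` are treated as healthy.
    """
    def low(metric):
        return scores.get(metric, 1.0) < threshold

    findings = []
    if low("context_recall") and not low("faithfulness"):
        findings.append("Retriever misses relevant chunks")
    if not low("context_recall") and low("faithfulness"):
        findings.append("LLM hallucinates despite good context")
    if low("context_precision") and low("faithfulness"):
        findings.append("Irrelevant chunks distract the LLM")
    retrieval_ok = not low("context_precision") and not low("context_recall")
    if retrieval_ok and low("answer_relevancy"):
        findings.append("LLM answers off-topic or evasively")
    return findings


# Hypothetical aggregate scores from an evaluation run
scores = {
    "faithfulness": 0.94,
    "answer_relevancy": 0.88,
    "context_precision": 0.91,
    "context_recall": 0.62,
}
print(diagnose(scores))
```

More than one rule can fire at once, which is usually a hint to fix retrieval first and re-evaluate before touching the prompt.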

A mature RAG system is one that is continuously evaluated, not just at launch. RAGAS's synthetic testset generation makes it feasible to maintain a living evaluation suite that grows with your corpus. Run it on every significant change to your chunking strategy, embedding model, retriever configuration, or prompt, and treat a regression in Faithfulness or Context Recall as a breaking change, the same way you would treat a failing unit test.
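
One way to treat a regression like a failing test is a small gate that compares the current run against a stored baseline. The helper name, the guarded metrics and the 0.02 tolerance are all assumptions for this sketch; the tolerance in particular should reflect how noisy your judge LLM is.

```python
def find_regressions(baseline, current,
                     guarded=("faithfulness", "context_recall"),
                     tolerance=0.02):
    """Return guarded metrics that dropped by more than `tolerance`.

    Maps each regressed metric name to a (baseline, current) tuple.
    """
    return {
        metric: (baseline[metric], current[metric])
        for metric in guarded
        if metric in baseline
        and metric in current
        and current[metric] < baseline[metric] - tolerance
    }


# Hypothetical scores before and after a chunking change
baseline = {"faithfulness": 0.94, "context_recall": 0.86}
current = {"faithfulness": 0.95, "context_recall": 0.80}

regressions = find_regressions(baseline, current)
if regressions:
    print(f"Regression detected: {regressions}")
```

In a CI job the final `if` would raise or exit non-zero instead of printing, so that a drop in Context Recall blocks the change.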

Interesting Papers & Resources