Evaluating Retrieval-Augmented Generation Systems
and Fine-Tuned Large Language Models
An introduction with examples, mathematical derivation, and discussion of common evaluation metrics.
Introduction & Foundations
1.1 What is Retrieval-Augmented Generation?
Retrieval-Augmented Generation (RAG) denotes an architectural pattern that extends the static knowledge base of a Large Language Model (LLM) with dynamic access to external document collections. Formally, the process is described as a three-stage pipeline of retrieval, augmentation, and generation:
$$c = \mathcal{R}(q, \mathcal{D}), \qquad \hat{y} = \mathcal{G}(q, c)$$
Here $q$ denotes the query, $\mathcal{D}$ the document collection, $\mathcal{R}$ the retriever (e.g., a dense bi-encoder with cosine similarity), $c$ the retrieved context, and $\mathcal{G}$ the generator (the LLM), which produces the answer $\hat{y}$ from the query augmented with $c$. The composition $\mathcal{G} \circ \mathcal{R}$ unifies parametric knowledge (in the model weights) with non-parametric knowledge (in the index), a concept discussed in the literature as hybrid memory (Lewis et al., 2020).
Application example: RAG in manufacturing engineering
Query: "How do I optimize cutting speed for CNC milling of aluminum?"
Retrieved documents (top-2):
- $d_1$: "For AlMg3 we recommend cutting speeds of 200–300 m/min..."
- $d_2$: "The feed rate for aluminum should be 0.1–0.3 mm/tooth..."
Generated answer: "For CNC milling of aluminum I recommend 200–300 m/min with HSS cutters and a feed rate of 0.1–0.3 mm/tooth."
Note: answer quality depends both on retrieval quality and on generator faithfulness. Both components require separate metrics.
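The retrieve-then-generate composition can be sketched in a few lines. This is a toy illustration only: the bag-of-words retriever stands in for a dense bi-encoder, the `generate` stub merely assembles the prompt an LLM would receive, and the corpus `DOCS` (including the unrelated `d3`) and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; d1 and d2 paraphrase the example above, d3 is a distractor.
DOCS = {
    "d1": "For AlMg3 we recommend cutting speeds of 200-300 m/min",
    "d2": "The feed rate for aluminum should be 0.1-0.3 mm/tooth",
    "d3": "Heat treatment of steel requires controlled cooling",
}

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=2):
    """Stand-in for the retriever R: rank document ids by similarity to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(docs[d]))), reverse=True)
    return ranked[:k]

def generate(query, context):
    """Stand-in for the generator G: a real system would send this prompt to an LLM."""
    return "Question: " + query + "\nContext:\n" + "\n".join(context)

def rag(query, docs, k=2):
    """Compose retriever and generator: y_hat = G(q, R(q, D))."""
    ids = retrieve(query, docs, k)
    return ids, generate(query, [docs[i] for i in ids])

ids, prompt = rag("How do I optimize cutting speed for CNC milling of aluminum?", DOCS)
print(ids)  # ['d1', 'd2']
```

Because the two stages are separate functions, each can be scored in isolation, which is the reason retrieval and generation metrics are treated separately in what follows.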
1.2 RAG vs. Fine-Tuning: A conceptual comparison
RAG approach
- Knowledge externally indexed (non-parametric)
- Updates possible without retraining
- Source citation and traceability
- Higher inference latency (retrieval step)
- Suited to fact-intensive domains
Fine-tuning approach
- Knowledge in model weights (parametric)
- Updates require retraining
- No intrinsic source attribution
- Lower inference latency
- Suited to style, format, and domain adaptation
Design decision in practice
The two approaches are complementary, not competing. Empirical studies (Ovadia et al., 2024) show that RAG is superior for fact-based tasks, while fine-tuning is preferable for stylistic adaptation and latency-critical applications. The hybrid approach (fine-tuning on domain style plus RAG for current facts) is state-of-the-art in production systems.
Retrieval Metrics
Retrieval metrics evaluate the quality of the component $\mathcal{R}$. They differ in whether they (a) consider pure set relations (precision, recall), (b) account for position (MRR), or (c) incorporate graded relevance (NDCG).
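2.1 Precision@K and Recall@K
$$\text{Precision@}K = \frac{|\mathcal{R}_K \cap \mathcal{V}|}{K}, \qquad \text{Recall@}K = \frac{|\mathcal{R}_K \cap \mathcal{V}|}{|\mathcal{V}|}$$
Here $\mathcal{R}_K$ is the set of top-$K$ retrieved documents and $\mathcal{V}$ is the set of all relevant documents in the corpus (ground truth).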
Statistical interpretation
Precision@K is the empirical estimator of the conditional probability $P(\text{relevant} \mid \text{retrieved in top-}K)$. Recall@K estimates $P(\text{retrieved in top-}K \mid \text{relevant})$. The two are not interchangeable: a system with high precision but low recall is "precise but incomplete"; conversely, a system with high recall but low precision returns many relevant hits but with a high noise fraction.
The F1 score $F_1 = 2 \cdot \frac{P \cdot R}{P+R}$ is the harmonic mean of precision $P$ and recall $R$. It never exceeds their arithmetic mean and penalizes imbalance between the two more strongly.
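A small numeric check makes the effect of the harmonic mean concrete. The precision and recall values below are made up for illustration; both pairs share the same arithmetic mean of 0.5.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.5, 0.5)     # balanced system
imbalanced = f1_score(0.9, 0.1)   # same arithmetic mean, strongly imbalanced
print(round(balanced, 2), round(imbalanced, 2))  # 0.5 0.18
```

The imbalanced system drops from 0.5 to 0.18, which is the sense in which F1 is the more pessimistic summary.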
Worked example: Precision@5 (Thiele-Small parameters)
Query: "How do I determine the Thiele-Small parameters of a loudspeaker?"
| Rank | Document | Relevant? |
|---|---|---|
| 1 | "Thiele-Small measurement via impedance analysis" | ✓ |
| 2 | "Enclosure design for bass loudspeakers" | ✗ |
| 3 | "Electromechanical modeling" | ✓ |
| 4 | "Crossover design" | ✗ |
| 5 | "Determining $Q_{ts}$, $V_{as}$ and $f_s$" | ✓ |
Relevant documents in the top 5: 3 (ranks 1, 3, 5), so $\text{Precision@}5 = 3/5 = 0.60$.
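2.2 Mean Reciprocal Rank (MRR)
$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\operatorname{rank}_i}$$
Here $Q$ is the set of test queries and $\operatorname{rank}_i$ denotes the position of the first relevant document for query $q_i$. MRR is especially meaningful when the user typically reads only the first result (search UIs, FAQ bots).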
Worked example: MRR over 4 queries
| Query | First relevant rank | Reciprocal |
|---|---|---|
| $q_1$: "Thiele-Small parameters" | 1 | $1/1 = 1.000$ |
| $q_2$: "FEM simulation" | 3 | $1/3 = 0.333$ |
| $q_3$: "ISO tolerance calculation" | 2 | $1/2 = 0.500$ |
| $q_4$: "Heat treatment of steel" | 1 | $1/1 = 1.000$ |
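Averaging the reciprocal ranks from the table gives the MRR for this query set. The snippet is a minimal sketch; the variable names are chosen for this example only.

```python
# First relevant rank per query (q1..q4), taken from the table above.
first_relevant_ranks = [1, 3, 2, 1]

reciprocals = [1.0 / rank for rank in first_relevant_ranks]
mrr = sum(reciprocals) / len(reciprocals)
print(round(mrr, 3))  # 0.708
```

On average, the first relevant document therefore sits between rank 1 and rank 2.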
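2.3 Normalized Discounted Cumulative Gain (NDCG)
$$\text{DCG@}K = \sum_{i=1}^{K} \frac{2^{\operatorname{rel}_i} - 1}{\log_2(i+1)}, \qquad \text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$
Here $\operatorname{rel}_i$ is the relevance grade of the document at rank $i$, and $\text{IDCG@}K$ is the DCG of the ideal ordering (grades sorted in descending order). NDCG is the only one of the metrics presented here that admits graded relevance (e.g., $\operatorname{rel}_i \in \{0,1,2,3\}$ for "not relevant", "marginal", "relevant", "highly relevant"). The logarithm in the denominator implements position discounting: later hits contribute less to the overall score.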
Worked example: NDCG@5 (AOI defect detection)
| Rank $i$ | Document | $\operatorname{rel}_i$ | $2^{\operatorname{rel}_i}-1$ | $\log_2(i+1)$ | Discounted gain |
|---|---|---|---|---|---|
| 1 | "Deep learning for AOI" | 3 | 7 | 1.000 | 7.000 |
| 2 | "Introduction to QA" | 0 | 0 | 1.585 | 0.000 |
| 3 | "AOI illumination" | 2 | 3 | 2.000 | 1.500 |
| 4 | "Statistics" | 1 | 1 | 2.322 | 0.431 |
| 5 | "CNN defect classification" | 3 | 7 | 2.585 | 2.708 |
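Summing the last column and normalizing by the ideal ordering completes the example. The sketch below assumes that the ideal ranking is obtained by sorting the same five grades in descending order, i.e. that no better document exists outside the retrieved list.

```python
import math

def dcg(grades):
    """Discounted cumulative gain with exponential gain 2^rel - 1 and log2(i + 1) discount."""
    return sum((2 ** g - 1) / math.log2(i + 1) for i, g in enumerate(grades, start=1))

grades = [3, 0, 2, 1, 3]                    # rel_i for ranks 1..5 from the table
actual = dcg(grades)                        # 7 + 0 + 1.5 + 0.431 + 2.708
ideal = dcg(sorted(grades, reverse=True))   # ideal ordering (3, 3, 2, 1, 0)
ndcg = actual / ideal
print(round(actual, 3), round(ideal, 3), round(ndcg, 3))  # 11.639 13.347 0.872
```

The ranking loses roughly 13% of the attainable gain, mainly because the second highly relevant document appears only at rank 5.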
2.4 Confusion matrix & precision-recall curve
A complete diagnosis of retrieval behavior requires examining the full precision-recall trade-off across cutoffs $K$, not just a single operating point.
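One way to obtain the points of such a trade-off curve is to sweep the cutoff $K$ over a ranked list. The sketch below uses a hypothetical relevance pattern and assumes, purely for illustration, that the corpus contains 4 relevant documents in total.

```python
def pr_curve(relevance_flags, total_relevant):
    """Return one (k, precision@k, recall@k) tuple per cutoff k."""
    points = []
    hits = 0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        hits += int(is_relevant)
        points.append((k, hits / k, hits / total_relevant))
    return points

# Hypothetical ranked list: relevant documents at ranks 1, 3, and 5.
flags = [True, False, True, False, True]
curve = pr_curve(flags, total_relevant=4)
for k, precision, recall in curve:
    print(f"K={k}: precision={precision:.2f} recall={recall:.2f}")
```

Recall can only grow as $K$ increases, while precision moves up and down depending on whether the newly admitted document is relevant.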
Critical limitation
All four metrics presuppose that ground-truth relevance is known, which in practice requires expensive annotation. In large corpora, therefore, only a sample is typically annotated, leading to wide confidence intervals. Furthermore, semantically similar but lexically different documents are often incorrectly marked as irrelevant (annotator bias).
RAGAS Framework: Generation Metrics
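The RAGAS framework is largely reference-free: most of its metrics require no human-annotated gold answers and instead use a "judge LLM" to operationalize the evaluation criteria. Context recall is the exception, since it compares the retrieved context against a gold answer. The four main metrics systematically cover both pipeline components: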
| Metric | Compares | Quantifies | Component |
|---|---|---|---|
| Faithfulness | Answer vs. context | Faithfulness / hallucination | Generator $\mathcal{G}$ |
| Answer relevance | Answer vs. query | Topic alignment | Generator $\mathcal{G}$ |
| Context precision | Context vs. query | Signal/noise in context | Retriever $\mathcal{R}$ |
| Context recall | Context vs. gold answer | Completeness | Retriever $\mathcal{R}$ |
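3.1 Faithfulness
$$F = \frac{|\{s \in S(\hat y) : s \text{ is supported by } c\}|}{|S(\hat y)|}$$
Here $S(\hat y)$ denotes the set of atomic claims into which the generated answer $\hat y$ is decomposed; each claim is individually checked against the context $c$. A claim is considered "supported" if it is logically derivable from $c$.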
Worked example: Faithfulness (materials engineering)
Context $c$: "42CrMo4 is a tempered steel with a tensile strength of 900–1100 MPa. Hardness 28–34 HRC. Weldable with preheating."
Answer $\hat y$: "42CrMo4 has a tensile strength of 900–1100 MPa, a hardness of 28–34 HRC, is weldable with preheating, and is highly corrosion-resistant."
| # | Atomic claim | Supported by $c$? |
|---|---|---|
| 1 | Tensile strength 900–1100 MPa | ✓ |
| 2 | Hardness 28–34 HRC | ✓ |
| 3 | Weldable with preheating | ✓ |
| 4 | Corrosion-resistant | ✗ Hallucination! |
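As a worked step, and assuming faithfulness is scored as the share of supported claims among all atomic claims of the answer, the table yields three supported claims out of four:

$$F = \frac{3}{4} = 0.75$$

The single unsupported claim about corrosion resistance therefore costs a quarter of the score.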
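3.2 Answer Relevance
$$\text{AR} = \frac{1}{n} \sum_{j=1}^{n} \cos\big(E(\tilde q_j),\, E(q)\big)$$
From the generated answer $\hat y$, $n$ synthetic questions $\tilde q_j$ are back-generated. The mean cosine similarity of their embeddings $E(\tilde q_j)$ to the embedding $E(q)$ of the original query $q$ measures whether the answer addresses the right question.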
Worked example: Answer Relevance (PID controller)
Query $q$: "How do I tune a PID controller?"
| Synthetic question $\tilde q_j$ | Cosine similarity |
|---|---|
| "How do I tune PID with Ziegler-Nichols?" | 0.92 |
| "What are $K_p$, $T_i$, $T_d$?" | 0.68 |
| "Who invented the PID controller?" | 0.25 |
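As a worked step, and assuming the score is the plain mean of the three cosine similarities in the table (the abbreviation AR is used here for answer relevance):

$$\text{AR} = \frac{0.92 + 0.68 + 0.25}{3} = \frac{1.85}{3} \approx 0.617$$

The off-topic third question ("Who invented the PID controller?") pulls the mean down noticeably, which signals that part of the answer drifts away from the tuning question.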
3.3 Context Recall
$$\text{CR} = \frac{|\{s \in S(y^*) : s \text{ is supported by } c\}|}{|S(y^*)|}$$
Context recall (CR) is the fraction of atomic claims from the gold answer $y^*$ that is covered by the retrieved context $c$. Low CR → retriever delivers incomplete context → the generator cannot deliver a complete answer even with perfect faithfulness.
Worked example: Context Recall (welding defects)
Gold answer $y^*$ covers 6 claims: porosity, undercut, lack of fusion, cracks, spatter, root defects
Retrieved context $c$ contains: porosity, undercut, lack of fusion, spatter (4 of 6)
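The ratio for this example can be checked with plain set operations. This is a simplification: it assumes each claim reduces to a single keyword that either appears in the context or does not, whereas a RAGAS-style evaluation would let a judge LLM decide coverage.

```python
# Claims of the gold answer and claims covered by the retrieved context (from the example).
gold_claims = {"porosity", "undercut", "lack of fusion", "cracks", "spatter", "root defects"}
context_claims = {"porosity", "undercut", "lack of fusion", "spatter"}

covered = gold_claims & context_claims
missing = gold_claims - context_claims
context_recall = len(covered) / len(gold_claims)
print(round(context_recall, 3), sorted(missing))  # 0.667 ['cracks', 'root defects']
```

Listing the missing claims alongside the score shows directly which topics the retriever failed to surface.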
Diagnostic use of the RAGAS metrics
The four metrics enable causal diagnosis: low context recall → retriever problem. Low faithfulness with good context recall → generator hallucinates. Low answer relevance → generator answers a different question. This modularity is the central advance over monolithic end-to-end metrics.
Fine-Tuning Metrics
Classical text-generation metrics evaluate how well a generated hypothesis $\hat y$ matches one (or more) reference answer(s) $y^*$. They differ primarily in granularity (n-grams vs. tokens) and in representation (lexical vs. semantic).
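4.1 BLEU (Bilingual Evaluation Understudy)
$$\text{BLEU} = \text{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \text{BP} = \begin{cases} 1 & \text{if } |\hat y| > |y^*| \\ \exp\big(1 - |y^*| / |\hat y|\big) & \text{otherwise} \end{cases}$$
BLEU geometrically averages the clipped n-gram precisions $p_n$ and multiplies by a brevity penalty (BP) that penalizes overly short hypotheses; $|\hat y|$ and $|y^*|$ denote the lengths of hypothesis and reference. Standard weights: $w_n = 1/N$ with $N=4$.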
Worked example: BLEU-1 and BLEU-2
Reference: "The motor reaches a speed of 3000 revolutions per minute"
Generated: "The motor has a speed of 3000 revolutions per minute"
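The two scores for this sentence pair can be computed by hand or with a few lines of code. The sketch below assumes whitespace tokenization, a single reference, and uniform weights $w_n = 1/N$; the function names are chosen for this example and do not mirror any library API.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision: a hypothesis n-gram counts at most as often as in the reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def bleu(hyp, ref, max_n):
    """BLEU with uniform weights over orders 1..max_n and the standard brevity penalty."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "The motor reaches a speed of 3000 revolutions per minute".split()
hyp = "The motor has a speed of 3000 revolutions per minute".split()
bleu1 = bleu(hyp, ref, 1)   # p1 = 9/10, BP = 1
bleu2 = bleu(hyp, ref, 2)   # p2 = 7/9, geometric mean of p1 and p2
print(round(bleu1, 3), round(bleu2, 3))  # 0.9 0.837
```

Replacing a single word costs one of ten unigrams but two of nine bigrams ("motor has", "has a"), which is why BLEU-2 falls faster than BLEU-1.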
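4.2 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
$$\text{ROUGE-N} = \frac{\sum_{g \in G_N(y^*)} \min\big(\operatorname{count}_{\hat y}(g),\, \operatorname{count}_{y^*}(g)\big)}{\sum_{g \in G_N(y^*)} \operatorname{count}_{y^*}(g)}$$
Here $G_N(y^*)$ is the set of distinct n-grams of order $N$ in the reference. ROUGE-N is recall-oriented (the denominator is based on the reference, not the hypothesis). ROUGE-L uses the longest common subsequence (LCS): matched tokens must appear in the same order but need not be contiguous, which makes it tolerant of inserted or omitted words.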
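Both variants can be illustrated on the motor sentence pair from the BLEU example. The sketch assumes whitespace tokenization and reports recall only (ROUGE implementations usually also report precision and an F-measure); the function names are chosen for this example.

```python
from collections import Counter

def rouge_n_recall(hyp, ref, n=1):
    """Share of reference n-grams that also occur in the hypothesis (with clipping)."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp_counts, ref_counts = counts(hyp), counts(ref)
    overlap = sum(min(count, hyp_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

ref = "The motor reaches a speed of 3000 revolutions per minute".split()
hyp = "The motor has a speed of 3000 revolutions per minute".split()
rouge1 = rouge_n_recall(hyp, ref, n=1)
lcs = lcs_length(hyp, ref)
rouge_l_recall = lcs / len(ref)
print(round(rouge1, 2), lcs, round(rouge_l_recall, 2))  # 0.9 9 0.9
```

The common subsequence skips over the substituted word, so nine of the ten reference tokens are still matched.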
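4.3 BERTScore
$$R_{\text{BERT}} = \frac{1}{|y^*|} \sum_{x_i \in y^*} \max_{\hat x_j \in \hat y} \cos(\mathbf{x}_i, \hat{\mathbf{x}}_j), \qquad P_{\text{BERT}} = \frac{1}{|\hat y|} \sum_{\hat x_j \in \hat y} \max_{x_i \in y^*} \cos(\mathbf{x}_i, \hat{\mathbf{x}}_j), \qquad F_{\text{BERT}} = 2\,\frac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$
Instead of counting exact token matches, BERTScore matches each token to its most similar counterpart by cosine similarity of contextualized embeddings (e.g., from BERT or RoBERTa); $\mathbf{x}_i$ and $\hat{\mathbf{x}}_j$ denote the embeddings of reference and hypothesis tokens. This addresses the fundamental problem of lexical metrics: semantic equivalence under lexical variation.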
BERTScore detects synonyms
Reference: "The transmission transmits a torque"
Generated: "The gearbox forwards a torque"
Lexical metrics: BLEU is low (different words). BERTScore: high, because "transmits" and "forwards" are semantically equivalent in this context.
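The greedy matching behind this behavior can be sketched with toy vectors. The two-dimensional "embeddings" below are invented for illustration and are not produced by BERT or RoBERTa; the sketch also omits the optional IDF weighting and baseline rescaling of the original method.

```python
import math

def cos(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def greedy_bertscore(ref_emb, hyp_emb):
    """Precision, recall, and F1 from greedy max-cosine token matching."""
    recall = sum(max(cos(r, h) for h in hyp_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cos(h, r) for r in ref_emb) for h in hyp_emb) / len(hyp_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up vectors: the first token pair is nearly parallel (near-synonyms),
# the second token is identical in both sentences.
ref_emb = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "transmits", "torque"
hyp_emb = [[0.9, 0.1], [0.0, 1.0]]   # e.g. "forwards", "torque"
p, r, f1 = greedy_bertscore(ref_emb, hyp_emb)
print(round(f1, 3))
```

An exact-match metric would score the first token pair as a miss, whereas the embedding-based match treats it as almost a full hit.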
Known weaknesses of classical metrics
- BLEU/ROUGE: correlate weakly with human judgment in open-domain generation (Reiter, 2018; Sai et al., 2022). With single reference answers the variance is extremely high.
- BERTScore: depends on the encoder model used. Cross-lingual transfer and domain drift can distort values.
- All three: can be subject to Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
Statistical Evaluation & Significance
A single metric value on a test set is a point estimate with sampling variance. Without quantifying this uncertainty, statements like "system A is better than system B" are scientifically unfounded. This chapter introduces three canonical tools: confidence intervals, the bootstrap, and inter-annotator agreement.
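5.1 Confidence intervals for metrics
For a mean $\bar X$ over $n$ queries with sample standard deviation $s$, the central limit theorem gives approximately:
$$\text{CI}_{1-\alpha} = \bar X \pm z_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
with $z_{0.975} \approx 1.96$ for a 95% interval.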
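Worked example: CI for MRR
Over 100 test queries, $\overline{\text{MRR}} = 0.708$ with $s = 0.28$. The 95% CI is:
$$0.708 \pm 1.96 \cdot \frac{0.28}{\sqrt{100}} = 0.708 \pm 0.055 \;\Rightarrow\; \text{CI}_{95\%} \approx [0.653,\; 0.763]$$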
5.2 Bootstrap resampling
For non-normal metrics (e.g., NDCG, BLEU), the non-parametric bootstrap procedure (Efron, 1979) is preferable. From the test set, $B$ (typically 1000) random samples are drawn with replacement; quantiles are then read directly from the distribution of the $B$ metric values.
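5.3 Inter-annotator agreement (Cohen's $\kappa$)
When relevance is annotated by humans, the reliability of the annotation itself must be checked. Cohen's kappa corrects observed agreement by the agreement expected by chance:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
Here $p_o$ is the observed proportion of agreement and $p_e$ the proportion of agreement expected by chance from the annotators' marginal label frequencies.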
| $\kappa$ value | Interpretation (Landis & Koch, 1977) |
|---|---|
| < 0.00 | Poor (worse than chance) |
| 0.00 – 0.20 | Slight agreement |
| 0.21 – 0.40 | Fair agreement |
| 0.41 – 0.60 | Moderate agreement |
| 0.61 – 0.80 | Substantial agreement |
| 0.81 – 1.00 | Almost perfect |
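A minimal sketch of the computation for two annotators follows. The ten binary relevance judgments are hypothetical, and the function assumes both annotators labeled the same items in the same order; for production use, scikit-learn provides `cohen_kappa_score`.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of the marginal label frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined if p_e == 1 (only one label in use)

# Hypothetical judgments (1 = relevant, 0 = not relevant) for 10 documents.
annotator_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.583
```

Although the annotators agree on 8 of 10 items, chance agreement of 0.52 reduces this to $\kappa \approx 0.58$, which falls into the "moderate agreement" band.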
Scientific reporting practice
It is increasingly expected in conference papers (ACL, EMNLP, NeurIPS) that every primary metric is reported with a confidence interval plus a significance test against the baseline (e.g., paired bootstrap test). An improvement of 0.3 BLEU points without a significance statement is difficult to interpret and is generally not considered sufficient evidence.
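A paired bootstrap test of this kind can be sketched as follows. The version below is one common variant, not a canonical implementation: it resamples per-query score differences and reports the fraction of resamples in which system A fails to beat system B. The per-query scores are hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_boot=1000, seed=0):
    """Estimate a one-sided p-value: share of resamples where A does not beat B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_boot):
        # Resample queries with replacement, keeping the A/B pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_boot

# Hypothetical per-query metric values for two systems on the same 8 queries.
scores_a = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.66]
scores_b = [0.60, 0.65, 0.57, 0.72, 0.61, 0.70, 0.58, 0.60]
p_value = paired_bootstrap(scores_a, scores_b)
print(p_value)
```

Pairing matters because both systems are evaluated on the same queries: resampling the differences removes the query-difficulty variance that an unpaired comparison would count as noise.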
Metric Comparison
BLEU, ROUGE, and BERTScore react in qualitatively different ways to five classes of text variation; the typical magnitudes of these reactions are documented by Sai et al. (2022). In the baseline case of identical hypothesis and reference, all three metrics return their maximum value.
The contradiction case
Pay particular attention to the contradiction variant: when the hypothesis nearly fully repeats the reference lexically but states the opposite semantically (e.g., "does NOT reach 3000…"), BLEU stays artificially high. This is a well-known failure mode of all n-gram metrics (Freitag et al., 2022) and one of the central motivations for developing learned metrics such as BLEURT (Sellam et al., 2020).
Decision Guide
The choice of the "right" metric depends on application context. The table below offers pragmatic recommendations for typical engineering scenarios:
| Use case | Primary metric | Secondary metric | Reasoning |
|---|---|---|---|
| Safety data sheets (RAG) | Faithfulness | Context recall | Hallucinations are safety-critical |
| Standards / regulation search | Context recall | NDCG@10 | Completeness over precision |
| FAQ bot / search UI | MRR | Precision@1 | First result dominates UX |
| Machine translation | BLEU + chrF | BERTScore | Established standard metrics |
| Summarization | ROUGE-L | BERTScore | Recall orientation essential |
| Code generation | Pass@K (functional test) | BLEU | Correctness is binary |
| Open-domain QA | RAGAS profile (all 4) | Human eval | Multidimensional quality |
7.1 Anti-patterns: common evaluation mistakes
What not to do
- One-dimensional optimization: looking at only one metric ignores the multidimensional nature of quality.
- Metric hacking (Goodhart's Law): directly optimizing on a metric often leads to quality loss on other axes.
- Neglecting human evaluation: automatic metrics are proxies; final validation requires sample-based human evaluation.
- Forgetting the retrieval component: in RAG, poor end-to-end quality is often wrongly attributed to the generator when in fact retrieval is failing.
- Missing statistical significance: point estimates without confidence intervals are not informative (see Ch. 5).
- Test set contamination: when training and test data overlap, all metrics become unreliable.
Python Implementation
The reference implementations below show the core logic. For production systems, use of established libraries is recommended (RAGAS, evaluate, bert_score, scikit-learn).
8.1 Custom implementation of retrieval metrics
from typing import Sequence, Set

import numpy as np


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of top-K hits that are relevant."""
    top_k = retrieved[:k]
    return sum(d in relevant for d in top_k) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-K."""
    if not relevant:
        return 0.0  # recall is undefined without relevant items; 0.0 by convention here
    top_k = retrieved[:k]
    return sum(d in relevant for d in top_k) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant hit (0 if none)."""
    for i, doc in enumerate(retrieved, 1):
        if doc in relevant:
            return 1.0 / i
    return 0.0


def mean_reciprocal_rank(all_retrieved, all_relevant) -> float:
    """Mean of the reciprocal ranks over all queries."""
    return float(np.mean([
        reciprocal_rank(r, rel) for r, rel in zip(all_retrieved, all_relevant)
    ]))


def ndcg_at_k(relevance_grades: Sequence[float], k: int) -> float:
    """NDCG with graded relevance.

    The ideal ordering is built from the grades of the retrieved list only;
    relevant documents that were never retrieved are not taken into account.
    """
    def dcg(scores):
        # enumerate starts at 0, so the document at rank i + 1 is discounted by log2(i + 2)
        return sum((2**s - 1) / np.log2(i + 2) for i, s in enumerate(scores[:k]))

    actual = dcg(relevance_grades)
    ideal = dcg(sorted(relevance_grades, reverse=True))
    return float(actual / ideal) if ideal > 0 else 0.0
8.2 Bootstrap confidence interval
import numpy as np


def bootstrap_ci(scores, n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 42) -> tuple:
    """Non-parametric (1 - alpha) CI for the mean via bootstrap resampling.

    Returns (mean, lower, upper); the default alpha = 0.05 gives a 95% interval.
    """
    scores = np.asarray(scores, dtype=float)  # also accepts plain Python lists
    rng = np.random.default_rng(seed)
    n = len(scores)
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        boot_means[b] = scores[idx].mean()
    lo = np.quantile(boot_means, alpha / 2)
    hi = np.quantile(boot_means, 1 - alpha / 2)
    return float(scores.mean()), float(lo), float(hi)
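8.3 RAGAS, BLEU, ROUGE, BERTScore (libraries)
# The names eval_ds, hyps, and refs are assumed to be defined beforehand:
# eval_ds is the evaluation dataset passed to RAGAS,
# hyps and refs are parallel lists of hypothesis and reference strings.

# RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

results = evaluate(
    dataset=eval_ds,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)

# BLEU / ROUGE via the Hugging Face `evaluate` package
# (a different package from the ragas `evaluate` function imported above)
from evaluate import load

bleu = load("bleu")
rouge = load("rouge")
bleu_score = bleu.compute(predictions=hyps, references=[[r] for r in refs])  # one reference list per hypothesis
rouge_score = rouge.compute(predictions=hyps, references=refs)

# BERTScore (English)
from bert_score import score

P, R, F1 = score(hyps, refs, lang="en", model_type="roberta-large")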
Exercises
Exercise 1: Precision & Recall
10 documents are retrieved; those at ranks 1, 3, 5, 8, 10 are relevant. The corpus contains 15 relevant documents in total. Compute Precision@K and Recall@K for $K = 5$ and $K = 10$, as well as F1@10.
Solution:
Precision@5 = $3/5 = 0.60$ | Precision@10 = $5/10 = 0.50$
Recall@5 = $3/15 = 0.20$ | Recall@10 = $5/15 \approx 0.333$
F1@10 = $\frac{2 \cdot 0.5 \cdot 0.333}{0.5 + 0.333} \approx 0.40$
Exercise 2: MRR over 5 queries
First relevant ranks: $q_1 \to 1$, $q_2 \to 4$, $q_3 \to 2$, $q_4 \to 1$, $q_5 \to 3$. Compute the MRR.
Solution:
$\text{MRR} = \dfrac{1 + 0.25 + 0.5 + 1 + 0.333}{5} = \dfrac{3.083}{5} \approx 0.617$
Exercise 3: Faithfulness
Context: "Transformer: 230 V → 24 V, ratio 9.6 : 1, power 500 W". The answer restates these three facts and additionally mentions "95% efficiency" (not present in the context). Compute the faithfulness score.
Solution:
3 supported claims + 1 hallucination → $F = 3/4 = 0.75$
Exercise 4: NDCG with graded relevance
Top-5 relevance grades: $(2, 3, 0, 1, 2)$. Compute NDCG@5.
Solution:
DCG = $3 + \frac{7}{1.585} + 0 + \frac{1}{2.322} + \frac{3}{2.585} = 3 + 4.416 + 0.431 + 1.160 = 9.007$
Ideal ordering $(3,2,2,1,0)$: IDCG = $7 + \frac{3}{1.585} + \frac{3}{2.000} + \frac{1}{2.322} + 0 = 7 + 1.893 + 1.500 + 0.431 = 10.824$
$\text{NDCG@}5 = 9.007 / 10.824 \approx 0.832$
Exercise 5: Confidence interval
On a test set of $n=50$ queries the mean BERTScore is $\bar X = 0.82$ with $s = 0.11$. Compute the 95% CI.
Solution:
$\text{SE} = 0.11/\sqrt{50} \approx 0.0156$
Half-width: $1.96 \cdot 0.0156 \approx 0.0305$
$\text{CI}_{95\%} \approx [0.790,\; 0.850]$
Quick Reference & Bibliography
11.1 Metric overview
| Metric | Formula (short) | Range | Main use |
|---|---|---|---|
| Precision@K | $\lvert\mathcal{R}_K \cap \mathcal{V}\rvert / K$ | $[0,1]$ | Search |
| Recall@K | $\lvert\mathcal{R}_K \cap \mathcal{V}\rvert / \lvert\mathcal{V}\rvert$ | $[0,1]$ | Completeness |
| MRR | $\frac{1}{\lvert Q\rvert}\sum 1/\text{rank}_i$ | $[0,1]$ | First relevant hit |
| NDCG@K | DCG / IDCG | $[0,1]$ | Graded relevance |
| Faithfulness | $\lvert\text{supported}\rvert/\lvert\text{total}\rvert$ | $[0,1]$ | Hallucination detection |
| BLEU | $\text{BP}\cdot\exp(\sum w_n \log p_n)$ | $[0,1]$ | Translation |
| ROUGE-N/L | n-gram / LCS recall | $[0,1]$ | Summarization |
| BERTScore | $F_1$ of embedding cosines | $[0,1]$ | Semantic evaluation |
Bibliography
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demonstrations, pp. 150–158.
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002, pp. 311–318.
- Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop on Text Summarization, pp. 74–81.
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
- Järvelin, K., & Kekäläinen, J. (2002). Cumulated Gain-Based Evaluation of IR Techniques. ACM TOIS, 20(4), pp. 422–446.
- Sai, A. B., Mohankumar, A. K., & Khapra, M. M. (2022). A Survey of Evaluation Metrics Used for NLG Systems. ACM Computing Surveys, 55(2).
- Reiter, E. (2018). A Structured Review of the Validity of BLEU. Computational Linguistics, 44(3), pp. 393–401.
- Ovadia, O., et al. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024.
- Freitag, M., et al. (2022). Results of the WMT22 Metrics Shared Task. WMT 2022.
- Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1), pp. 1–26.
- Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), pp. 159–174.
- Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning Robust Metrics for Text Generation. ACL 2020.