Guide: Evaluating RAG Systems and Fine-Tuned LLMs

01

Introduction & Foundations

1.1 What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) denotes an architectural pattern that extends the static, parametric knowledge of a Large Language Model (LLM) with dynamic access to external document collections. Formally, the pipeline composes a retrieval step with a generation step:

RAG Pipeline (formal definition)

$$\hat{y} = \mathcal{G}\bigl(q,\; \mathcal{R}(q;\, \mathcal{D})\bigr) \quad\text{with}\quad \mathcal{R}(q;\mathcal{D}) = \operatorname*{argtop\text{-}k}_{d \in \mathcal{D}} \;\operatorname{sim}(q, d)$$

Here $q$ denotes the query, $\mathcal{D}$ the document collection, $\mathcal{R}$ the retriever (e.g., a dense bi-encoder with cosine similarity), and $\mathcal{G}$ the generator (the LLM). The composition $\mathcal{G} \circ \mathcal{R}$ unifies parametric knowledge (in the model weights) with non-parametric knowledge (in the index) — a concept discussed in the literature as hybrid memory (Lewis et al., 2020).
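The retrieval step $\operatorname{argtop-}k$ can be sketched in a few lines. This is a minimal illustration, assuming that query and documents have already been embedded as dense vectors; the function name `top_k_cosine` and the toy 2-D vectors are made up for this example and do not come from any particular library.

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k documents most similar to the query (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                     # cosine similarity of each document to the query
    return np.argsort(-sims)[:k]     # indices sorted by descending similarity

# Toy 2-D "embeddings" (hypothetical values, for illustration only)
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
top2 = top_k_cosine(query, docs, k=2)
print(top2)  # indices of the two nearest documents
```

In a real system the exhaustive `argsort` would typically be replaced by an approximate nearest-neighbor index, but the ranking criterion stays the same.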

📖 Scientific basis: Lewis et al. (2020) introduced RAG as an architecture. Es et al. (2024) introduced RAGAS, a widely cited method for reference-free evaluation (EACL 2024).

🔧 Application example: RAG in manufacturing engineering

Query: "How do I optimize cutting speed for CNC milling of aluminum?"

Retrieved documents (top-2):

$d_1$: "For AlMg3 we recommend cutting speeds of 200–300 m/min..."
$d_2$: "The feed rate for aluminum should be 0.1–0.3 mm/tooth..."

Generated answer: "For CNC milling of aluminum I recommend 200–300 m/min with HSS cutters and a feed rate of 0.1–0.3 mm/tooth."

Note: answer quality depends on both retrieval quality and generator faithfulness, so each component requires its own metrics. Here, for instance, the "HSS cutters" detail in the answer appears in neither retrieved document.

1.2 RAG vs. Fine-Tuning: A conceptual comparison

🔄 RAG approach

Knowledge externally indexed (non-parametric)
Updates possible without retraining
Source citation and traceability
Higher inference latency (retrieval step)
Suited to fact-intensive domains

🎯 Fine-tuning approach

Knowledge in model weights (parametric)
Updates require retraining
No intrinsic source attribution
Lower inference latency
Suited to style, format, and domain adaptation

💡 Design decision in practice

The two approaches are complementary, not competing. Empirical studies (Ovadia et al., 2024) show that RAG is superior for fact-based tasks, while fine-tuning is preferable for stylistic adaptation and latency-critical applications. The hybrid approach — fine-tuning on domain style plus RAG for current facts — is state-of-the-art in production systems.

02

Retrieval Metrics

Retrieval metrics evaluate the quality of the component $\mathcal{R}$. They differ in whether they (a) consider pure set relations (precision, recall), (b) account for position (MRR), or (c) incorporate graded relevance (NDCG).

2.1 Precision@K and Recall@K

Precision & Recall at cutoff K

$$\text{Precision@}K = \frac{|\mathcal{R}_K \cap \mathcal{V}|}{K}, \qquad \text{Recall@}K = \frac{|\mathcal{R}_K \cap \mathcal{V}|}{|\mathcal{V}|}$$

Here $\mathcal{R}_K$ is the set of top-$K$ retrieved documents and $\mathcal{V}$ is the set of all relevant documents in the corpus (ground truth).

📐 Statistical interpretation

Precision@K is the empirical estimator of the conditional probability $P(\text{relevant} \mid \text{retrieved in top-}K)$. Recall@K estimates $P(\text{retrieved in top-}K \mid \text{relevant})$. The two can diverge: a system with high precision but low recall is "precise but incomplete"; conversely, a system with high recall but low precision finds most relevant documents but returns them with a high noise fraction.

The F1 score $F_1 = 2 \cdot \frac{P \cdot R}{P+R}$ forms the harmonic mean and is more pessimistic than the arithmetic mean (it penalizes imbalance more strongly).
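The effect of the harmonic mean is easiest to see numerically. The following sketch (the function name `f1_score` and the input values are chosen only for illustration) compares a balanced and an imbalanced system that share the same arithmetic mean of 0.5.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.5, 0.5)     # P == R: F1 equals the arithmetic mean
imbalanced = f1_score(0.9, 0.1)   # same arithmetic mean (0.5), much lower F1
print(f"{balanced:.2f} vs. {imbalanced:.2f}")  # prints "0.50 vs. 0.18"
```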

✏️ Worked example: Precision@5 — Thiele-Small parameters

Query: "How do I determine the Thiele-Small parameters of a loudspeaker?"

Rank	Document	Relevant?
1	"Thiele-Small measurement via impedance analysis"	✓
2	"Enclosure design for bass loudspeakers"	✗
3	"Electromechanical modeling"	✓
4	"Crossover design"	✗
5	"Determining $Q_{ts}$, $V_{as}$ and $f_s$"	✓

1. Relevant in top-5: 3 (ranks 1, 3, 5)

2. $\text{Precision@}5 = \dfrac{3}{5} = 0.60$

Precision@5 = 0.60

2.2 Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank

$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{\operatorname{rank}_i}$$

$\operatorname{rank}_i$ denotes the position of the first relevant document for query $q_i$. MRR is especially meaningful when the user typically reads only the first result (search UIs, FAQ bots).

✏️ Worked example: MRR over 4 queries

Query	First relevant rank	Reciprocal
$q_1$: "Thiele-Small parameters"	1	$1/1 = 1.000$
$q_2$: "FEM simulation"	3	$1/3 = 0.333$
$q_3$: "ISO tolerance calculation"	2	$1/2 = 0.500$
$q_4$: "Heat treatment of steel"	1	$1/1 = 1.000$

1. $\text{MRR} = \frac{1}{4}(1.000 + 0.333 + 0.500 + 1.000) = 0.708$

MRR = 0.708

2.3 Normalized Discounted Cumulative Gain (NDCG)

📖 Scientific basis: Järvelin & Kekäläinen (2002). "Cumulated Gain-Based Evaluation of IR Techniques." ACM TOIS, 20(4), pp. 422–446. DOI: 10.1145/582415.582418

DCG · IDCG · NDCG

$$\text{DCG@}K = \sum_{i=1}^{K} \frac{2^{\operatorname{rel}_i} - 1}{\log_{2}(i+1)}, \qquad \text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$

NDCG is the only one of the metrics presented here that admits graded relevance (e.g., $\operatorname{rel}_i \in \{0,1,2,3\}$): "not relevant", "marginal", "relevant", "highly relevant". The logarithm in the denominator implements position discounting: later hits contribute less to the overall score.

✏️ Worked example: NDCG@5 — AOI defect detection

Rank $i$	Document	$\operatorname{rel}_i$	$2^{\operatorname{rel}_i}-1$	$\log_2(i+1)$	Gain
1	"Deep learning for AOI"	3	7	1.000	7.000
2	"Introduction to QA"	0	0	1.585	0.000
3	"AOI illumination"	2	3	2.000	1.500
4	"Statistics"	1	1	2.322	0.431
5	"CNN defect classification"	3	7	2.585	2.708

1. $\text{DCG@}5 = 7.00 + 0 + 1.50 + 0.43 + 2.71 = 11.64$

2. Ideal ordering: $(3,3,2,1,0)$ → $\text{IDCG@}5 = 7.00 + 4.42 + 1.50 + 0.43 + 0 = 13.35$

3. $\text{NDCG@}5 = \dfrac{11.64}{13.35} = 0.872$

NDCG@5 = 0.872

Fig. 1. Comparison of the four retrieval metrics for the shared application scenario.

2.4 Confusion matrix & precision-recall curve

A complete diagnosis of retrieval behavior requires examining the full trade-off curve, not just a single point:

Fig. 2. Precision-recall curves of three hypothetical retrievers. The average precision (AP) corresponds to the area under the curve and is the position-aware generalization of precision.

Fig. 3. Confusion matrix for retrieval classification. TP = relevant + retrieved, FP = irrelevant + retrieved, FN = relevant + not retrieved, TN = irrelevant + not retrieved.
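The average precision (AP) mentioned in the caption of Fig. 2 can be computed directly from a ranked list, without plotting the curve. The sketch below uses the common definition (mean of Precision@i over the ranks at which a relevant document appears, normalized by the number of relevant documents); the function name and the assumption that only three relevant documents exist are illustrative.

```python
from typing import Sequence

def average_precision(is_relevant: Sequence[bool], n_relevant: int) -> float:
    """Mean of Precision@i over the ranks i that hold a relevant document.

    n_relevant is the total number of relevant documents in the corpus.
    """
    hits, total = 0, 0.0
    for i, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            total += hits / i        # Precision@i at this relevant rank
    return total / n_relevant if n_relevant > 0 else 0.0

# Relevance pattern with hits at ranks 1, 3, 5; assuming (for illustration)
# that these three are the only relevant documents in the corpus.
ap = average_precision([True, False, True, False, True], n_relevant=3)
print(round(ap, 3))
```

Unlike Precision@K, this value changes when the same relevant documents are moved to different ranks, which is what makes it position-aware.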

⚠️ Critical limitation

All four metrics presuppose that ground-truth relevance is known — in practice an annotation-expensive process. In large corpora, therefore, only a sample is typically annotated, leading to wide confidence intervals. Furthermore, semantically similar but lexically different documents are often incorrectly marked as irrelevant (annotator bias).

03

RAGAS Framework: Generation Metrics

📖 Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL 2024 Demonstrations, pp. 150–158.

The RAGAS framework is largely reference-free: instead of human-annotated gold answers, it uses a "judge LLM" to operationalize the evaluation criteria. Of the four main metrics, only context recall compares against a gold answer. Together they systematically cover both pipeline components:

Metric	Compares	Quantifies	Component
Faithfulness	Answer vs. context	Faithfulness / hallucination	Generator $\mathcal{G}$
Answer relevance	Answer vs. query	Topic alignment	Generator $\mathcal{G}$
Context precision	Context vs. query	Signal/noise in context	Retriever $\mathcal{R}$
Context recall	Context vs. gold answer	Completeness	Retriever $\mathcal{R}$

3.1 Faithfulness

Faithfulness

$$F = \frac{|\{\,s \in S(\hat y) \;:\; s \;\text{is supported by}\; c\}|}{|S(\hat y)|}$$

Here $S(\hat y)$ decomposes the generated answer into atomic claims, each individually checked against the context $c$. A claim is considered "supported" if it is logically derivable from $c$.

✏️ Worked example: Faithfulness — materials engineering

Context $c$: "42CrMo4 is a tempered steel with a tensile strength of 900–1100 MPa. Hardness 28–34 HRC. Weldable with preheating."

Answer $\hat y$: "42CrMo4 has a tensile strength of 900–1100 MPa, a hardness of 28–34 HRC, is weldable with preheating, and is highly corrosion-resistant."

#	Atomic claim	Supported by $c$?
1	Tensile strength 900–1100 MPa	✓
2	Hardness 28–34 HRC	✓
3	Weldable with preheating	✓
4	Corrosion-resistant	✗ Hallucination!

Faithfulness = 0.75 (3/4)
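Once each atomic claim has a supported / not supported verdict, the score itself is a simple ratio. The sketch below assumes the verdicts are already available (here hand-labeled; in RAGAS a judge LLM would produce them) and defines the hallucination rate as $1 - F$, which is an assumption of this example.

```python
from typing import Sequence

def faithfulness(supported: Sequence[bool]) -> float:
    """Fraction of atomic claims judged as supported by the context."""
    if not supported:
        return 0.0                   # no claims extracted: report 0 by convention
    return sum(supported) / len(supported)

# Verdicts for the four claims of the 42CrMo4 example
# (tensile strength, hardness, weldability, corrosion resistance).
verdicts = [True, True, True, False]
score = faithfulness(verdicts)
hallucination_rate = 1.0 - score     # assumed definition: share of unsupported claims
print(score, hallucination_rate)     # prints "0.75 0.25"
```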

3.2 Answer Relevance

Answer Relevance

$$\text{AR}(q, \hat y) = \frac{1}{n}\sum_{j=1}^{n} \cos\bigl(\mathbf{e}(q),\; \mathbf{e}(\tilde q_j)\bigr) \quad\text{with}\quad \tilde q_j \sim \text{LLM}(\hat y)$$

From the generated answer $\hat y$, $n$ synthetic questions $\tilde q_j$ are back-generated. The cosine similarity of their embeddings to the original query $q$ measures whether the answer addresses the right question.

✏️ Worked example: Answer Relevance — PID controller

Query $q$: "How do I tune a PID controller?"

Synthetic question $\tilde q_j$	Cosine similarity
"How do I tune PID with Ziegler-Nichols?"	0.92
"What are $K_p$, $T_i$, $T_d$?"	0.68
"Who invented the PID controller?"	0.25

1. $\text{AR} = \frac{1}{3}(0.92 + 0.68 + 0.25) \approx 0.62$

Answer Relevance = 0.62
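The averaging step of the formula can be sketched as follows. The example assumes that the embedding function $\mathbf{e}(\cdot)$ and the back-generated questions are provided elsewhere; the 2-D vectors are made up and only demonstrate the arithmetic.

```python
import numpy as np

def answer_relevance(query_emb: np.ndarray, synth_embs: np.ndarray) -> float:
    """Mean cosine similarity between the query and the synthetic questions."""
    q = query_emb / np.linalg.norm(query_emb)
    s = synth_embs / np.linalg.norm(synth_embs, axis=1, keepdims=True)
    return float(np.mean(s @ q))

# Hypothetical 2-D embeddings: one row per synthetic question
q_emb = np.array([1.0, 0.0])
synth = np.array([[1.0, 0.0],    # same direction as the query  -> cosine 1.0
                  [0.0, 1.0],    # orthogonal to the query      -> cosine 0.0
                  [1.0, 1.0]])   # 45 degrees off               -> cosine ~0.707
ar = answer_relevance(q_emb, synth)
print(round(ar, 3))
```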

3.3 Context Recall

Context Recall

$$\text{CR} = \frac{|\{\,s \in S(y^*) \;:\; s \in c\}|}{|S(y^*)|}$$

The fraction of atomic claims from the gold answer $y^*$ that is covered by the retrieved context $c$. Low CR → retriever delivers incomplete context → the generator cannot deliver a complete answer even with perfect faithfulness.

✏️ Worked example: Context Recall — welding defects

Gold answer $y^*$ covers 6 claims: porosity, undercut, lack of fusion, cracks, spatter, root defects

Retrieved context $c$ contains: porosity, undercut, lack of fusion, spatter (4 of 6)

Context Recall = 0.667 (4/6)

Fig. 4. RAGAS profile of two hypothetical systems compared. The plot visualizes strengths and weaknesses per dimension.

💡 Diagnostic use of the RAGAS metrics

The four metrics enable causal diagnosis: low context recall → retriever problem. Low faithfulness with good context recall → generator hallucinates. Low answer relevance → generator answers a different question. This modularity is the central advance over monolithic end-to-end metrics.

04

Fine-Tuning Metrics

Classical text-generation metrics evaluate how well a generated hypothesis $\hat y$ matches one (or more) reference answer(s) $y^*$. They differ primarily in granularity (n-grams vs. tokens) and in representation (lexical vs. semantic).

4.1 BLEU (Bilingual Evaluation Understudy)

📖 Papineni et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation." ACL 2002, pp. 311–318. DOI: 10.3115/1073083.1073135

BLEU-N

$$\text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \quad \text{BP} = \begin{cases} 1 & \text{if } |\hat y| > |y^*| \\ e^{1 - |y^*|/|\hat y|} & \text{otherwise}\end{cases}$$

BLEU geometrically averages n-gram precisions $p_n$ and multiplies by a brevity penalty (BP) that penalizes overly short hypotheses. Standard weights: $w_n = 1/N$ with $N=4$.

✏️ Worked example: BLEU-1 and BLEU-2

Reference: "The motor reaches a speed of 3000 revolutions per minute"

Generated: "The motor has a speed of 3000 revolutions per minute"

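1. Unigram precision $p_1 = 9/10 = 0.900$ (every hypothesis token except "has" occurs in the reference)

2. Bigram precision $p_2 = 7/9 \approx 0.778$ (the bigrams "motor has" and "has a" do not occur in the reference)

3. Equal length → $\text{BP} = 1$

BLEU-1 = 0.900

BLEU-2 = $\sqrt{0.900 \cdot 0.778} \approx 0.837$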
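To make the counting rules explicit, here is a minimal sentence-level BLEU sketch following the formula above. It assumes whitespace tokenization, a single reference, uniform weights, and no smoothing; the helper names and the toy sentence pair are illustrative, and production code should rely on an established implementation.

```python
import math
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp: List[str], ref: List[str], n: int) -> float:
    """Modified n-gram precision: hypothesis counts are clipped by reference counts."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total

def bleu(hyp: List[str], ref: List[str], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights, one reference, no smoothing."""
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; real toolkits apply smoothing here
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
hyp = "the cat is on the mat".split()
p1 = clipped_precision(hyp, ref, 1)   # 5 of 6 unigrams match
p2 = clipped_precision(hyp, ref, 2)   # 3 of 5 bigrams match
bleu2 = bleu(hyp, ref, max_n=2)       # geometric mean of p1 and p2, BP = 1
print(round(p1, 3), round(p2, 3), round(bleu2, 3))
```

Clipping matters for repeated words: a hypothesis that repeats "the" ten times cannot collect more unigram matches than the reference contains.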

4.2 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

📖 Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." ACL Workshop, pp. 74–81.

ROUGE-N

$$\text{ROUGE-}N = \frac{\sum_{g \in y^*} \min(\#g_{\hat y}, \#g_{y^*})}{\sum_{g \in y^*} \#g_{y^*}}$$

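ROUGE-N is recall-oriented (the denominator is based on the reference, not the hypothesis). ROUGE-L uses the longest common subsequence (LCS): matched words must appear in the same order but need not be contiguous, so it tolerates insertions and gaps but remains sensitive to reordering.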
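Both variants can be sketched compactly. The code below is a simplified illustration, assuming whitespace tokens and a single reference, and it reports only the recall component; the function names are chosen for this example and are not the API of any ROUGE package.

```python
from collections import Counter
from typing import List

def rouge_n_recall(hyp: List[str], ref: List[str], n: int = 1) -> float:
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    def grams(tokens: List[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    h, r = grams(hyp), grams(ref)
    total = sum(r.values())
    return sum(min(h[g], c) for g, c in r.items()) / total if total else 0.0

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(hyp: List[str], ref: List[str]) -> float:
    """LCS length relative to the reference length."""
    return lcs_length(hyp, ref) / len(ref) if ref else 0.0

ref = "the cat sat on the mat".split()
hyp = "the cat is on the mat".split()
r1 = rouge_n_recall(hyp, ref, n=1)   # 5 of 6 reference unigrams are covered
rl = rouge_l_recall(hyp, ref)        # LCS "the cat on the mat" has length 5
print(round(r1, 3), round(rl, 3))
```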

4.3 BERTScore

📖 Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). "BERTScore: Evaluating Text Generation with BERT." ICLR 2020. [arXiv:1904.09675]

BERTScore F1

$$P_{\text{BERT}} = \frac{1}{|\hat y|}\sum_{\hat x \in \hat y} \max_{x \in y^*} \mathbf{e}(\hat x)^{\!\top}\mathbf{e}(x), \quad F_1 = \frac{2 P R}{P+R}$$

Instead of counting exact token matches, BERTScore computes the maximum cosine similarity of contextualized embeddings (e.g., from BERT or RoBERTa). This addresses the fundamental problem of lexical metrics: they fail to recognize semantic equivalence under lexical variation.
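The greedy matching behind $P_{\text{BERT}}$ and its recall counterpart can be illustrated independently of any encoder. The sketch below assumes token embeddings are already given as matrix rows; the 2-D vectors are invented, and features of the real metric such as IDF weighting and baseline rescaling are deliberately left out.

```python
import numpy as np

def greedy_match_f1(hyp_embs: np.ndarray, ref_embs: np.ndarray) -> tuple:
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    h = hyp_embs / np.linalg.norm(hyp_embs, axis=1, keepdims=True)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sim = h @ r.T                        # pairwise cosine similarities
    precision = sim.max(axis=1).mean()   # best reference match per hypothesis token
    recall = sim.max(axis=0).mean()      # best hypothesis match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Hypothetical 2-D token embeddings (one row per token)
hyp_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_tokens = np.array([[1.0, 0.0], [1.0, 1.0]])
P, R, F1 = greedy_match_f1(hyp_tokens, ref_tokens)
print(round(P, 3), round(R, 3), round(F1, 3))
```

Because each token is matched to its most similar counterpart rather than to an identical string, near-synonyms with close embeddings still contribute a high similarity.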

✏️ BERTScore detects synonyms

Reference: "The transmission transmits a torque"

Generated: "The gearbox forwards a torque"

Lexical metrics: BLEU is low (different words). BERTScore: high, because "transmits" and "forwards" are semantically equivalent in this context.

BERTScore F1 ≈ 0.94

Fig. 5. Metric behavior as lexical variation increases. BERTScore degrades substantially more slowly than BLEU/ROUGE on semantically equivalent rephrasings.

⚠️ Known weaknesses of classical metrics

BLEU/ROUGE: correlate weakly with human judgment in open-domain generation (Reiter, 2018; Sai et al., 2022). With single reference answers the variance is extremely high.
BERTScore: depends on the encoder model used. Cross-lingual transfer and domain drift can distort values.
All three: can be subject to Goodhart's Law — "When a measure becomes a target, it ceases to be a good measure."

05

Statistical Evaluation & Significance

A single metric value on a test set is a point estimate with sampling variance. Without quantifying this uncertainty, statements like "system A is better than system B" are scientifically unfounded. This chapter introduces three canonical tools: confidence intervals, the bootstrap, and inter-annotator agreement.

5.1 Confidence intervals for metrics

For a mean $\bar X$ over $n$ queries with sample standard deviation $s$, the central limit theorem gives approximately:

95% confidence interval (normal-approximated)

$$\text{CI}_{95\%} = \bar X \pm 1.96 \cdot \frac{s}{\sqrt{n}}$$

✏️ Worked example: CI for MRR

Over 100 test queries, $\overline{\text{MRR}} = 0.708$ with $s = 0.28$. The 95% CI is:

1. Standard error: $\text{SE} = 0.28/\sqrt{100} = 0.028$

2. Half-width: $1.96 \cdot 0.028 = 0.055$

3. $\text{CI}_{95\%} = [0.653,\; 0.763]$

95% CI = [0.653 ; 0.763]
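The same calculation as a small helper, working from summary statistics. This is a sketch of the normal approximation only; the function name `normal_ci` is an assumption of this example.

```python
import math

def normal_ci(mean: float, std: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation confidence interval from summary statistics."""
    half_width = z * std / math.sqrt(n)   # z times the standard error
    return mean - half_width, mean + half_width

lo, hi = normal_ci(mean=0.708, std=0.28, n=100)
print(f"[{lo:.3f}, {hi:.3f}]")  # prints "[0.653, 0.763]"
```

Quadrupling the number of queries halves the interval width, since the standard error scales with $1/\sqrt{n}$.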

5.2 Bootstrap resampling

For metrics whose sampling distribution is far from normal (e.g., NDCG, BLEU), the non-parametric bootstrap (Efron, 1979) is preferable. From the test set, $B$ resamples (typically $B = 1000$), each of the same size as the test set, are drawn with replacement, and the metric is recomputed on each; quantiles are then read directly from the distribution of the $B$ metric values.

Fig. 6. Bootstrap distribution of NDCG@10 over 1000 resamples. The 95% CI is read off as the 2.5% and 97.5% quantiles (dashed lines).

5.3 Inter-annotator agreement (Cohen's $\kappa$)

When relevance is annotated by humans, the reliability of the annotation itself must be checked. Cohen's kappa corrects observed agreement by the agreement expected by chance:

Cohen's Kappa

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

$\kappa$ value	Interpretation (Landis & Koch, 1977)
< 0.00	Worse than chance
0.00 – 0.20	Slight agreement
0.21 – 0.40	Fair agreement
0.41 – 0.60	Moderate agreement
0.61 – 0.80	Substantial agreement
0.81 – 1.00	Almost perfect
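A from-scratch sketch of the kappa formula for two annotators follows. The label lists are hypothetical binary relevance judgments invented for this example; $p_e$ is estimated from each annotator's own label frequencies.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Hypothetical relevance labels (1 = relevant, 0 = not relevant)
ann_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
ann_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
kappa = cohens_kappa(ann_a, ann_b)
print(round(kappa, 3))  # p_o = 0.80, p_e = 0.52
```

Here the annotators agree on 80% of the items, yet kappa lands in the "moderate" band of the table because more than half of that agreement is expected by chance.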

💡 Scientific reporting practice

It is increasingly expected in conference papers (ACL, EMNLP, NeurIPS) that every primary metric is reported with a confidence interval plus a significance test against the baseline (e.g., paired bootstrap test). An improvement of 0.3 BLEU points without a significance statement carries little evidential weight.
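A paired bootstrap test can be sketched as follows. This is one common variant, assuming both systems were scored on the same queries and that the metric is a mean of per-query scores; the function name, the synthetic scores, and the one-sided formulation are assumptions of this example.

```python
import numpy as np

def paired_bootstrap_test(scores_a, scores_b, n_boot: int = 10_000, seed: int = 0) -> float:
    """One-sided paired bootstrap: fraction of resamples in which system A
    does NOT beat system B on the mean per-query score."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("scores must be aligned per query")
    rng = np.random.default_rng(seed)
    diffs = a - b                                    # per-query differences
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)             # mean difference per resample
    return float(np.mean(boot_means <= 0.0))

# Synthetic per-query scores, for illustration only
rng = np.random.default_rng(1)
sys_b = rng.uniform(0.3, 0.9, size=200)
sys_a = sys_b + rng.normal(0.05, 0.05, size=200)     # A is better by about 0.05 on average
p_value = paired_bootstrap_test(sys_a, sys_b)
print(p_value)
```

Resampling the per-query differences, rather than each system separately, keeps the pairing intact, so query difficulty cancels out and only the difference between systems is tested.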

06

Interactive Metric Comparison

The demo below illustrates the fundamental behavior of BLEU, ROUGE, and BERTScore on five classes of text variation, to which the three metrics react in qualitatively different ways.

🎮 Metric playground

The values are based on typical empirical magnitudes from Sai et al. (2022).

Reference: The motor reaches 3000 revolutions per minute.

Hypothesis: The motor reaches 3000 revolutions per minute.

Metric	Value
BLEU-2	1.00
ROUGE-L	1.00
BERTScore	1.00

Identical hypothesis and reference → all three metrics return their maximum value.

⚠️ The contradiction case

Pay particular attention to the last variant: when the hypothesis repeats the reference almost verbatim but states the opposite (e.g., "does NOT reach 3000…"), BLEU stays artificially high. This is a well-known failure mode of all n-gram metrics (Freitag et al., 2022) and one of the central motivations for developing learned metrics such as BLEURT (Sellam et al., 2020).

07

Decision Guide

The choice of the "right" metric depends on application context. The table below offers pragmatic recommendations for typical engineering scenarios:

Use case	Primary metric	Secondary metric	Reasoning
Safety data sheets (RAG)	Faithfulness	Context recall	Hallucinations are safety-critical
Standards / regulation search	Context recall	NDCG@10	Completeness over precision
FAQ bot / search UI	MRR	Precision@1	First result dominates UX
Machine translation	BLEU + chrF	BERTScore	Established standard metrics
Summarization	ROUGE-L	BERTScore	Recall orientation essential
Code generation	Pass@K (functional test)	BLEU	Correctness is binary
Open-domain QA	RAGAS profile (all 4)	Human eval	Multidimensional quality

7.1 Anti-patterns: common evaluation mistakes

⚠️ What not to do

One-dimensional optimization: looking at only one metric ignores the multidimensional nature of quality.
Metric hacking (Goodhart's Law): directly optimizing on a metric often leads to quality loss on other axes.
Neglecting human evaluation: automatic metrics are proxies; final validation requires sample-based human evaluation.
Forgetting the retrieval component: in RAG, poor end-to-end quality is often wrongly attributed to the generator when in fact retrieval is failing.
Missing statistical significance: point estimates without confidence intervals are not informative (see Ch. 5).
Test set contamination: when training and test data overlap, all metrics become unreliable.

08

Python Implementation

The reference implementations below show the core logic. For production systems, use of established libraries is recommended (RAGAS, evaluate, bert_score, scikit-learn).

8.1 Custom implementation of retrieval metrics

import numpy as np
from typing import Sequence, Set

def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of top-K hits that are relevant (k must be positive)."""
    top_k = retrieved[:k]
    return sum(d in relevant for d in top_k) / k

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-K."""
    if not relevant:
        return 0.0  # no relevant documents: recall is undefined, report 0
    top_k = retrieved[:k]
    return sum(d in relevant for d in top_k) / len(relevant)

def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant hit (0 if none)."""
    for i, doc in enumerate(retrieved, 1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(all_retrieved: Sequence[Sequence[str]],
                         all_relevant: Sequence[Set[str]]) -> float:
    """Mean of the reciprocal ranks over all queries."""
    return float(np.mean([
        reciprocal_rank(r, rel) for r, rel in zip(all_retrieved, all_relevant)
    ]))

def ndcg_at_k(relevance_grades: Sequence[float], k: int) -> float:
    """NDCG@K with graded relevance; grades are given in retrieved order."""
    def dcg(scores):
        # enumerate is 0-based, so rank i + 1 is discounted by log2(i + 2)
        return sum((2**s - 1) / np.log2(i + 2) for i, s in enumerate(scores[:k]))
    actual = dcg(relevance_grades)
    # The ideal ordering is built from the supplied grades only, not the whole corpus
    ideal = dcg(sorted(relevance_grades, reverse=True))
    return actual / ideal if ideal > 0 else 0.0

8.2 Bootstrap confidence interval

import numpy as np

def bootstrap_ci(scores, n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 42) -> tuple:
    """Non-parametric (1 - alpha) CI of the mean via bootstrap resampling.

    Returns (mean, lower bound, upper bound); alpha = 0.05 gives a 95% CI.
    """
    scores = np.asarray(scores, dtype=float)  # also accepts plain lists
    rng = np.random.default_rng(seed)
    n = len(scores)
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample indices with replacement
        boot_means[b] = scores[idx].mean()
    lo = np.quantile(boot_means, alpha / 2)
    hi = np.quantile(boot_means, 1 - alpha / 2)
    return float(scores.mean()), float(lo), float(hi)

8.3 RAGAS, BLEU, ROUGE, BERTScore (libraries)

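# RAGAS
# eval_ds is a placeholder for an evaluation dataset holding questions, answers,
# retrieved contexts, and gold answers (exact column names depend on the RAGAS version).
# The alias avoids confusion with the Hugging Face `evaluate` package used below.
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

results = ragas_evaluate(
    dataset=eval_ds,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)

# BLEU / ROUGE
# hyps and refs are placeholders: equal-length lists of hypothesis and reference strings.
from evaluate import load
bleu  = load("bleu")
rouge = load("rouge")
bleu_score  = bleu.compute(predictions=hyps,  references=[[r] for r in refs])
rouge_score = rouge.compute(predictions=hyps, references=refs)

# BERTScore (English); returns one precision, recall, and F1 value per sentence pair
from bert_score import score
P, R, F1 = score(hyps, refs, lang="en", model_type="roberta-large")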

10

Exercises

📝 Exercise 1: Precision & Recall

10 documents are retrieved; those at ranks 1, 3, 5, 8, 10 are relevant. The corpus contains 15 relevant documents in total. Compute Precision@5, Precision@10, Recall@5, Recall@10, and F1@10.

Solution:

Precision@5 = $3/5 = 0.60$ | Precision@10 = $5/10 = 0.50$

Recall@5 = $3/15 = 0.20$ | Recall@10 = $5/15 \approx 0.333$

F1@10 = $\frac{2 \cdot 0.5 \cdot 0.333}{0.5 + 0.333} \approx 0.40$

📝 Exercise 2: MRR over 5 queries

First relevant ranks: $q_1 \to 1$, $q_2 \to 4$, $q_3 \to 2$, $q_4 \to 1$, $q_5 \to 3$. Compute the MRR.

Solution:

$\text{MRR} = \dfrac{1 + 0.25 + 0.5 + 1 + 0.333}{5} = \dfrac{3.083}{5} \approx 0.617$

📝 Exercise 3: Faithfulness

Context: "Transformer: 230 V → 24 V, ratio 9.6 : 1, power 500 W". The answer repeats these three facts and additionally mentions "95% efficiency" (not present in the context). Compute the faithfulness.

Solution:

3 supported claims + 1 hallucination → $F = 3/4 = 0.75$

📝 Exercise 4: NDCG with graded relevance

Top-5 relevance grades: $(2, 3, 0, 1, 2)$. Compute NDCG@5.

Solution:

DCG = $3 + \frac{7}{1.585} + 0 + \frac{1}{2.322} + \frac{3}{2.585} = 3 + 4.416 + 0.431 + 1.160 = 9.007$

Ideal ordering $(3,2,2,1,0)$: IDCG = $7 + \frac{3}{1.585} + \frac{3}{2.000} + \frac{1}{2.322} + 0 = 7 + 1.893 + 1.500 + 0.431 = 10.824$

$\text{NDCG@}5 = 9.007 / 10.824 \approx 0.832$

📝 Exercise 5: Confidence interval

On a test set of $n=50$ queries the mean BERTScore is $\bar X = 0.82$ with $s = 0.11$. Compute the 95% CI.

Solution:

$\text{SE} = 0.11/\sqrt{50} \approx 0.0156$

Half-width: $1.96 \cdot 0.0156 \approx 0.0305$

$\text{CI}_{95\%} = [0.790,\; 0.850]$

11

Quick Reference & Bibliography

11.1 Metric overview

Metric	Formula (short)	Range	Main use
Precision@K	$|\mathcal{R}_K \cap \mathcal{V}| / K$	$[0,1]$	Search
Recall@K	$|\mathcal{R}_K \cap \mathcal{V}| / |\mathcal{V}|$	$[0,1]$	Completeness
MRR	$\frac{1}{|Q|}\sum 1/\text{rank}_i$	$[0,1]$	First relevant hit
NDCG@K	DCG / IDCG	$[0,1]$	Graded relevance
Faithfulness	$|\text{supported}|/|\text{total}|$	$[0,1]$	Hallucination detection
BLEU	$\text{BP}\cdot\exp(\sum w_n \log p_n)$	$[0,1]$	Translation
ROUGE-N/L	n-gram / LCS recall	$[0,1]$	Summarization
BERTScore	$F_1$ of embedding cosines	$[0,1]$	Semantic evaluation

📚 Bibliography

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demonstrations, pp. 150–158.
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002, pp. 311–318.
Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop on Text Summarization, pp. 74–81.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated Gain-Based Evaluation of IR Techniques. ACM TOIS, 20(4), pp. 422–446.
Sai, A. B., Mohankumar, A. K., & Khapra, M. M. (2022). A Survey of Evaluation Metrics Used for NLG Systems. ACM Computing Surveys, 55(2).
Reiter, E. (2018). A Structured Review of the Validity of BLEU. Computational Linguistics, 44(3), pp. 393–401.
Ovadia, O., et al. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024.
Freitag, M., et al. (2022). Results of the WMT22 Metrics Shared Task. WMT 2022.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1), pp. 1–26.
Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), pp. 159–174.
Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning Robust Metrics for Text Generation. ACL 2020.