RAG & Fine-Tuned LLM Evaluation Guide

Evaluating Retrieval-Augmented Generation Systems
and Fine-Tuned Large Language Models

An introduction with examples, mathematical derivation, and discussion of common evaluation metrics.

Edition: Version 1.0
Last updated: April 2026
01

Introduction & Foundations

1.1 What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) denotes an architectural pattern that extends the static knowledge base of a Large Language Model (LLM) with dynamic access to external document collections. Formally, the process is described as a three-stage pipeline:

RAG Pipeline (formal definition)
$$\hat{y} = \mathcal{G}\bigl(q,\; \mathcal{R}(q;\, \mathcal{D})\bigr) \quad\text{with}\quad \mathcal{R}(q;\mathcal{D}) = \operatorname*{argtop\text{-}k}_{d \in \mathcal{D}} \;\operatorname{sim}(q, d)$$

Here $q$ denotes the query, $\mathcal{D}$ the document collection, $\mathcal{R}$ the retriever (e.g., a dense bi-encoder with cosine similarity), and $\mathcal{G}$ the generator (the LLM). The composition $\mathcal{G} \circ \mathcal{R}$ unifies parametric knowledge (in the model weights) with non-parametric knowledge (in the index), a concept discussed in the literature as hybrid memory (Lewis et al., 2020).
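
The retrieval step $\mathcal{R}(q;\mathcal{D})$ can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production retriever: the function name `retrieve_top_k` and the toy 3-dimensional embeddings are hypothetical, and a real system would obtain the vectors from a bi-encoder.

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int) -> list[int]:
    """Return the indices of the k documents with the highest cosine similarity."""
    # Normalize so that the dot product equals the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    # Sort by descending similarity and keep the first k (the argtop-k).
    return [int(i) for i in np.argsort(-sims, kind="stable")[:k]]

# Toy embeddings (hypothetical values, for illustration only).
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])
top2 = retrieve_top_k(query, docs, k=2)
```

The returned indices would then be used to look up the document texts that are passed to the generator together with the query.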

Scientific basis: Lewis et al. (2020) introduced RAG as an architecture. Es et al. (2024) introduced RAGAS, a widely cited method for reference-free evaluation (EACL 2024).

Application example: RAG in manufacturing engineering

Query: "How do I optimize cutting speed for CNC milling of aluminum?"

Retrieved documents (top-2):

  • $d_1$: "For AlMg3 we recommend cutting speeds of 200–300 m/min..."
  • $d_2$: "The feed rate for aluminum should be 0.1–0.3 mm/tooth..."

Generated answer: "For CNC milling of aluminum I recommend 200–300 m/min with HSS cutters and a feed rate of 0.1–0.3 mm/tooth."

Note: answer quality depends both on retrieval quality and on generator faithfulness. Both components require separate metrics.

1.2 RAG vs. Fine-Tuning: A conceptual comparison

RAG approach

  • Knowledge externally indexed (non-parametric)
  • Updates possible without retraining
  • Source citation and traceability
  • Higher inference latency (retrieval step)
  • Suited to fact-intensive domains

Fine-tuning approach

  • Knowledge in model weights (parametric)
  • Updates require retraining
  • No intrinsic source attribution
  • Lower inference latency
  • Suited to style, format, and domain adaptation

Design decision in practice

The two approaches are complementary, not competing. Empirical studies (Ovadia et al., 2024) indicate that RAG is superior for fact-based tasks, while fine-tuning is preferable for stylistic adaptation and latency-critical applications. A hybrid approach (fine-tuning on domain style plus RAG for current facts) is widely used in production systems.

02

Retrieval Metrics

Retrieval metrics evaluate the quality of the component $\mathcal{R}$. They differ in whether they (a) consider pure set relations (precision, recall), (b) account for position (MRR), or (c) incorporate graded relevance (NDCG).

2.1 Precision@K and Recall@K

Precision & Recall at cutoff K
$$\text{Precision@}K = \frac{|\mathcal{R}_K \cap \mathcal{V}|}{K}, \qquad \text{Recall@}K = \frac{|\mathcal{R}_K \cap \mathcal{V}|}{|\mathcal{V}|}$$

Here $\mathcal{R}_K$ is the set of top-$K$ retrieved documents and $\mathcal{V}$ is the set of all relevant documents in the corpus (ground truth).

๐Ÿ“ Statistical interpretation (click to expand)

Precision@K is the empirical estimator of the conditional probability $P(\text{relevant} \mid \text{retrieved in top-}K)$. Recall@K estimates $P(\text{retrieved in top-}K \mid \text{relevant})$. Both are asymmetric: a system with high precision but low recall is "precise but incomplete"; conversely, it returns many hits but with a high noise fraction.

The F1 score $F_1 = 2 \cdot \frac{P \cdot R}{P+R}$ forms the harmonic mean and is more pessimistic than the arithmetic mean (it penalizes imbalance more strongly).
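
A short sketch makes the "more pessimistic" behavior concrete. The function name `f1_score` and the two precision/recall pairs are chosen for illustration only; both pairs have the same arithmetic mean of 0.5.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (defined as 0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.5, 0.5)     # arithmetic mean 0.5
imbalanced = f1_score(0.9, 0.1)   # arithmetic mean is also 0.5

print(f"balanced:   F1 = {balanced:.2f}")
print(f"imbalanced: F1 = {imbalanced:.2f}")
```

For the balanced pair the harmonic and arithmetic means coincide; for the imbalanced pair F1 falls well below 0.5, which is the penalty on imbalance described above.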

โœ๏ธ Worked example: Precision@5 โ€” Thiele-Small parameters

Query: "How do I determine the Thiele-Small parameters of a loudspeaker?"

RankDocumentRelevant?
1"Thiele-Small measurement via impedance analysis"โœ“
2"Enclosure design for bass loudspeakers"โœ—
3"Electromechanical modeling"โœ“
4"Crossover design"โœ—
5"Determining $Q_{ts}$, $V_{as}$ and $f_s$"โœ“
1
Relevant in top-5: 3 (ranks 1, 3, 5)
2
$\text{Precision@}5 = \dfrac{3}{5} = 0.60$
Precision@5 =0.60


2.2 Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank
$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{1}{\operatorname{rank}_i}$$

$\operatorname{rank}_i$ denotes the position of the first relevant document for query $q_i$. MRR is especially meaningful when the user typically reads only the first result (search UIs, FAQ bots).

โœ๏ธ Worked example: MRR over 4 queries

QueryFirst relevant rankReciprocal
$q_1$: "Thiele-Small parameters"1$1/1 = 1.000$
$q_2$: "FEM simulation"3$1/3 = 0.333$
$q_3$: "ISO tolerance calculation"2$1/2 = 0.500$
$q_4$: "Heat treatment of steel"1$1/1 = 1.000$
1
$\text{MRR} = \frac{1}{4}(1.000 + 0.333 + 0.500 + 1.000) = 0.708$
MRR =0.708

2.3 Normalized Discounted Cumulative Gain (NDCG)

Scientific basis: Järvelin & Kekäläinen (2002). "Cumulated Gain-Based Evaluation of IR Techniques." ACM TOIS, 20(4), pp. 422–446. DOI: 10.1145/582415.582418
DCG · IDCG · NDCG
$$\text{DCG@}K = \sum_{i=1}^{K} \frac{2^{\operatorname{rel}_i} - 1}{\log_{2}(i+1)}, \qquad \text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$

NDCG is the only one of the metrics presented here that admits graded relevance (e.g., $\operatorname{rel}_i \in \{0,1,2,3\}$): "not relevant", "marginal", "relevant", "highly relevant". The logarithm in the denominator implements position discounting: later hits contribute less to the overall score.
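
The effect of position discounting can be illustrated with a small sketch. The helper names `dcg` and `ndcg` and the three-document grade lists are assumptions made for this example; the ideal ordering is computed from the retrieved grades only.

```python
import math

def dcg(grades: list[int]) -> float:
    """DCG with gain 2^rel - 1 and discount log2(i + 1), ranks starting at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(grades, start=1))

def ndcg(grades: list[int]) -> float:
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

good_first = ndcg([3, 1, 0])   # highly relevant document at rank 1
good_last = ndcg([0, 1, 3])    # same documents in the worst order

print(f"best order:  {good_first:.3f}")
print(f"worst order: {good_last:.3f}")
```

Both lists contain the same documents, so set-based metrics such as Precision@3 would not distinguish them; only the ordering differs, and NDCG drops accordingly.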

โœ๏ธ Worked example: NDCG@5 โ€” AOI defect detection

Rank $i$Document$\operatorname{rel}_i$$2^{\operatorname{rel}_i}-1$$\log_2(i+1)$Gain
1"Deep learning for AOI"371.0007.000
2"Introduction to QA"001.5850.000
3"AOI illumination"232.0001.500
4"Statistics"112.3220.431
5"CNN defect classification"372.5852.708
1
$\text{DCG@}5 = 7.00 + 0 + 1.50 + 0.43 + 2.71 = 11.64$
2
Ideal ordering: $(3,3,2,1,0)$ โ†’ $\text{IDCG@}5 = 13.36$
3
$\text{NDCG@}5 = \dfrac{11.64}{13.36} = 0.871$
NDCG@5 =0.871
Fig. 1. Comparison of the four retrieval metrics for the shared application scenario.

2.4 Confusion matrix & precision-recall curve

A complete diagnosis of retrieval behavior requires examining the full trade-off curve, not just a single point:

Fig. 2. Precision-recall curves of three hypothetical retrievers. The average precision (AP) corresponds to the area under the curve and is the position-aware generalization of precision.
Fig. 3. Confusion matrix for retrieval classification. TP = relevant + retrieved, FP = irrelevant + retrieved, FN = relevant + not retrieved, TN = irrelevant + not retrieved.
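
The average precision mentioned in the caption of Fig. 2 can be computed without plotting the curve. The sketch below uses the common non-interpolated form (the mean of Precision@i over the ranks of the relevant hits); the function name `average_precision` is hypothetical, and the example assumes, for illustration only, that the three relevant documents in the list are the only relevant ones in the corpus.

```python
def average_precision(is_relevant: list[bool], n_relevant: int) -> float:
    """Mean of Precision@i over the ranks i at which a relevant document appears."""
    if n_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank   # Precision@rank at this relevant hit
    # Relevant documents that were never retrieved contribute 0.
    return total / n_relevant

# Relevance pattern with hits at ranks 1, 3, and 5.
ap = average_precision([True, False, True, False, True], n_relevant=3)
print(f"AP = {ap:.3f}")
```

Unlike Precision@K, this value changes when the same relevant documents are moved to later ranks, which is what "position-aware" means here.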

โš ๏ธ Critical limitation

All four metrics presuppose that ground-truth relevance is known โ€” in practice an annotation-expensive process. In large corpora, therefore, only a sample is typically annotated, leading to wide confidence intervals. Furthermore, semantically similar but lexically different documents are often incorrectly marked as irrelevant (annotator bias).

03

RAGAS Framework: Generation Metrics

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL 2024 Demonstrations, pp. 150–158.

The RAGAS framework is largely reference-free: apart from context recall, which compares the context against a gold answer, its metrics require no human-annotated gold answers and instead use a "judge LLM" to operationalize the evaluation criteria. The four main metrics systematically cover both pipeline components:

| Metric | Compares | Quantifies | Component |
|---|---|---|---|
| Faithfulness | Answer vs. context | Faithfulness / hallucination | Generator $\mathcal{G}$ |
| Answer relevance | Answer vs. query | Topic alignment | Generator $\mathcal{G}$ |
| Context precision | Context vs. query | Signal/noise in context | Retriever $\mathcal{R}$ |
| Context recall | Context vs. gold answer | Completeness | Retriever $\mathcal{R}$ |

3.1 Faithfulness

Faithfulness
$$F = \frac{|\{\,s \in S(\hat y) \;:\; s \;\text{is supported by}\; c\}|}{|S(\hat y)|}$$

Here $S(\hat y)$ decomposes the generated answer into atomic claims, each individually checked against the context $c$. A claim is considered "supported" if it is logically derivable from $c$.
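
The counting step behind the formula can be sketched as follows. This is an illustration under loud assumptions: claim decomposition is taken as given, and `substring_judge` is a toy stand-in for the judge LLM that only checks verbatim occurrence. The names `faithfulness`, `substring_judge`, and the pump example are hypothetical.

```python
from typing import Callable

def faithfulness(claims: list[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of atomic claims that the judge marks as supported by the context."""
    if not claims:
        return 0.0  # undefined for an answer without claims; 0.0 by convention here
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def substring_judge(claim: str, context: str) -> bool:
    """Toy stand-in for a judge LLM: supported means the claim occurs verbatim."""
    return claim.lower() in context.lower()

context = "The pump is rated for 16 bar. The housing is cast iron."
claims = ["rated for 16 bar", "housing is cast iron", "maintenance-free"]
score = faithfulness(claims, context, substring_judge)
print(f"Faithfulness = {score:.3f}")
```

Passing the judge in as a function keeps the metric itself independent of how support is decided, so the toy judge could be swapped for an LLM call without touching the counting logic.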

โœ๏ธ Worked example: Faithfulness โ€” materials engineering

Context $c$: "42CrMo4 is a tempered steel with a tensile strength of 900โ€“1100 MPa. Hardness 28โ€“34 HRC. Weldable with preheating."

Answer $\hat y$: "42CrMo4 has a tensile strength of 900โ€“1100 MPa, a hardness of 28โ€“34 HRC, is weldable with preheating, and is highly corrosion-resistant."

#Atomic claimSupported by $c$?
1Tensile strength 900โ€“1100 MPaโœ“
2Hardness 28โ€“34 HRCโœ“
3Weldable with preheatingโœ“
4Corrosion-resistantโœ— Hallucination!
Faithfulness =0.75 (3/4)


3.2 Answer Relevance

Answer Relevance
$$\text{AR}(q, \hat y) = \frac{1}{n}\sum_{j=1}^{n} \cos\bigl(\mathbf{e}(q),\; \mathbf{e}(\tilde q_j)\bigr) \quad\text{with}\quad \tilde q_j \sim \text{LLM}(\hat y)$$

From the generated answer $\hat y$, $n$ synthetic questions $\tilde q_j$ are back-generated. The cosine similarity of their embeddings to the original query $q$ measures whether the answer addresses the right question.
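
Given embeddings, the averaging step is a few lines of NumPy. The sketch assumes the synthetic questions have already been generated and embedded; the 2-dimensional vectors and the names `cosine` and `answer_relevance` are hypothetical and serve only to make the arithmetic visible.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(query_emb: np.ndarray, synthetic_embs: list[np.ndarray]) -> float:
    """Mean cosine similarity between the query and the back-generated questions."""
    return float(np.mean([cosine(query_emb, e) for e in synthetic_embs]))

# Hypothetical 2-D embeddings; real ones would come from an embedding model.
q = np.array([1.0, 0.0])
generated = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
ar = answer_relevance(q, generated)
print(f"AR = {ar:.3f}")
```

One off-topic synthetic question (the third vector, orthogonal to the query) is enough to pull the mean down noticeably, which mirrors how a partially off-topic answer lowers the score.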

โœ๏ธ Worked example: Answer Relevance โ€” PID controller

Query $q$: "How do I tune a PID controller?"

Synthetic question $\tilde q_j$Cosine similarity
"How do I tune PID with Ziegler-Nichols?"0.92
"What are $K_p$, $T_i$, $T_d$?"0.68
"Who invented the PID controller?"0.25
1
$\text{AR} = \frac{1}{3}(0.92 + 0.68 + 0.25) \approx 0.62$
Answer Relevance =0.62

3.3 Context Recall

Context Recall
$$\text{CR} = \frac{|\{\,s \in S(y^*) \;:\; s \in c\}|}{|S(y^*)|}$$

The fraction of atomic claims from the gold answer $y^*$ that is covered by the retrieved context $c$. Low CR → retriever delivers incomplete context → the generator cannot deliver a complete answer even with perfect faithfulness.

โœ๏ธ Worked example: Context Recall โ€” welding defects

Gold answer $y^*$ covers 6 claims: porosity, undercut, lack of fusion, cracks, spatter, root defects

Retrieved context $c$ contains: porosity, undercut, lack of fusion, spatter (4 of 6)

Context Recall =0.667 (4/6)
Fig. 4. RAGAS profile of two hypothetical systems compared. The plot visualizes strengths and weaknesses per dimension.

Diagnostic use of the RAGAS metrics

The four metrics enable causal diagnosis: low context recall → retriever problem. Low faithfulness with good context recall → generator hallucinates. Low answer relevance → generator answers a different question. This modularity is the central advance over monolithic end-to-end metrics.
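
The diagnostic rules above can be written down as a small decision function. This is only a sketch: the function name `diagnose` and the cut-off of 0.7 are assumptions for illustration, and a sensible threshold would have to be calibrated per application.

```python
def diagnose(faithfulness: float, answer_relevance: float,
             context_recall: float, threshold: float = 0.7) -> list[str]:
    """Map a RAGAS profile to likely failure sources (threshold is an assumed cut-off)."""
    findings = []
    if context_recall < threshold:
        findings.append("retriever: context is incomplete")
    if faithfulness < threshold and context_recall >= threshold:
        findings.append("generator: hallucinates despite good context")
    if answer_relevance < threshold:
        findings.append("generator: answers a different question")
    return findings or ["no issue flagged at this threshold"]

report = diagnose(faithfulness=0.55, answer_relevance=0.90, context_recall=0.85)
print(report)
```

Low faithfulness is attributed to the generator only when context recall is acceptable, since with incomplete context the retriever is the more likely root cause.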

04

Fine-Tuning Metrics

Classical text-generation metrics evaluate how well a generated hypothesis $\hat y$ matches one (or more) reference answer(s) $y^*$. They differ primarily in granularity (n-grams vs. tokens) and in representation (lexical vs. semantic).

4.1 BLEU (Bilingual Evaluation Understudy)

Papineni et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation." ACL 2002, pp. 311–318. DOI: 10.3115/1073083.1073135
BLEU-N
$$\text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \quad \text{BP} = \begin{cases} 1 & \text{if } |\hat y| > |y^*| \\ e^{1 - |y^*|/|\hat y|} & \text{otherwise}\end{cases}$$

BLEU geometrically averages n-gram precisions $p_n$ and multiplies by a brevity penalty (BP) that penalizes overly short hypotheses. Standard weights: $w_n = 1/N$ with $N=4$.
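
The formula can be turned into a compact sentence-level sketch. It assumes a single reference, whitespace tokenization, and no smoothing, so it is not a replacement for an established implementation; the helper names and the pump sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp: list[str], ref: list[str], n: int) -> float:
    """Clipped precision p_n: an n-gram counts at most as often as it occurs in the reference."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / total

def bleu(hyp: list[str], ref: list[str], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights w_n = 1/max_n and no smoothing."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if not hyp or min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    # Brevity penalty as in the formula above.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the pump delivers 40 liters per minute".split()
hyp = "the pump delivers 40 liters each minute".split()
bleu_2 = bleu(hyp, ref, max_n=2)
print(f"BLEU-2 = {bleu_2:.3f}")
```

Because no smoothing is applied, a single missing n-gram order drives the score to zero, which is one reason sentence-level BLEU is usually computed with a smoothed library implementation.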

โœ๏ธ Worked example: BLEU-1 and BLEU-2

Reference: "The motor reaches a speed of 3000 revolutions per minute"

Generated: "The motor has a speed of 3000 revolutions per minute"

1
Unigram precision $p_1 = 8/9 \approx 0.889$ (most words match)
2
Bigram precision $p_2 = 6/8 = 0.750$
3
Equal length โ†’ $\text{BP} = 1$
BLEU-1 โ‰ˆ0.889
BLEU-2 โ‰ˆ0.816

4.2 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." ACL Workshop, pp. 74–81.
ROUGE-N
$$\text{ROUGE-}N = \frac{\sum_{g \in y^*} \min(\#g_{\hat y}, \#g_{y^*})}{\sum_{g \in y^*} \#g_{y^*}}$$

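ROUGE-N is recall-oriented (the denominator is based on the reference, not the hypothesis). ROUGE-L uses the longest common subsequence (LCS): matching words must appear in the same order but need not be contiguous, which makes it robust to insertions and gaps, though not to reordering.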
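
Both variants can be sketched in plain Python. The sketch assumes a single reference and whitespace tokenization and covers only the recall side of ROUGE-L; the function names are hypothetical, and the sentences are lowercased versions of the motor example used earlier.

```python
from collections import Counter

def rouge_n_recall(hyp: list[str], ref: list[str], n: int = 1) -> float:
    """Fraction of reference n-grams that also occur in the hypothesis (clipped counts)."""
    def grams(tokens: list[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp_counts, ref_counts = grams(hyp), grams(ref)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    return sum(min(hyp_counts[g], c) for g, c in ref_counts.items()) / total

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(hyp: list[str], ref: list[str]) -> float:
    """LCS length divided by the reference length (recall variant of ROUGE-L)."""
    return lcs_length(hyp, ref) / len(ref) if ref else 0.0

ref = "the motor reaches a speed of 3000 revolutions per minute".split()
hyp = "the motor has a speed of 3000 revolutions per minute".split()
r1 = rouge_n_recall(hyp, ref, n=1)
rl = rouge_l_recall(hyp, ref)
print(f"ROUGE-1 recall = {r1:.2f}, ROUGE-L recall = {rl:.2f}")
```

The LCS table is kept to two rows at a time, so memory grows with the length of one sentence rather than with the product of both lengths.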

4.3 BERTScore

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). "BERTScore: Evaluating Text Generation with BERT." ICLR 2020. [arXiv:1904.09675]
BERTScore F1
$$P_{\text{BERT}} = \frac{1}{|\hat y|}\sum_{\hat x \in \hat y} \max_{x \in y^*} \mathbf{e}(\hat x)^{\!\top}\mathbf{e}(x), \quad F_1 = \frac{2 P R}{P+R}$$

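Instead of counting exact token matches, BERTScore computes the maximum cosine similarity of contextualized embeddings (e.g., from BERT or RoBERTa). Recall $R_{\text{BERT}}$ is defined symmetrically to $P_{\text{BERT}}$: it averages over the reference tokens and maximizes over the hypothesis tokens. This addresses a fundamental problem of lexical metrics: semantic equivalence under lexical variation.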
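
The greedy matching step can be sketched with NumPy on pre-computed token embeddings. The sketch omits the optional idf weighting and baseline rescaling of the original method; the function name `greedy_match_scores` and the 2-dimensional "embeddings" are hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def greedy_match_scores(hyp_embs: np.ndarray,
                        ref_embs: np.ndarray) -> tuple[float, float, float]:
    """Precision, recall, and F1 from greedy max-cosine matching of token embeddings."""
    # L2-normalize rows so that the dot product is the cosine similarity.
    h = hyp_embs / np.linalg.norm(hyp_embs, axis=1, keepdims=True)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sim = h @ r.T                                # shape: (hyp tokens, ref tokens)
    precision = float(sim.max(axis=1).mean())    # best reference match per hypothesis token
    recall = float(sim.max(axis=0).mean())       # best hypothesis match per reference token
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Hypothetical 2-D token embeddings (a real run would use BERT or RoBERTa outputs).
hyp_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
ref_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p, r, f1 = greedy_match_scores(hyp_tokens, ref_tokens)
print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f1:.3f}")
```

Precision is perfect here because every hypothesis token has an exact counterpart, while recall is lower because the third reference token is only partially covered.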

โœ๏ธ BERTScore detects synonyms

Reference: "The transmission transmits a torque"

Generated: "The gearbox forwards a torque"

Lexical metrics: BLEU is low (different words). BERTScore: high, because "transmits" and "forwards" are semantically equivalent in this context.

BERTScore F1 โ‰ˆ0.94
Fig. 5. Metric behavior as lexical variation increases. BERTScore degrades substantially more slowly than BLEU/ROUGE on semantically equivalent rephrasings.

โš ๏ธ Known weaknesses of classical metrics

  • BLEU/ROUGE: correlate weakly with human judgment in open-domain generation (Reiter, 2018; Sai et al., 2022). With single reference answers the variance is extremely high.
  • BERTScore: depends on the encoder model used. Cross-lingual transfer and domain drift can distort values.
  • All three: can be subject to Goodhart's Law โ€” "When a measure becomes a target, it ceases to be a good measure."
05

Statistical Evaluation & Significance

A single metric value on a test set is a point estimate with sampling variance. Without quantifying this uncertainty, statements like "system A is better than system B" are scientifically unfounded. This chapter introduces three canonical tools: confidence intervals, the bootstrap, and inter-annotator agreement.

5.1 Confidence intervals for metrics

For a mean $\bar X$ over $n$ queries with sample standard deviation $s$, the central limit theorem gives approximately:

95% confidence interval (normal-approximated)
$$\text{CI}_{95\%} = \bar X \pm 1.96 \cdot \frac{s}{\sqrt{n}}$$

โœ๏ธ Worked example: CI for MRR

Over 100 test queries, $\overline{\text{MRR}} = 0.708$ with $s = 0.28$. The 95% CI is:

1
Standard error: $\text{SE} = 0.28/\sqrt{100} = 0.028$
2
Half-width: $1.96 \cdot 0.028 = 0.055$
3
$\text{CI}_{95\%} = [0.653,\; 0.763]$
95% CI =[0.653 ; 0.763]
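
The same calculation as a small helper, using only the standard library. The function name `normal_ci` is hypothetical; the default `z = 1.96` corresponds to the 95% level used above.

```python
import math

def normal_ci(mean: float, std: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval: mean ± z * std / sqrt(n)."""
    half_width = z * std / math.sqrt(n)
    return mean - half_width, mean + half_width

lo, hi = normal_ci(mean=0.708, std=0.28, n=100)
print(f"[{lo:.3f}, {hi:.3f}]")  # prints [0.653, 0.763]
```

Because the half-width shrinks with $\sqrt{n}$, quadrupling the number of test queries halves the interval width.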

5.2 Bootstrap resampling

For non-normal metrics (e.g., NDCG, BLEU), the non-parametric bootstrap procedure (Efron, 1979) is preferable. From the test set, $B$ (typically 1000) random samples are drawn with replacement; quantiles are then read directly from the distribution of the $B$ metric values.

Fig. 6. Bootstrap distribution of NDCG@10 over 1000 resamples. The 95% CI is read off as the 2.5% and 97.5% quantiles (dashed lines).

5.3 Inter-annotator agreement (Cohen's $\kappa$)

When relevance is annotated by humans, the reliability of the annotation itself must be checked. Cohen's kappa corrects observed agreement by the agreement expected by chance:

Cohen's Kappa
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
| $\kappa$ value | Interpretation (Landis & Koch, 1977) |
|---|---|
| < 0.00 | Worse than chance |
| 0.01 – 0.20 | Slight agreement |
| 0.21 – 0.40 | Fair agreement |
| 0.41 – 0.60 | Moderate agreement |
| 0.61 – 0.80 | Substantial agreement |
| 0.81 – 1.00 | Almost perfect |
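
Cohen's kappa for two annotators can be computed directly from their label lists. The sketch below is a plain-Python illustration; the function name `cohens_kappa` and the ten relevance judgments are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement p_o: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement p_e: product of the marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout (convention)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical relevance judgments (1 = relevant, 0 = not relevant) for 10 documents.
annotator_a = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
annotator_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

In this example the annotators agree on 8 of 10 items, but because half of that agreement is expected by chance, kappa is clearly lower than the raw agreement rate.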

Scientific reporting practice

It is now standard practice in conference papers (ACL, EMNLP, NeurIPS) to report every primary metric with a confidence interval plus a significance test against the baseline (e.g., paired bootstrap test). An improvement of 0.3 BLEU points without a significance statement is generally not considered publishable.
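
One common variant of the paired bootstrap test can be sketched as follows. It resamples the per-query score differences between two systems and reports the share of resamples in which the candidate does not beat the baseline. The function name, the number of resamples, and the synthetic scores are assumptions for illustration, not a prescribed procedure.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b,
                            n_boot: int = 2000, seed: int = 0) -> float:
    """One-sided paired bootstrap: share of resamples in which A does NOT beat B."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both systems must be scored on the same queries")
    rng = np.random.default_rng(seed)
    diffs = a - b                                   # per-query score differences
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_mean_diff = diffs[idx].mean(axis=1)        # mean difference per resample
    return float((boot_mean_diff <= 0).mean())

# Synthetic per-query scores (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
baseline = rng.uniform(0.4, 0.8, size=200)
improved = baseline + rng.normal(0.05, 0.05, size=200)
p_value = paired_bootstrap_pvalue(improved, baseline)
print(f"p = {p_value:.4f}")
```

Resampling the differences rather than each system separately keeps the pairing by query intact, which typically gives a much tighter test when both systems find the same queries easy or hard.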

06

Interactive Metric Comparison

BLEU, ROUGE, and BERTScore react in qualitatively different ways to different classes of text variation. The values are based on typical empirical magnitudes from Sai et al. (2022). The baseline case is shown here:

Reference: The motor reaches 3000 revolutions per minute.
Hypothesis: The motor reaches 3000 revolutions per minute.

BLEU-2 = 1.00 · ROUGE-L = 1.00 · BERTScore = 1.00

Identical hypothesis and reference → all three metrics return their maximum value.

โš ๏ธ The contradiction case

Pay particular attention to the last variant: when the hypothesis nearly fully repeats the reference lexically but states the opposite semantically (e.g., "does NOT reach 3000โ€ฆ"), BLEU stays artificially high. This is a well-known failure mode of all n-gram metrics (Freitag et al., 2022) and one of the central motivations for developing NLI-based metrics such as BLEURT.

07

Decision Guide

The choice of the "right" metric depends on application context. The table below offers pragmatic recommendations for typical engineering scenarios:

| Use case | Primary metric | Secondary metric | Reasoning |
|---|---|---|---|
| Safety data sheets (RAG) | Faithfulness | Context recall | Hallucinations are safety-critical |
| Standards / regulation search | Context recall | NDCG@10 | Completeness over precision |
| FAQ bot / search UI | MRR | Precision@1 | First result dominates UX |
| Machine translation | BLEU + chrF | BERTScore | Established standard metrics |
| Summarization | ROUGE-L | BERTScore | Recall orientation essential |
| Code generation | Pass@K (functional test) | BLEU | Correctness is binary |
| Open-domain QA | RAGAS profile (all 4) | Human eval | Multidimensional quality |

7.1 Anti-patterns: common evaluation mistakes

What not to do

  • One-dimensional optimization: looking at only one metric ignores the multidimensional nature of quality.
  • Metric hacking (Goodhart's Law): directly optimizing on a metric often leads to quality loss on other axes.
  • Neglecting human evaluation: automatic metrics are proxies; final validation requires sample-based human evaluation.
  • Forgetting the retrieval component: in RAG, poor end-to-end quality is often wrongly attributed to the generator when in fact retrieval is failing.
  • Missing statistical significance: point estimates without confidence intervals are not informative (see Ch. 5).
  • Test set contamination: when training and test data overlap, all metrics become unreliable.
08

Python Implementation

The reference implementations below show the core logic. For production systems, use of established libraries is recommended (RAGAS, evaluate, bert_score, scikit-learn).

8.1 Custom implementation of retrieval metrics

import numpy as np
from typing import Sequence

def precision_at_k(retrieved: Sequence[str], relevant: set, k: int) -> float:
    """Fraction of top-K hits that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    # Divide by k (not len(top_k)), matching the definition of Precision@K.
    return sum(d in relevant for d in top_k) / k

def recall_at_k(retrieved: Sequence[str], relevant: set, k: int) -> float:
    """Fraction of all relevant items that appear in the top-K."""
    if not relevant:
        return 0.0  # no relevant documents: recall is undefined, 0.0 by convention
    top_k = retrieved[:k]
    return sum(d in relevant for d in top_k) / len(relevant)

def reciprocal_rank(retrieved: Sequence[str], relevant: set) -> float:
    """1/rank of the first relevant hit (0 if none)."""
    for i, doc in enumerate(retrieved, 1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(all_retrieved, all_relevant) -> float:
    return float(np.mean([
        reciprocal_rank(r, rel) for r, rel in zip(all_retrieved, all_relevant)
    ]))

def ndcg_at_k(relevance_grades: Sequence[float], k: int) -> float:
    """NDCG with graded relevance."""
    def dcg(scores):
        return sum((2**s - 1) / np.log2(i + 2) for i, s in enumerate(scores[:k]))
    actual = dcg(relevance_grades)
    ideal  = dcg(sorted(relevance_grades, reverse=True))
    return actual / ideal if ideal > 0 else 0.0

8.2 Bootstrap confidence interval

import numpy as np

def bootstrap_ci(scores, n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 42) -> tuple:
    """Non-parametric (1 - alpha) CI of the mean via bootstrap resampling.

    Returns (mean, lower bound, upper bound); alpha=0.05 gives a 95% CI.
    """
    scores = np.asarray(scores, dtype=float)  # accept lists as well as arrays
    rng = np.random.default_rng(seed)
    n = len(scores)
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # draw n indices with replacement
        boot_means[b] = scores[idx].mean()
    lo = np.quantile(boot_means, alpha / 2)
    hi = np.quantile(boot_means, 1 - alpha / 2)
    return float(scores.mean()), float(lo), float(hi)

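8.3 RAGAS, BLEU, ROUGE, BERTScore (libraries)

# Assumes `eval_ds` (evaluation dataset), `hyps` (list of generated strings),
# and `refs` (list of reference strings) are already defined. Import paths and
# metric names can differ between library versions.

# RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

results = evaluate(
    dataset=eval_ds,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)

# BLEU / ROUGE (Hugging Face `evaluate` package)
from evaluate import load
bleu  = load("bleu")
rouge = load("rouge")
# BLEU expects a list of reference lists (one list per hypothesis).
bleu_score  = bleu.compute(predictions=hyps, references=[[r] for r in refs])
rouge_score = rouge.compute(predictions=hyps, references=refs)

# BERTScore (English); returns per-sentence precision, recall, and F1 tensors
from bert_score import score
P, R, F1 = score(hyps, refs, lang="en", model_type="roberta-large")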
09

Self-Test

Six multiple-choice questions for self-assessment.

10

Exercises

๐Ÿ“ Exercise 1: Precision & Recall

10 documents are retrieved; those at ranks 1, 3, 5, 8, 10 are relevant. The corpus contains 15 relevant documents in total.

Show solution

Precision@5 = $3/5 = 0.60$  |  Precision@10 = $5/10 = 0.50$

Recall@5 = $3/15 = 0.20$  |  Recall@10 = $5/15 \approx 0.333$

F1@10 = $\frac{2 \cdot 0.5 \cdot 0.333}{0.5 + 0.333} \approx 0.40$

๐Ÿ“ Exercise 2: MRR over 5 queries

First relevant ranks: $q_1 \to 1$, $q_2 \to 4$, $q_3 \to 2$, $q_4 \to 1$, $q_5 \to 3$

Show solution

$\text{MRR} = \dfrac{1 + 0.25 + 0.5 + 1 + 0.333}{5} = \dfrac{3.083}{5} \approx 0.617$

๐Ÿ“ Exercise 3: Faithfulness

Context: "Transformer: 230 V โ†’ 24 V, ratio 9.6 : 1, power 500 W". The answer additionally mentions "95% efficiency" (not present in the context).

Show solution

3 supported claims + 1 hallucination โ†’ $F = 3/4 = 0.75$

๐Ÿ“ Exercise 4: NDCG with graded relevance

Top-5 relevance grades: $(2, 3, 0, 1, 2)$. Compute NDCG@5.

Show solution

DCG = $3 + \frac{7}{1.585} + 0 + \frac{1}{2.322} + \frac{3}{2.585} = 3 + 4.416 + 0.431 + 1.160 = 9.007$

Ideal ordering $(3,2,2,1,0)$: IDCG = $7 + \frac{3}{1.585} + \frac{3}{2.000} + \frac{1}{2.322} + 0 = 7 + 1.893 + 1.500 + 0.431 = 10.824$

$\text{NDCG@}5 = 9.007 / 10.824 \approx 0.832$

๐Ÿ“ Exercise 5: Confidence interval

On a test set of $n=50$ queries the mean BERTScore is $\bar X = 0.82$ with $s = 0.11$. Compute the 95% CI.

Show solution

$\text{SE} = 0.11/\sqrt{50} \approx 0.0156$

Half-width: $1.96 \cdot 0.0156 \approx 0.0305$

$\text{CI}_{95\%} = [0.789,\; 0.851]$

11

Quick Reference & Bibliography

11.1 Metric overview

| Metric | Formula (short) | Range | Main use |
|---|---|---|---|
| Precision@K | $\lvert\mathcal{R}_K \cap \mathcal{V}\rvert / K$ | $[0,1]$ | Search |
| Recall@K | $\lvert\mathcal{R}_K \cap \mathcal{V}\rvert / \lvert\mathcal{V}\rvert$ | $[0,1]$ | Completeness |
| MRR | $\frac{1}{\lvert Q\rvert}\sum 1/\text{rank}_i$ | $[0,1]$ | First relevant hit |
| NDCG@K | DCG / IDCG | $[0,1]$ | Graded relevance |
| Faithfulness | $\lvert\text{supported}\rvert/\lvert\text{total}\rvert$ | $[0,1]$ | Hallucination detection |
| BLEU | $\text{BP}\cdot\exp(\sum w_n \log p_n)$ | $[0,1]$ | Translation |
| ROUGE-N/L | n-gram / LCS recall | $[0,1]$ | Summarization |
| BERTScore | $F_1$ of embedding cosines | $[0,1]$ | Semantic evaluation |

Bibliography

  1. Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAS: Automated Evaluation of Retrieval Augmented Generation. EACL 2024 Demonstrations, pp. 150–158.
  2. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
  3. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002, pp. 311–318.
  4. Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. ACL Workshop on Text Summarization, pp. 74–81.
  5. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
  6. Järvelin, K., & Kekäläinen, J. (2002). Cumulated Gain-Based Evaluation of IR Techniques. ACM TOIS, 20(4), pp. 422–446.
  7. Sai, A. B., Mohankumar, A. K., & Khapra, M. M. (2022). A Survey of Evaluation Metrics Used for NLG Systems. ACM Computing Surveys, 55(2).
  8. Reiter, E. (2018). A Structured Review of the Validity of BLEU. Computational Linguistics, 44(3), pp. 393–401.
  9. Ovadia, O., et al. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024.
  10. Freitag, M., et al. (2022). Results of the WMT22 Metrics Shared Task. WMT 2022.
  11. Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1), pp. 1–26.
  12. Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), pp. 159–174.
  13. Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning Robust Metrics for Text Generation. ACL 2020.