Eta-Squared (η²) — Interactive Explainer

What is Eta-Squared?

Eta-squared (η²) measures how much of the variance in a numerical variable is explained by a categorical variable. It is a standard effect-size measure for ANOVA and is also used to score categorical features in machine learning.

The core question

"If I know which group a sample belongs to, how much better can I predict its numerical value?"

η² = 0 → knowing the group tells us nothing new
η² = 1 → knowing the group tells us everything
In between → category partially explains the variation

Why it matters in ML

Feature selection — score categorical features against numerical targets
EDA — quickly spot meaningful group differences before modelling
ANOVA companion — complements the F-test with a scale-free effect size
Communication — a single number in [0, 1] that stakeholders can grasp

The key formula at a glance

Eta-squared as a ratio of sums of squares:

η² = SS_between / SS_total = (variance explained by the category) / (total variance in the data)

🎯 What you'll be able to do after this app

Derive η² from the ANOVA variance decomposition
Compute it by hand from a small data table
Feel why tight clusters with far-apart means give high η²
Interpret η² using Cohen's benchmarks (0.01 / 0.06 / 0.14)
Know the limitations — bias, non-linearity, sample size
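
The benchmark thresholds above can be turned into a small lookup. This is a minimal sketch, assuming the usual reading of the three cut-offs as small / medium / large; the function name `cohen_label` and the "negligible" label for values below 0.01 are illustrative choices, not part of the benchmarks themselves.

```python
def cohen_label(eta_squared: float) -> str:
    """Map an eta-squared value to a rough verbal label (rule of thumb only)."""
    if not 0.0 <= eta_squared <= 1.0:
        raise ValueError("eta-squared must lie in [0, 1]")
    if eta_squared >= 0.14:
        return "large"
    if eta_squared >= 0.06:
        return "medium"
    if eta_squared >= 0.01:
        return "small"
    return "negligible"  # assumed label for values under the smallest benchmark


print(cohen_label(0.625))  # large
```

The labels are conventions rather than laws, so they are best read alongside the raw value and the context of the data.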

Theoretical Background

η² lives inside the ANOVA (Analysis of Variance) framework. Everything flows from one elegant identity: the total variation in the data splits cleanly into a between-group piece and a within-group piece.

Setup & notation

Suppose we have a numerical response y and a categorical predictor with k levels (groups). For group i:

y_ij: the j-th observation in group i
n_i: number of observations in group i
ȳ_i: mean of group i (group mean)
ȳ: mean of all observations (grand mean)
N = Σ_i n_i: total sample size

The variance decomposition identity

Every single observation's deviation from the grand mean can be split into two parts:

Deviation decomposition for a single point:

(y_ij − ȳ) = (ȳ_i − ȳ) + (y_ij − ȳ_i)

total deviation = between-group part + within-group part

Squaring and summing over all points, the cross-term vanishes (because the within-group deviations y_ij − ȳ_i sum to zero inside each group), and we get the sum-of-squares identity:

The fundamental ANOVA identity:

SS_total = SS_between + SS_within
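
The step where the cross-term vanishes can be written out explicitly. The following is a short derivation in the notation of this section; the inner sum is zero because the values in group i add up to n_i times the group mean.

```latex
\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij}-\bar{y})^2
 &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}\bigl[(\bar{y}_i-\bar{y}) + (y_{ij}-\bar{y}_i)\bigr]^2 \\
 &= \sum_{i=1}^{k} n_i(\bar{y}_i-\bar{y})^2
  + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2
  + 2\sum_{i=1}^{k}(\bar{y}_i-\bar{y})
    \underbrace{\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)}_{=\,0} \\
 &= SS_{\text{between}} + SS_{\text{within}}
\end{aligned}
```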

Each sum of squares, explicitly

Total SS

SST = Σ_i Σ_j (y_ij − ȳ)²

Total variation of all points around the grand mean.

Between SS

SSB = Σ_i n_i (ȳ_i − ȳ)²

Variation of the group means around the grand mean — "explained" variance.

Within SS

SSW = Σ_i Σ_j (y_ij − ȳ_i)²

Variation of points around their own group mean — "residual" / unexplained variance.
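
The three sums of squares can be computed directly from their definitions. This is a minimal sketch in plain Python; the function name `sums_of_squares` and the small unequal-sized dataset are illustrative assumptions, not part of the original text.

```python
from statistics import fmean


def sums_of_squares(groups):
    """Return (SST, SSB, SSW) for a dict mapping group label -> list of values."""
    all_values = [y for values in groups.values() for y in values]
    grand_mean = fmean(all_values)

    # Total: every point against the grand mean.
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    # Between: each group mean against the grand mean, weighted by group size.
    ssb = sum(len(v) * (fmean(v) - grand_mean) ** 2 for v in groups.values())
    # Within: every point against its own group mean.
    ssw = sum((y - fmean(v)) ** 2 for v in groups.values() for y in v)
    return sst, ssb, ssw


# Illustrative data (assumed): three groups of unequal size.
groups = {"A": [2.0, 4.0, 6.0], "B": [7.0, 9.0], "C": [10.0, 12.0, 14.0, 16.0]}
sst, ssb, ssw = sums_of_squares(groups)
print(sst, ssb, ssw)
print(abs(sst - (ssb + ssw)) < 1e-9)  # the ANOVA identity holds up to rounding
```

Unequal group sizes are used on purpose: the weighting by n_i in SSB is what keeps the identity exact in that case.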

Eta-squared definition

Definition:

η² = SS_between / SS_total = 1 − SS_within / SS_total

Key properties

Bounded: 0 ≤ η² ≤ 1
Scale-free: invariant to linear rescaling of y
Regression interpretation: η² equals the R² of a linear regression of y on one-hot encoded group indicators
Link to the F-statistic: F = [η² / (k − 1)] / [(1 − η²) / (N − k)]
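
The properties listed above can be checked numerically. This is a sketch using NumPy on a small made-up dataset (the data, the helper name `eta_squared` and the rescaling constants are all assumptions for illustration). It compares η² with the R² of a least-squares fit on one-hot indicators, compares F computed from mean squares with F recovered from η², and rescales y linearly.

```python
import numpy as np

# Illustrative data (assumed): 3 groups of 3 points each.
y = np.array([1, 2, 3, 4, 5, 6, 8, 9, 10], dtype=float)
labels = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3)


def eta_squared(y, labels):
    grand_mean = y.mean()
    ssb = sum(
        (labels == g).sum() * (y[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    sst = ((y - grand_mean) ** 2).sum()
    return ssb / sst


eta2 = eta_squared(y, labels)

# R^2 of a least-squares fit on one-hot group indicators (no separate intercept).
levels = np.unique(labels)
X = (labels[:, None] == levels[None, :]).astype(float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ coef
sst = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - (residual ** 2).sum() / sst

# F from mean squares versus F recovered from eta-squared.
N, k = len(y), len(levels)
ssw = (residual ** 2).sum()
ssb = sst - ssw
f_from_ms = (ssb / (k - 1)) / (ssw / (N - k))
f_from_eta = (eta2 / (k - 1)) / ((1.0 - eta2) / (N - k))

# Scale-free: a linear rescaling of y leaves eta-squared unchanged.
eta2_rescaled = eta_squared(2.5 * y + 10.0, labels)

print(np.isclose(eta2, r2), np.isclose(f_from_ms, f_from_eta), np.isclose(eta2, eta2_rescaled))
```

The fitted values of the one-hot regression are the group means, which is why its residual sum of squares coincides with SS_within.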

⚠ Important caveats

Positive bias in small samples: η² overestimates the true population effect; ω² or adjusted η² (η²_adj) correct for this.
Mean-based view of "association": η² only captures differences in group means, not differences in variances or distribution shapes.
Assumptions: the classical ANOVA interpretation assumes independence, roughly equal group variances (homoscedasticity), and approximately normal residuals.
Dependent on the number of groups: with more groups, the group means can fit more of the sample variation by chance, which can inflate η².
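
The bias corrections mentioned in the first caveat can be sketched from the sums of squares. This assumes the standard one-way formulas ω² = (SS_between − (k − 1)·MS_within) / (SS_total + MS_within) and η²_adj = 1 − (1 − η²)(N − 1)/(N − k); the function name and the input numbers are illustrative assumptions.

```python
def effect_sizes(ssb: float, ssw: float, n_total: int, k: int) -> dict:
    """Eta-squared plus two less biased alternatives for a one-way layout."""
    sst = ssb + ssw
    msw = ssw / (n_total - k)  # within-group mean square
    eta2 = ssb / sst
    omega2 = (ssb - (k - 1) * msw) / (sst + msw)
    eta2_adj = 1.0 - (1.0 - eta2) * (n_total - 1) / (n_total - k)
    return {"eta2": eta2, "omega2": omega2, "eta2_adj": eta2_adj}


# Illustrative inputs (assumed): SS_between = 600, SS_within = 150, N = 9, k = 3.
result = effect_sizes(600.0, 150.0, n_total=9, k=3)
print({name: round(value, 4) for name, value in result.items()})
```

Both corrected measures come out below η² here, and unlike η² they can dip slightly below zero when the groups barely differ.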

Visual Intuition — Three Scenarios

The same three groups, the same number of points, the same overall range — but the arrangement changes. Watch where the variance "lives": inside each group, or between the groups?

η² ≈ 0.00, no association: group means coincide with the grand mean. All variance is within groups.

η² ≈ 0.45, moderate association: means differ, but groups overlap. Variance is split between and within groups.

η² ≈ 0.95, strong association: tight clusters at very different levels. Nearly all variance is between groups.

Legend: Group A, Group B, Group C; grand mean (dashed line); group means (solid lines).

Reading the plots

Look at three distances for each point:

Distance from the grand mean (dashed line) → contributes to SS_total
Distance from its group mean (solid colored bar) → contributes to SS_within
Distance from group mean to grand mean → contributes to SS_between

η² is just the proportion of the total squared distance that is "captured" by the group-mean shifts.
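
The three regimes can be reproduced with a few lines of plain Python. This is a simplified sketch: it keeps the within-group offsets fixed and only moves the group means, so unlike the plots it does not hold the overall range constant. The specific means and offsets are assumptions chosen for illustration.

```python
def eta_squared(groups):
    values = [y for g in groups for y in g]
    grand_mean = sum(values) / len(values)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    sst = sum((y - grand_mean) ** 2 for y in values)
    return ssb / sst


offsets = [-10, -5, 0, 5, 10]  # identical within-group spread in every scenario

scenarios = {
    "no association": (50, 50, 50),
    "moderate association": (42, 50, 58),
    "strong association": (10, 50, 90),
}

for name, means in scenarios.items():
    groups = [[m + d for d in offsets] for m in means]
    print(f"{name}: eta^2 = {eta_squared(groups):.2f}")
# no association: eta^2 = 0.00
# moderate association: eta^2 = 0.46
# strong association: eta^2 = 0.96
```

SS_within is 750 in all three cases, so the change in η² comes entirely from how far the group means sit from the grand mean.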

Interactive Calculator

Drag the sliders to reposition the three groups and adjust the within-group spread. Watch η², the sums of squares, and the variance partition update in real time.

Controls

Mean of Group A (μ_A): 30
Mean of Group B (μ_B): 50
Mean of Group C (μ_C): 70
Within-group spread (σ): 8
Points per group (n): 8

Readout

Eta-squared: 0.625 (large effect)
SS_between, SS_within, SS_total: displayed live
Variance partition (SS_between / SS_total = η²): between 62.5%, within 37.5%

Things to try

Set all three group means equal → watch η² drop toward 0.
Spread the means apart while keeping σ small → η² climbs toward 1.
Keep means fixed and increase σ → η² falls (within-group noise drowns the signal).
Increase n with fixed means & σ → η² is roughly stable (it is a proportion, not a sum).
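
The experiments above can also be tried outside the app with a small simulation. This is a sketch under stated assumptions: points are drawn from Gaussian distributions with a shared σ (the app's actual generator is not documented here), and the function name, seed and sample sizes are illustrative.

```python
import random


def simulate_eta_squared(means, sigma, n, seed=0):
    """Draw n Gaussian points per group and return the sample eta-squared."""
    rng = random.Random(seed)
    groups = [[rng.gauss(mu, sigma) for _ in range(n)] for mu in means]
    values = [y for g in groups for y in g]
    grand_mean = sum(values) / len(values)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    sst = sum((y - grand_mean) ** 2 for y in values)
    return ssb / sst


base = simulate_eta_squared((30, 50, 70), sigma=8, n=200)
equal_means = simulate_eta_squared((50, 50, 50), sigma=8, n=200)
noisy = simulate_eta_squared((30, 50, 70), sigma=30, n=200)
print(f"base={base:.3f}  equal means={equal_means:.3f}  large sigma={noisy:.3f}")
```

With many points per group the sample value settles near var(means) / (var(means) + σ²), which is why changing n alone barely moves it.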

Worked Example — η² between two columns

Imagine a DataFrame with two columns: one categorical (department) and one numerical (salary_k). We want to quantify how strongly the department explains the salary variation — a typical feature-analysis task. Let's compute η² step by step.

The dataset — two columns, 9 rows

index	department (categorical)	salary_k (numerical)
0	Sales	35
1	Sales	40
2	Sales	45
3	Engineering	55
4	Engineering	60
5	Engineering	65
6	HR	45
7	HR	50
8	HR	55

N = 9 rows, k = 3 distinct categories

Our goal

Compute η²(department, salary_k) — the fraction of the salary variance explained by department membership.

Plan:

Split the numerical column by the categorical column
Get each group's size n_i and mean ȳ_i
Get the grand mean ȳ (over all 9 rows)
Compute the between-group variation B
Compute the total variation T
η² = B / T

1. Group the numerical column by the category

Conceptually: df.groupby('department')['salary_k']

department	values of salary_k	size n_i	group mean ȳ_i
Sales	35, 40, 45	3	(35+40+45)/3 = 40
Engineering	55, 60, 65	3	(55+60+65)/3 = 60
HR	45, 50, 55	3	(45+50+55)/3 = 50
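
The grouping table above can be reproduced with pandas. This is a minimal sketch that rebuilds the nine-row dataset and aggregates each department to its size and mean; the variable name `stats` is an illustrative choice, and `count` equals the group size here only because there are no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales"] * 3 + ["Engineering"] * 3 + ["HR"] * 3,
    "salary_k": [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

# One row per department: number of values and their mean.
stats = df.groupby("department")["salary_k"].agg(["count", "mean"])
print(stats)
```

Note that `groupby` sorts the group labels alphabetically by default, so the rows appear as Engineering, HR, Sales rather than in the order of the table.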

2. Compute the grand mean ȳ

Average of all 9 values in the numerical column, ignoring the category.

ȳ = (35 + 40 + 45 + 55 + 60 + 65 + 45 + 50 + 55) / 9

= 450 / 9 = 50.0

Equivalent check with group means: ȳ = (3·40 + 3·60 + 3·50) / 9 = 450/9 = 50 ✓

3. Compute the between-group variation B

How much does each group's mean deviate from the overall mean? Square those deviations and weight by group size.

B = Σ_i n_i (ȳ_i − ȳ)²

department	ȳ_i	ȳ_i − ȳ	(ȳ_i − ȳ)²	n_i·(ȳ_i − ȳ)²
Sales	40	40 − 50 = −10	100	3·100 = 300
Engineering	60	60 − 50 = +10	100	3·100 = 300
HR	50	50 − 50 = 0	0	3·0 = 0
Σ =				600

B = 300 + 300 + 0 = 600

4. Compute the total variation T

How much does each single value deviate from the grand mean? Square and sum over all 9 rows — the category is ignored here.

T = Σ (y − ȳ)²

salary_k (y)	y − ȳ	(y − ȳ)²
35	−15	225
40	−10	100
45	−5	25
55	+5	25
60	+10	100
65	+15	225
45	−5	25
50	0	0
55	+5	25
Σ =		750

T = 750

5. Compute η²

η² is simply the ratio — the fraction of the total variation that is captured by the group-mean shifts.

η² = B / T = (between-group variation) / (total variation) = 600 / 750 = 0.80

η²(department, salary_k) = 0.80 → 80% of the salary variation is captured by department
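
As a consistency check, the within-group variation can be computed directly and compared with T − B, and the F-statistic can be recovered from η² using the link given under "Key properties". This is a sketch in plain Python; the variable names are illustrative.

```python
groups = {
    "Sales": [35, 40, 45],
    "Engineering": [55, 60, 65],
    "HR": [45, 50, 55],
}

# Within-group variation: squared deviations from each group's own mean.
ssw = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    ssw += sum((y - group_mean) ** 2 for y in values)

B, T = 600.0, 750.0  # from steps 3 and 4
N, k = 9, 3
eta2 = B / T
f_stat = (eta2 / (k - 1)) / ((1 - eta2) / (N - k))

print(ssw)               # 150.0, which equals T - B
print(round(f_stat, 6))  # 12.0
```

Each department contributes 25 + 0 + 25 = 50 to the within-group sum, so the remaining 20% of the variation is spread evenly across the three groups.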

The same computation in Python

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales']*3 + ['Engineering']*3 + ['HR']*3,
    'salary_k':   [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

y          = df['salary_k']
grand_mean = y.mean()                                   # 50.0
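# transform('mean') returns one value per row (that row's group mean),
# so summing over rows weights each group by its size n_i.
group_mean = df.groupby('department')['salary_k'].transform('mean')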

between = ((group_mean - grand_mean) ** 2).sum()        # 600.0
total   = ((y          - grand_mean) ** 2).sum()        # 750.0

eta_squared = between / total                           # 0.80
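
For feature analysis it can be convenient to wrap the computation in a reusable function and score several categorical columns at once. This is a sketch, not a library API: the function name, the handling of missing values and constant columns, and the extra `office` column are all hypothetical and added only for illustration.

```python
import pandas as pd


def eta_squared(df: pd.DataFrame, cat_col: str, num_col: str) -> float:
    """Eta-squared of a numerical column with respect to a categorical column."""
    data = df[[cat_col, num_col]].dropna()  # assumed policy: drop incomplete rows
    y = data[num_col]
    grand_mean = y.mean()
    group_mean = data.groupby(cat_col)[num_col].transform("mean")
    total = ((y - grand_mean) ** 2).sum()
    if total == 0:
        return float("nan")  # constant column: the ratio is undefined
    return float(((group_mean - grand_mean) ** 2).sum() / total)


# Hypothetical extra column 'office', added only to show scoring several features.
df = pd.DataFrame({
    "department": ["Sales"] * 3 + ["Engineering"] * 3 + ["HR"] * 3,
    "office": ["North", "South", "North", "South", "North",
               "South", "North", "South", "North"],
    "salary_k": [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

scores = {col: eta_squared(df, col, "salary_k") for col in ["department", "office"]}
print(scores)
```

Ranking columns by these scores gives a quick first pass, keeping in mind the caveats about small samples and columns with many categories.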