Eta-Squared (η²)

An interactive explainer — effect size for categorical → numerical associations

What is Eta-Squared?

Eta-squared (η²) measures how much of the variance in a numerical variable is explained by a categorical variable. It is one of the most widely used effect-size measures in statistics and machine learning feature analysis.

The core question

"If I know which group a sample belongs to, how much better can I predict its numerical value?"

  • η² = 0 → knowing the group tells us nothing new
  • η² = 1 → knowing the group tells us everything
  • In between → category partially explains the variation

Why it matters in ML

  • Feature selection — score categorical features against numerical targets
  • EDA — quickly spot meaningful group differences before modelling
  • ANOVA companion — complements the F-test with a scale-free effect size
  • Communication — a single number in [0, 1] that stakeholders can grasp

The key formula at a glance

Eta-squared as a ratio of sums of squares:

η² = SSbetween / SStotal = (variance explained by the category) / (total variance in the data)

🎯 What you'll be able to do after this app

  • Derive η² from the ANOVA variance decomposition
  • Compute it by hand from a small data table
  • Feel why tight clusters with far-apart means give high η²
  • Interpret η² using Cohen's benchmarks (0.01 / 0.06 / 0.14)
  • Know the limitations — bias, non-linearity, sample size

Theoretical Background

η² lives inside the ANOVA (Analysis of Variance) framework. Everything flows from one elegant identity: the total variation in the data splits cleanly into a between-group piece and a within-group piece.

Setup & notation

Suppose we have a numerical response y and a categorical predictor with k levels (groups). For group i:

  • yij — the j-th observation in group i
  • ni — number of observations in group i
  • ȳi — mean of group i (group mean)
  • ȳ — mean of all observations (grand mean)
  • N = Σ ni — total sample size

The variance decomposition identity

Every single observation's deviation from the grand mean can be split into two parts:

Deviation decomposition for a single point:

(yij − ȳ)  =  (ȳi − ȳ)  +  (yij − ȳi)

total deviation  =  between-group part  +  within-group part

Squaring both sides and summing over all points, the cross-term vanishes (because the within-group deviations yij − ȳi sum to zero inside each group), and we get the sum-of-squares identity:

The fundamental ANOVA identity:

SStotal  =  SSbetween  +  SSwithin
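To see why the cross-term drops out, it may help to write the squared decomposition explicitly. The derivation below is a sketch in LaTeX that reuses the notation from the setup above; it is one way to organise the algebra, not the only one.

```latex
\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2
  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}
     \bigl[(\bar{y}_i - \bar{y}) + (y_{ij} - \bar{y}_i)\bigr]^2 \\
  &= \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2
   + \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2
   + 2 \sum_{i=1}^{k} (\bar{y}_i - \bar{y})
     \underbrace{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)}_{=\,0} \\
  &= SS_{\text{between}} + SS_{\text{within}}
\end{aligned}
```

The inner sum is zero because ȳi is the mean of group i, so the deviations of the group's points from ȳi cancel exactly.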

Each sum of squares, explicitly

Total SS

SST = Σi,j (yij − ȳ)²

Total variation of all points around the grand mean.

Between SS

SSB = Σi ni (ȳi − ȳ)²

Variation of the group means around the grand mean — "explained" variance.

Within SS

SSW = Σi,j (yij − ȳi)²

Variation of points around their own group mean — "residual" / unexplained variance.
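As a quick numerical check, the three sums of squares can be computed directly from their definitions. The sketch below uses plain Python and a small made-up dataset; the group labels, the values, and the helper name `sums_of_squares` are illustrative assumptions, not part of this explainer's data.

```python
def sums_of_squares(groups):
    """Return (SST, SSB, SSW) for a dict mapping group label -> list of values."""
    all_values = [y for values in groups.values() for y in values]
    grand_mean = sum(all_values) / len(all_values)

    # Total SS: every point against the grand mean.
    sst = sum((y - grand_mean) ** 2 for y in all_values)

    ssb = 0.0
    ssw = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Between SS: group mean against grand mean, weighted by group size.
        ssb += len(values) * (group_mean - grand_mean) ** 2
        # Within SS: every point against its own group mean.
        ssw += sum((y - group_mean) ** 2 for y in values)
    return sst, ssb, ssw


# Made-up illustrative data: three groups of unequal size.
toy = {"a": [1.0, 2.0, 3.0], "b": [4.0, 6.0], "c": [7.0, 8.0, 9.0, 12.0]}
sst, ssb, ssw = sums_of_squares(toy)

print(f"SST = {sst:.4f}, SSB + SSW = {ssb + ssw:.4f}")
print(f"eta squared = {ssb / sst:.4f}")
```

Up to floating-point rounding, the two printed sums should agree, which is the identity SStotal = SSbetween + SSwithin at work on unequal group sizes.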

Eta-squared definition

Definition:

η²  =  SSbetween / SStotal  =  1 − SSwithin / SStotal

Key properties

  • Bounded: 0 ≤ η² ≤ 1
  • Scale-free: invariant to linear rescaling of y
  • Regression interpretation: η² equals the R² of a linear regression of y on the one-hot-encoded category (the one-way ANOVA model)
  • Link to F-statistic:  F = [η² / (k − 1)] / [(1 − η²) / (N − k)]
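The link to the F-statistic can be checked numerically. The sketch below compares F computed from η² with the classical ratio of mean squares; the function names and the input numbers (SSB = 600, SSW = 150, k = 3, N = 9) are assumptions chosen for illustration.

```python
def f_from_eta_squared(eta_sq, k, n_total):
    """F-statistic implied by eta squared for k groups and n_total observations."""
    return (eta_sq / (k - 1)) / ((1 - eta_sq) / (n_total - k))


def f_from_sums_of_squares(ssb, ssw, k, n_total):
    """Classical one-way ANOVA F: mean square between over mean square within."""
    msb = ssb / (k - 1)        # between-group mean square
    msw = ssw / (n_total - k)  # within-group mean square
    return msb / msw


# Illustrative inputs (assumed for this sketch).
ssb, ssw, k, n_total = 600.0, 150.0, 3, 9
eta_sq = ssb / (ssb + ssw)

f_a = f_from_eta_squared(eta_sq, k, n_total)
f_b = f_from_sums_of_squares(ssb, ssw, k, n_total)
print(f"F from eta squared:    {f_a:.4f}")
print(f"F from sums of squares: {f_b:.4f}")
```

Both routes should give the same F (about 12 for these inputs), since dividing the numerator and denominator of MSB / MSW by SStotal turns the sums of squares into η² and 1 − η².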

⚠ Important caveats

  • Positive bias in small samples — η² overestimates the true population effect; ω² or ηadj² correct for this.
  • Mean-only view of "association": η² only captures differences in group means, not differences in variances or distribution shapes.
  • Assumptions — classical ANOVA interpretation assumes independence, roughly equal group variances (homoscedasticity), and approximately normal residuals.
  • Dependent on the number of groups: with more groups, the group means can track sample noise more closely, which can inflate η², especially when groups are small.
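To make the small-sample bias concrete, the sketch below computes two common corrections next to plain η²: ω² and an adjusted η² built like adjusted R² (this quantity is also reported as ε² in some texts). The function names and the input numbers are assumptions for illustration; the formulas are the usual one-way ANOVA ones.

```python
def omega_squared(ssb, ssw, k, n_total):
    """Omega squared: a less biased estimate of the population effect size."""
    msw = ssw / (n_total - k)  # within-group mean square
    sst = ssb + ssw
    return (ssb - (k - 1) * msw) / (sst + msw)


def adjusted_eta_squared(ssb, ssw, k, n_total):
    """Adjusted eta squared, built the same way as adjusted R squared."""
    eta_sq = ssb / (ssb + ssw)
    return 1 - (1 - eta_sq) * (n_total - 1) / (n_total - k)


# Illustrative inputs (assumed): a small sample with 3 groups and 9 points.
ssb, ssw, k, n_total = 600.0, 150.0, 3, 9

print(f"eta squared          = {ssb / (ssb + ssw):.4f}")
print(f"omega squared        = {omega_squared(ssb, ssw, k, n_total):.4f}")
print(f"adjusted eta squared = {adjusted_eta_squared(ssb, ssw, k, n_total):.4f}")
```

Both corrections come out below the raw η² for these inputs, and the gap shrinks as N grows relative to k. Either correction can turn slightly negative when the true effect is near zero; in practice such values are often reported as 0.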

Visual Intuition — Three Scenarios

The same three groups, the same number of points, the same overall range — but the arrangement changes. Watch where the variance "lives": inside each group, or between the groups?

Legend: Group A, Group B, Group C; grand mean (dashed line); group means (solid lines).

Reading the plots

Look at three distances for each point:

  • Distance from the grand mean (dashed line) → contributes to SStotal
  • Distance from its group mean (solid colored bar) → contributes to SSwithin
  • Distance from group mean to grand mean → contributes to SSbetween

η² is just the proportion of the total squared distance that is "captured" by the group-mean shifts.

Interactive Calculator

Drag the sliders to reposition the three groups and adjust the within-group spread. Watch η², the sums of squares, and the variance partition update in real time.

Example readout

  • Eta-squared: 0.625 (large effect)
  • Variance partition (SSbetween / SStotal = η²): between 62.5%, within 37.5%
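A verbal label such as "large effect" can be derived from Cohen's benchmarks (0.01 / 0.06 / 0.14). The sketch below is one plausible mapping, not necessarily the rule this app uses: it treats each benchmark as an inclusive lower bound, and the "negligible" label for values under 0.01 is an assumption added here.

```python
def cohen_label(eta_sq):
    """Map eta squared to a verbal label using Cohen's benchmarks
    (0.01 small, 0.06 medium, 0.14 large), treated as lower bounds."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label for values below the smallest benchmark


for value in (0.005, 0.03, 0.10, 0.625):
    print(f"eta squared = {value:.3f} -> {cohen_label(value)}")
```

These cut-offs are conventions rather than laws; what counts as a meaningful effect depends on the domain.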

Things to try

  • Set all three group means equal → watch η² drop toward 0.
  • Spread the means apart while keeping σ small → η² climbs toward 1.
  • Keep means fixed and increase σ → η² falls (within-group noise drowns the signal).
  • Increase n with fixed means & σ → η² is roughly stable (it is a proportion, not a sum).
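The same experiments can be sketched offline. The code below is a deterministic stand-in for the sliders, not the app's own logic: instead of sampling, each group is its mean plus the fixed offsets −σ, 0, +σ, so `sigma` here is an offset scale rather than an exact standard deviation. The helper names `eta_squared` and `make_groups` are hypothetical.

```python
def eta_squared(groups):
    """Eta squared for a list of groups, each a list of numbers."""
    all_values = [y for g in groups for y in g]
    grand_mean = sum(all_values) / len(all_values)
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Convention assumed here: report 0 when the data has no variation at all.
    return ssb / sst if sst > 0 else 0.0


def make_groups(means, sigma, reps=1):
    """Each group is its mean plus the offsets -sigma, 0, +sigma, repeated reps times."""
    offsets = [-sigma, 0.0, sigma] * reps
    return [[m + d for d in offsets] for m in means]


scenarios = {
    "equal means":          make_groups([50, 50, 50], sigma=5),
    "tight clusters":       make_groups([40, 50, 60], sigma=1),
    "moderate spread":      make_groups([40, 50, 60], sigma=5),
    "noisy groups":         make_groups([40, 50, 60], sigma=10),
    "ten times the points": make_groups([40, 50, 60], sigma=5, reps=10),
}
for name, groups in scenarios.items():
    print(f"{name:22s} eta squared = {eta_squared(groups):.3f}")
```

In this construction η² is exactly unchanged when the points are replicated, because SSbetween and SStotal grow by the same factor; with random samples it would only be approximately stable.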

Worked Example — η² between two columns

Imagine a DataFrame with two columns: one categorical (department) and one numerical (salary_k). We want to quantify how strongly the department explains the salary variation — a typical feature-analysis task. Let's compute η² step by step.

The dataset — two columns, 9 rows

index  department (categorical)  salary_k (numerical)
0      Sales                     35
1      Sales                     40
2      Sales                     45
3      Engineering               55
4      Engineering               60
5      Engineering               65
6      HR                        45
7      HR                        50
8      HR                        55

N = 9 rows, k = 3 distinct categories

Our goal

Compute η²(department, salary_k) — the fraction of the salary variance explained by department membership.

Plan:

  • Split the numerical column by the categorical column
  • Get each group's size ni and mean ȳi
  • Get the grand mean ȳ (over all 9 rows)
  • Compute the between-group variation B
  • Compute the total variation T
  • η² = B / T
Step 1: Group the numerical column by the category

Conceptually: df.groupby('department')['salary_k']

department   values of salary_k  size ni  group mean ȳi
Sales        35, 40, 45          3        (35+40+45)/3 = 40
Engineering  55, 60, 65          3        (55+60+65)/3 = 60
HR           45, 50, 55          3        (45+50+55)/3 = 50
Step 2: Compute the grand mean ȳ

Average of all 9 values in the numerical column, ignoring the category.

ȳ = (35 + 40 + 45 + 55 + 60 + 65 + 45 + 50 + 55) / 9
  = 450 / 9 = 50.0

Equivalent check with group means: ȳ = (3·40 + 3·60 + 3·50) / 9 = 450/9 = 50 ✓

Step 3: Compute the between-group variation B

How much does each group's mean deviate from the overall mean? Square those deviations and weight by group size.

B = Σi ni (ȳi − ȳ)²

department   ȳi  ȳi − ȳ          (ȳi − ȳ)²  ni·(ȳi − ȳ)²
Sales        40  40 − 50 = −10   100        3·100 = 300
Engineering  60  60 − 50 = +10   100        3·100 = 300
HR           50  50 − 50 = 0     0          3·0 = 0
Σ                                           600

B = 300 + 300 + 0 = 600
Step 4: Compute the total variation T

How much does each single value deviate from the grand mean? Square and sum over all 9 rows — the category is ignored here.

T = Σ (y − ȳ)²

salary_k (y)  y − ȳ  (y − ȳ)²
35            −15    225
40            −10    100
45            −5     25
55            +5     25
60            +10    100
65            +15    225
45            −5     25
50            0      0
55            +5     25
Σ                    750

T = 750
Step 5: Compute η²

η² is simply the ratio — the fraction of the total variation that is captured by the group-mean shifts.

η² = B / T = (between-group variation) / (total variation) = 600 / 750 = 0.80
η²(department, salary_k) = 0.80   →   80% of the salary variation is captured by department
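As a cross-check, the within-group variation W (not needed for the ratio itself) can be computed from the same nine rows. The plain-Python sketch below confirms that B + W = T and that 1 − W / T gives the same η²; the variable names are illustrative.

```python
salaries = {
    "Sales":       [35, 40, 45],
    "Engineering": [55, 60, 65],
    "HR":          [45, 50, 55],
}

# Within-group variation W: squared deviations from each row's own group mean.
within = 0.0
for values in salaries.values():
    group_mean = sum(values) / len(values)
    within += sum((y - group_mean) ** 2 for y in values)

between, total = 600.0, 750.0  # B and T as computed above

print(within)                        # 150.0
print(between + within == total)     # True
print(f"{1 - within / total:.2f}")   # 0.80
```

Each department contributes 25 + 0 + 25 = 50 to W, so the remaining 20% of the salary variation sits inside the departments.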

The same computation in Python

import pandas as pd

df = pd.DataFrame({
    'department': ['Sales']*3 + ['Engineering']*3 + ['HR']*3,
    'salary_k':   [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

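y          = df['salary_k']
grand_mean = y.mean()                                   # 50.0
# transform('mean') returns one value per row: the mean of that row's group
group_mean = df.groupby('department')['salary_k'].transform('mean')

# Summing over rows weights each group's squared deviation by its size ni
between = ((group_mean - grand_mean) ** 2).sum()        # 600.0
total   = ((y          - grand_mean) ** 2).sum()        # 750.0

eta_squared = between / total                           # 0.80
print(f"eta squared = {eta_squared:.2f}")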