What is Eta-Squared?
Eta-squared (η²) measures how much of the variance in a numerical variable is explained by a categorical variable. It is one of the most widely used effect-size measures in statistics and machine learning feature analysis.
The core question
"If I know which group a sample belongs to, how much better can I predict its numerical value?"
- η² = 0 → knowing the group tells us nothing new
- η² = 1 → knowing the group tells us everything
- In between → category partially explains the variation
Why it matters in ML
- Feature selection — score categorical features against numerical targets
- EDA — quickly spot meaningful group differences before modelling
- ANOVA companion — complements the F-test with a scale-free effect size
- Communication — a single number in [0, 1] that stakeholders can grasp
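As a rough sketch of the feature-selection use, the snippet below scores two categorical columns against a numerical target and ranks them. The helper `eta_squared`, the `office` column, and all values are hypothetical illustrations rather than part of any library, and the helper assumes the target is not constant (otherwise the total variation is zero).

```python
import pandas as pd

def eta_squared(df: pd.DataFrame, cat_col: str, num_col: str) -> float:
    """Fraction of the variation in num_col explained by cat_col (hypothetical helper)."""
    y = df[num_col]
    grand_mean = y.mean()
    # Replace every value by the mean of the group it belongs to
    group_mean = df.groupby(cat_col)[num_col].transform("mean")
    ss_between = ((group_mean - grand_mean) ** 2).sum()
    ss_total = ((y - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

# Toy data: 'office' is a made-up column with a weaker link to salary
df = pd.DataFrame({
    "department": ["Sales"] * 3 + ["Engineering"] * 3 + ["HR"] * 3,
    "office": ["North", "South", "East"] * 3,
    "salary_k": [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

# Score each categorical feature, then sort from strongest to weakest
scores = {col: eta_squared(df, col, "salary_k") for col in ["department", "office"]}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Here `department` ranks first because its group means sit further apart relative to the spread inside each group.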
The key formula at a glance
$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}$$
🎯 What you'll be able to do after this app
- Derive η² from the ANOVA variance decomposition
- Compute it by hand from a small data table
- Feel why tight clusters with far-apart means give high η²
- Interpret η² using Cohen's benchmarks (0.01 / 0.06 / 0.14)
- Know the limitations — bias, non-linearity, sample size
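The Cohen benchmarks mentioned above can be turned into a small lookup. This is only a sketch: the function name, the "negligible" label for values below 0.01, and the choice to treat each threshold as an inclusive lower bound are assumptions made for illustration.

```python
def interpret_eta_squared(eta_sq: float) -> str:
    """Map eta-squared to a rough verbal label using the 0.01 / 0.06 / 0.14 benchmarks."""
    if not 0.0 <= eta_sq <= 1.0:
        raise ValueError("eta-squared must lie in [0, 1]")
    if eta_sq < 0.01:
        return "negligible"  # assumed label; the benchmarks start at 0.01
    if eta_sq < 0.06:
        return "small"
    if eta_sq < 0.14:
        return "medium"
    return "large"

labels = [interpret_eta_squared(v) for v in (0.005, 0.03, 0.10, 0.80)]
```

The labels are conventions, not hard cut-offs, so they are best read alongside the domain context.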
Theoretical Background
η² lives inside the ANOVA (Analysis of Variance) framework. Everything flows from one elegant identity: the total variation in the data splits cleanly into a between-group piece and a within-group piece.
Setup & notation
Suppose we have a numerical response y and a categorical predictor with k levels (groups). For group i:
- yij — the j-th observation in group i
- ni — number of observations in group i
- ȳi — mean of group i (group mean)
- ȳ — mean of all observations (grand mean)
- N = Σ ni — total sample size
The variance decomposition identity
Every single observation's deviation from the grand mean can be split into two parts:
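$$\underbrace{y_{ij} - \bar{y}}_{\text{total deviation}} = \underbrace{\bar{y}_i - \bar{y}}_{\text{between-group part}} + \underbrace{y_{ij} - \bar{y}_i}_{\text{within-group part}}$$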
Squaring and summing over all points, the cross-term vanishes (because the deviations from a group's own mean sum to zero inside each group), and we get the sum-of-squares identity:
$$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}$$
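The identity can be checked numerically. The sketch below uses made-up data with unequal group sizes and computes each sum of squares directly, along with the cross-term that is claimed to vanish.

```python
import numpy as np

# Toy data: three groups of unequal size (values are made up)
groups = [
    np.array([2.0, 4.0, 6.0]),
    np.array([5.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 14.0]),
]

y_all = np.concatenate(groups)
grand_mean = y_all.mean()

# Each sum of squares computed from its own definition
ss_total = ((y_all - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Cross-term: 2 * sum_i (mean_i - grand_mean) * sum_j (y_ij - mean_i)
cross_term = 2 * sum((g.mean() - grand_mean) * (g - g.mean()).sum() for g in groups)

identity_gap = ss_total - (ss_between + ss_within)
```

Both `cross_term` and `identity_gap` come out as zero up to floating-point rounding, whatever the group sizes.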
Each sum of squares, explicitly
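Total SS
$$SS_{\text{total}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$$
Total variation of all points around the grand mean.
Between SS
$$SS_{\text{between}} = \sum_{i=1}^{k} n_i \, (\bar{y}_i - \bar{y})^2$$
Variation of the group means around the grand mean, weighted by group size: the "explained" part.
Within SS
$$SS_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$$
Variation of points around their own group mean: the "residual" or unexplained part.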
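Eta-squared definition
$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}} = 1 - \frac{SS_{\text{within}}}{SS_{\text{total}}}$$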
Key properties
- Bounded: 0 ≤ η² ≤ 1
- Scale-free: invariant to linear rescaling of y
- Regression interpretation: η² equals the R² of a linear regression of y on one-hot group indicators
- Link to the F-statistic: F = [η² / (k − 1)] / [(1 − η²) / (N − k)]
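The properties listed above can be verified on a small data set. The sketch below uses NumPy only; the helper `eta_squared` and the toy values are illustrative assumptions, and the regression is fitted without a separate intercept because the full set of one-hot columns already spans it.

```python
import numpy as np

y = np.array([35, 40, 45, 55, 60, 65, 45, 50, 55], dtype=float)
labels = np.array(["Sales"] * 3 + ["Engineering"] * 3 + ["HR"] * 3)

def eta_squared(y, labels):
    """SS_between / SS_total for a 1-D response and a 1-D label array (hypothetical helper)."""
    grand_mean = y.mean()
    ss_total = ((y - grand_mean) ** 2).sum()
    ss_between = sum(
        (labels == g).sum() * (y[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return ss_between / ss_total

eta_sq = eta_squared(y, labels)
N, k = len(y), len(np.unique(labels))

# Scale-free: a linear rescaling a*y + b (a != 0) leaves eta-squared unchanged
eta_sq_rescaled = eta_squared(1000.0 * y + 7.0, labels)

# R^2 of a least-squares regression of y on one-hot group indicators
X = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
ss_total = ((y - y.mean()) ** 2).sum()
r_squared = 1.0 - (residuals ** 2).sum() / ss_total

# F computed from eta-squared versus F computed from the mean squares
f_from_eta = (eta_sq / (k - 1)) / ((1.0 - eta_sq) / (N - k))
ss_within = (residuals ** 2).sum()
f_from_ms = ((ss_total - ss_within) / (k - 1)) / (ss_within / (N - k))
```

The fitted regression coefficients are simply the group means, which is why the residual sum of squares coincides with the within-group sum of squares.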
⚠ Important caveats
- Positive bias in small samples — η² overestimates the true population effect; ω² or ηadj² correct for this.
- Mean-based view of "association": η² only captures differences in group means, not differences in variances or distribution shapes. Two groups with equal means but very different spreads give η² = 0.
- Assumptions — classical ANOVA interpretation assumes independence, roughly equal group variances (homoscedasticity), and approximately normal residuals.
- Dependent on the number of groups — more groups ⇒ more opportunity to capture structure, can inflate η².
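To make the small-sample bias caveat concrete, the sketch below computes η² next to two common corrections from the same sums of squares. It assumes that the "adjusted η²" mentioned above corresponds to the estimator usually called epsilon-squared; the input numbers are illustrative values for a one-way layout with N = 9 observations and k = 3 groups.

```python
# Illustrative sums of squares for a one-way layout (assumed inputs)
ss_between, ss_within = 600.0, 150.0
N, k = 9, 3

ss_total = ss_between + ss_within
ms_within = ss_within / (N - k)  # within-group mean square

eta_sq = ss_between / ss_total

# Epsilon-squared (assumed here to be what "adjusted eta-squared" refers to)
eps_sq = (ss_between - (k - 1) * ms_within) / ss_total

# Omega-squared
omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

# Equivalent form of the adjustment, mirroring adjusted R^2
eps_sq_alt = 1.0 - (1.0 - eta_sq) * (N - 1) / (N - k)
```

Both corrections subtract the between-group variation expected from noise alone, so they fall below η², and the gap shrinks as N grows relative to k.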
Visual Intuition — Three Scenarios
The same three groups, the same number of points, the same overall range — but the arrangement changes. Watch where the variance "lives": inside each group, or between the groups?
The three panels, from left to right:
- All variance is within groups.
- Variance is split between and within groups.
- Nearly all variance is between groups.
Reading the plots
Look at three distances in each plot:
- Distance from a point to the grand mean (dashed line) → contributes to SS_total
- Distance from a point to its group mean (solid colored bar) → contributes to SS_within
- Distance from a group mean to the grand mean → contributes to SS_between
η² is just the proportion of the total squared distance that is "captured" by the group-mean shifts.
Interactive Calculator
Drag the sliders to reposition the three groups and adjust the within-group spread. Watch η², the sums of squares, and the variance partition update in real time.
Things to try
- Set all three group means equal → watch η² drop toward 0.
- Spread the means apart while keeping σ small → η² climbs toward 1.
- Keep means fixed and increase σ → η² falls (within-group noise drowns the signal).
- Increase n with fixed means & σ → η² is roughly stable (it is a proportion, not a sum).
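The experiments above can also be reproduced offline. The sketch below is a stand-in for the calculator, not its actual implementation: it assumes normally distributed groups with a common σ and equal group size n, and the helper name, seed, and parameter values are all hypothetical.

```python
import numpy as np

def simulate_eta_squared(means, sigma, n, seed=0):
    """Draw n normal points per group and return the sample eta-squared (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    groups = [rng.normal(loc=m, scale=sigma, size=n) for m in means]
    y_all = np.concatenate(groups)
    grand_mean = y_all.mean()
    ss_total = ((y_all - grand_mean) ** 2).sum()
    ss_between = sum(n * (g.mean() - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# 1. Equal means: eta-squared stays near 0
equal_means = simulate_eta_squared([50, 50, 50], sigma=5, n=200)
# 2. Far-apart means with small sigma: eta-squared approaches 1
far_apart = simulate_eta_squared([20, 50, 80], sigma=1, n=200)
# 3. Same means, larger sigma: eta-squared falls
baseline = simulate_eta_squared([40, 50, 60], sigma=5, n=50)
noisy = simulate_eta_squared([40, 50, 60], sigma=30, n=200)
# 4. Larger n with the same means and sigma: eta-squared is roughly stable
larger_n = simulate_eta_squared([40, 50, 60], sigma=5, n=5000)
```

Because the draws are random, the exact values depend on the seed; only the qualitative pattern is expected to hold.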
Worked Example — η² between two columns
Imagine a DataFrame with two columns: one categorical (department) and one
numerical (salary_k). We want to quantify how strongly the department
explains the salary variation — a typical feature-analysis task. Let's compute η² step by step.
The dataset — two columns, 9 rows
| index | department (categorical) | salary_k (numerical) |
|---|---|---|
| 0 | Sales | 35 |
| 1 | Sales | 40 |
| 2 | Sales | 45 |
| 3 | Engineering | 55 |
| 4 | Engineering | 60 |
| 5 | Engineering | 65 |
| 6 | HR | 45 |
| 7 | HR | 50 |
| 8 | HR | 55 |
N = 9 rows, k = 3 distinct categories
Our goal
Compute η²(department, salary_k) — the fraction of the salary variance
explained by department membership.
Plan:
- Split the numerical column by the categorical column
- Get each group's size ni and mean ȳi
- Get the grand mean ȳ (over all 9 rows)
- Compute the between-group variation B
- Compute the total variation T
- η² = B / T
Split the numerical column by the categorical column and summarise each group. Conceptually: df.groupby('department')['salary_k']
| department | values of salary_k | size ni | group mean ȳi |
|---|---|---|---|
| Sales | 35, 40, 45 | 3 | (35+40+45)/3 = 40 |
| Engineering | 55, 60, 65 | 3 | (55+60+65)/3 = 60 |
| HR | 45, 50, 55 | 3 | (45+50+55)/3 = 50 |
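The table above can be reproduced with a grouped aggregation. A minimal sketch, assuming the nine rows are loaded into a DataFrame named `df`; note that `count` ignores missing values, which makes no difference here because the column has none.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales"] * 3 + ["Engineering"] * 3 + ["HR"] * 3,
    "salary_k": [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

# One row per department: group size and group mean of salary_k
summary = df.groupby("department")["salary_k"].agg(["count", "mean"])
```

The result is indexed by department, so `summary.loc["Sales", "mean"]` gives the Sales group mean.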
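Grand mean: the average of all 9 values in the numerical column, ignoring the category.
$$\bar{y} = \frac{35 + 40 + 45 + 55 + 60 + 65 + 45 + 50 + 55}{9} = \frac{450}{9} = 50$$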
Equivalent check with group means: ȳ = (3·40 + 3·60 + 3·50) / 9 = 450/9 = 50 ✓
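The check via group means is a size-weighted average, which matters as soon as the groups are unbalanced. In the sketch below the first part uses the group means and sizes from the table; the uneven sizes in the second part are hypothetical and only illustrate why the weights are needed.

```python
import numpy as np

group_means = np.array([40.0, 60.0, 50.0])  # Sales, Engineering, HR
group_sizes = np.array([3, 3, 3])

# Grand mean as the size-weighted average of the group means
grand_mean = np.average(group_means, weights=group_sizes)

# Hypothetical unbalanced sizes: weighted and unweighted averages now differ
uneven_sizes = np.array([1, 6, 2])
weighted = np.average(group_means, weights=uneven_sizes)
unweighted = group_means.mean()
```

With balanced groups the weights cancel, which is why the plain average of 40, 60, and 50 also gives 50 in this example.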
Between-group variation B: how much does each group's mean deviate from the grand mean? Square those deviations and weight by group size.
| department | ȳi | ȳi − ȳ | (ȳi − ȳ)² | ni·(ȳi − ȳ)² |
|---|---|---|---|---|
| Sales | 40 | 40 − 50 = −10 | 100 | 3·100 = 300 |
| Engineering | 60 | 60 − 50 = +10 | 100 | 3·100 = 300 |
| HR | 50 | 50 − 50 = 0 | 0 | 3·0 = 0 |
| Σ | | | | 600 |
Total variation T: how much does each single value deviate from the grand mean? Square and sum over all 9 rows; the category is ignored here.
| salary_k (y) | y − ȳ | (y − ȳ)² |
|---|---|---|
| 35 | −15 | 225 |
| 40 | −10 | 100 |
| 45 | −5 | 25 |
| 55 | +5 | 25 |
| 60 | +10 | 100 |
| 65 | +15 | 225 |
| 45 | −5 | 25 |
| 50 | 0 | 0 |
| 55 | +5 | 25 |
| Σ | | 750 |
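η² is simply the ratio B / T: the fraction of the total variation that is captured by the group-mean shifts.
$$\eta^2 = \frac{B}{T} = \frac{600}{750} = 0.80$$
Department membership explains 80% of the salary variation in this table, far above Cohen's 0.14 benchmark for a large effect.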
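As a cross-check, the same value follows from the within-group variation, since total = between + within. The sketch below computes the within-group sum of squares (called `within` here, a name not used in the original) and recovers η² as one minus the unexplained share.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales"] * 3 + ["Engineering"] * 3 + ["HR"] * 3,
    "salary_k": [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

y = df["salary_k"]
# Mean of each row's own department, aligned with the original rows
group_mean = df.groupby("department")["salary_k"].transform("mean")

within = ((y - group_mean) ** 2).sum()   # squared deviations from own group mean
total = ((y - y.mean()) ** 2).sum()      # squared deviations from the grand mean
eta_squared_via_within = 1 - within / total
```

Each department contributes 25 + 0 + 25 = 50 to the within-group sum, so the unexplained share is 150 / 750.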
The same computation in Python
```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales']*3 + ['Engineering']*3 + ['HR']*3,
    'salary_k': [35, 40, 45, 55, 60, 65, 45, 50, 55],
})

y = df['salary_k']
grand_mean = y.mean()                                # 50.0

# Per-row group mean: each salary is replaced by its department's mean
group_mean = df.groupby('department')['salary_k'].transform('mean')

between = ((group_mean - grand_mean) ** 2).sum()     # 600.0
total = ((y - grand_mean) ** 2).sum()                # 750.0
eta_squared = between / total                        # 0.80
```