📈 Linear Regression

Interactive learning aids — Machine Learning Lecture HKA/EIT

Simple Linear Regression

One feature (x), one target (y): the model finds the best straight line ŷ = β₀ + β₁·x

🎮 Interactive Line Fitting — Heating Load Dataset

The dataset shows how Wall Area (m²) affects Heating Load (W/m²). Drag the sliders to manually fit a line, then click Best Fit to see the optimal OLS solution. Watch how residuals and error metrics respond in real time.

Scatter + Regression Line
Residual Plot (y − ŷ)
ŷ = β₀ + β₁ · x
residual eᵢ = yᵢ − ŷᵢ // actual minus predicted
SSE = Σ eᵢ² // sum of squared residuals
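These two formulas can be evaluated directly; a minimal sketch with made-up wall-area and heating-load numbers (not the lecture dataset) and an arbitrary candidate line:

```python
import numpy as np

# Hypothetical wall-area (m²) / heating-load (W/m²) samples — made-up numbers
x = np.array([120.0, 150.0, 180.0, 210.0, 240.0])
y = np.array([12.1, 14.8, 18.2, 20.9, 24.5])

beta0, beta1 = 0.5, 0.1        # candidate line ŷ = β₀ + β₁·x
y_hat = beta0 + beta1 * x      # predictions
residuals = y - y_hat          # eᵢ = yᵢ − ŷᵢ (actual minus predicted)
sse = np.sum(residuals ** 2)   # SSE = Σ eᵢ²  → 1.10 for this line
```

Moving the sliders corresponds to changing `beta0` and `beta1` and recomputing `sse`.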

Least Squares & Loss Function

Why minimize the sum of squared errors instead of just the sum of errors?

⚖️ Why Squared? — Interactive Demonstration

Adjust the prediction line (ŷ = β₀ + β₁·x) using the sliders. Watch how Sum of Errors can be near zero even with a terrible fit — positive and negative residuals cancel out! SSE prevents this by squaring each error first.

6 sample points + your prediction line
Residual bars (e = y − ŷ)
Sum of Errors (Σeᵢ)
SSE (Σeᵢ²)
Conclusion: Sum of Errors is useless as a loss function because cancellation hides real errors. Squaring ensures every error contributes positively. That's why OLS minimizes SSE, not SE.

🏔️ SSE Loss Surface — Contour Map

The SSE forms a convex bowl over the parameter space (β₀, β₁). Every point on this map represents a different line; the color shows the SSE for that line. The minimum (darkest green) is where OLS finds the optimal parameters.

SSE Contour — click to explore
L(β₀, β₁) = Σ(yᵢ − β₀ − β₁·xᵢ)²

// Set partial derivatives to zero:
∂L/∂β₀ = −2 Σ(yᵢ − β₀ − β₁·xᵢ) = 0
∂L/∂β₁ = −2 Σ xᵢ(yᵢ − β₀ − β₁·xᵢ) = 0

// Matrix form (Normal Equation):
β = (XᵀX)⁻¹ · Xᵀy
Key properties: The loss surface is convex (bowl-shaped), so there is exactly one global minimum. There are no local minima to get trapped in — OLS always finds the best solution.
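Solving the two zero-gradient equations for a single feature gives the familiar closed form β₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² and β₀ = ȳ − β₁x̄. A minimal sketch with made-up data, cross-checked against NumPy's own least-squares fit:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Closed-form OLS estimates
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Cross-check against numpy's degree-1 least-squares polynomial fit
b1_ref, b0_ref = np.polyfit(x, y, 1)
assert np.allclose([beta0, beta1], [b0_ref, b1_ref])
```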

Train / Test Split

Before training, divide data into subsets. The model must generalize to unseen data.

✂️ Interactive Split — Watch Generalization

Adjust the split ratio and observe: the model is trained only on green points (training set) but evaluated on blue points (test set). A good model performs well on both.

Train (green) vs Test (blue)
80/20 Rule (Pareto Principle)
A common split: 80% training, 20% testing. Loosely inspired by the Pareto principle — 80% of effect is driven by 20% of causes. In practice, ratio depends on dataset size. There is no theoretically optimal ratio. It's always a tradeoff — more training data improves the model, more test data improves your confidence in the evaluation.
⚠️ Never use test data during training or tuning! The test set must remain unseen until final evaluation. Using it earlier causes data leakage — overly optimistic metrics that don't reflect real-world performance.
Scaling after split: Fit the scaler on training data only (fit_transform), then apply to test data (transform). This prevents information from the test set leaking into training.
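A minimal sketch of the split-then-scale order on a toy dataset (the 80/20 ratio and random_state here are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy feature
y = 3.0 * X.ravel() + 5.0                      # toy target

# 80/20 split; fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit statistics on training data ONLY
X_test_s = scaler.transform(X_test)        # reuse training statistics — no leakage
```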

Evaluation Metrics

Quantify model performance with MAE, MSE, RMSE, and R².

🧮 Interactive Metric Calculator — Step by Step

Enter actual and predicted values (comma-separated). Every metric is computed with full working shown.

📐 Formulas

MAE = (1/n) Σ|yᵢ − ŷᵢ| // avg absolute error
MSE = (1/n) Σ(yᵢ − ŷᵢ)² // avg squared error
RMSE = √MSE // same unit as y
R² = 1 − SSE/SST // variance explained
  SSE = Σ(yᵢ − ŷᵢ)² // sum of sq. residuals
  SST = Σ(yᵢ − ȳ)² // total sum of squares
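The four formulas can be checked by hand on a small made-up example:

```python
import numpy as np

y     = np.array([10.0, 12.0, 15.0, 18.0, 20.0])   # actual
y_hat = np.array([11.0, 11.5, 16.0, 17.0, 21.0])   # predicted

e    = y - y_hat
mae  = np.mean(np.abs(e))              # 0.9    — avg absolute error
mse  = np.mean(e ** 2)                 # 0.85   — avg squared error
rmse = np.sqrt(mse)                    # ≈ 0.92 — back in units of y
sst  = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2   = 1 - np.sum(e ** 2) / sst        # 0.9375 — variance explained
```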

🎯 Interpretation Guide

| Metric | Range | Perfect | Meaning |
|---|---|---|---|
| MAE | [0, ∞) | 0 | Average error in original units. Robust to outliers. |
| MSE | [0, ∞) | 0 | Penalizes large errors quadratically. Sensitive to outliers. |
| RMSE | [0, ∞) | 0 | Square root of MSE, in original units. Most commonly reported. |
| R² | (−∞, 1] | 1 | 1 = perfect. 0 = no better than mean. <0 = worse than mean. |
Heating Load example: MAE=1.38, MSE=2.80, RMSE=1.76, R²=0.90 — the model explains 90% of the variance.

Outlier Impact

A single outlier can dramatically distort the regression line and all metrics.

💥 Click to Place an Outlier

The dashed green line is the clean fit (no outlier). Click anywhere on the plot to add an outlier — watch the red line (new fit) get pulled toward it. Compare R² before and after.

Click to add/move outlier
Clean Fit R²
With Outlier R²
From the lecture: Outlier Impact
Without outlier: R² = 0.90. With outlier in training: R² = 0.21 (train), R² = −0.29 (test). A single outlier destroyed the model — R² went negative, meaning the model is worse than predicting the mean.
Why? OLS minimizes squared errors — large residuals (outliers) have disproportionate influence. Solutions: remove outliers, use robust regression (e.g., Huber loss), or apply regularization.
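A sketch of the robust-regression fix using scikit-learn's HuberRegressor on synthetic data (true slope 2.0) with a single injected outlier; the data and the outlier magnitude are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 50)
y[-1] += 40.0                       # inject one large outlier at the right edge

ols   = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)  # Huber loss down-weights large residuals

# OLS slope gets pulled away from 2.0 by the outlier; Huber's stays close
print(ols.coef_[0], huber.coef_[0])
```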

Multiple Linear Regression

Extend to m features: ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₘxₘ

📐 Matrix Notation & Normal Equation

// Feature matrix X (n samples, m+1 columns):
X = [[1, x₁⁽¹⁾, x₂⁽¹⁾, …, xₘ⁽¹⁾],
     [1, x₁⁽²⁾, x₂⁽²⁾, …, xₘ⁽²⁾],
     …
     [1, x₁⁽ⁿ⁾, x₂⁽ⁿ⁾, …, xₘ⁽ⁿ⁾]]

// Coefficient vector:
β = [β₀, β₁, β₂, …, βₘ]ᵀ

// Prediction:
ŷ = X · β

// Normal Equation (closed-form solution):
β = (XᵀX)⁻¹ · Xᵀy
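The Normal Equation can be evaluated directly with NumPy; a sketch on synthetic data with known coefficients (`np.linalg.solve` avoids forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X_raw = rng.normal(size=(n, 2))                 # two features
y = 4.0 + 2.0 * X_raw[:, 0] - 3.0 * X_raw[:, 1] + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), X_raw])        # prepend the intercept column of 1s
# Normal Equation: β = (XᵀX)⁻¹ Xᵀy — solve the linear system instead of inverting
beta = np.linalg.solve(X.T @ X, X.T @ y)
# beta ≈ [4, 2, -3] up to the injected noise
```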

🧪 2-Feature Prediction Surface

Adjust weights to see how the prediction surface changes. Color = predicted ŷ value.

Polynomial Features & Overfitting

Capture nonlinear patterns — but beware of model complexity.

🎮 Interactive: Polynomial Degree vs Fit Quality

The true function (dashed) is a quadratic: y = 1 + 2x + 3x² with Gaussian noise (σ=10). Increase the polynomial degree and watch the model transition from underfitting → good fit → overfitting. Pay attention to train vs test error.

# Data (np.random.seed(0)):
x = np.linspace(-3, 3, 100)
y_true = 1 + 2*x + 3*x**2
y = y_true + np.random.normal(scale=10, size=x.shape)
# train_test_split(test_size=0.3, random_state=42)
# 70 train / 30 test samples
Polynomial Regression — Degree 1

🔧 How Polynomial Features Work

// Original (2 features):
[x₁, x₂]

// Degree-2 (full quadratic):
[1, x₁, x₂, x₁², x₁·x₂, x₂²]

// Degree-3 adds:
[…, x₁³, x₁²·x₂, x₁·x₂², x₂³]

// Still linear in parameters!
// ŷ = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + …

💡 Kinetic Energy Example

KE = ½mv² — physics tells us V² matters. Adding this single domain-informed feature is better than brute-force polynomial expansion:

| Features | R² Train | R² Test | MSE Test |
|---|---|---|---|
| M, V (linear) | 0.94 | 0.93 | 22,450,540 |
| M, V, V² | 1.00 | 1.00 | 756,358 |
Domain knowledge beats brute force! Adding V² captures the true physics. The MSE dropped by 97%.
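A sketch of the idea on simulated kinetic-energy data (these numbers are synthetic, not the lecture's dataset; since mass also multiplies v², the v² column alone is not perfect here, but it still lifts R² substantially over the plain linear features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
m = rng.uniform(1, 10, 200)            # mass
v = rng.uniform(0, 30, 200)            # velocity
ke = 0.5 * m * v**2                    # true physics: KE = ½mv²

X_lin = np.column_stack([m, v])        # plain linear features
X_dom = np.column_stack([m, v, v**2])  # + domain-informed v² feature

r2_lin = LinearRegression().fit(X_lin, ke).score(X_lin, ke)
r2_dom = LinearRegression().fit(X_dom, ke).score(X_dom, ke)
print(r2_lin, r2_dom)                  # the v² column raises R² markedly
```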

Bias-Variance Tradeoff

The fundamental tension in machine learning: too simple vs too complex.

📈 Complexity vs Error — The U-Shaped Curve

Training error always decreases with complexity. But test error follows a U-shape: it decreases initially (model learns the pattern) then increases (model memorizes noise). The sweet spot is where test error is minimized.

Training Error vs Test Error

🔴 Underfitting (High Bias)

Model too simple — cannot capture the pattern.

Symptoms
High training error AND high test error. Both are bad.
Fix
• Add more features
• Increase polynomial degree
• Use a more complex model

Good Fit (Balance)

Model captures the true pattern without memorizing noise.

Symptoms
Low training error. Low test error. Small generalization gap.
This is the goal
Bias² + Variance is minimized. The model generalizes.

🔵 Overfitting (High Variance)

Model too complex — memorizes training noise.

Symptoms
Very low training error but high test error. Large generalization gap.
Fix
• Regularization (L1/L2)
• More training data
• Reduce features / degree
• Cross-validation

📋 Bias-Variance Decomposition

Total Error = Bias² + Variance + Irreducible Noise

// Bias: error from wrong assumptions (too simple)
// Variance: error from sensitivity to training data (too complex)
// Noise: inherent randomness in data (cannot reduce)
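The decomposition can be illustrated numerically: refit polynomials of different degree on many freshly-noised datasets drawn from the same quadratic (the setup mirrors the earlier polynomial demo; the evaluation point x₀ = 3 and the repeat count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    return 1 + 2*x + 3*x**2        # same quadratic as the polynomial demo

x_grid = np.linspace(-3, 3, 30)
x0 = 3.0                           # evaluate bias/variance at one point

def bias2_and_var(degree, n_repeats=300):
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # fresh noisy dataset each repeat, same underlying function
        y = true_f(x_grid) + rng.normal(0, 10, x_grid.size)
        preds[i] = np.polyval(np.polyfit(x_grid, y, degree), x0)
    return (preds.mean() - true_f(x0)) ** 2, preds.var()

for degree in (1, 2, 10):
    bias2, var = bias2_and_var(degree)
    print(degree, round(bias2, 1), round(var, 1))
# qualitatively: degree 1 → high bias²; degree 10 → high variance; degree 2 → both low
```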

Regularization

Add a penalty term to the loss function to prevent overfitting by shrinking coefficients.

📖 Overview
🔷 Geometry
📊 Coefficients
📈 Reg. Paths
🔗 ElasticNet
⚔️ Compare

🎯 What is Regularization?

Regularization adds a penalty term to the loss function to prevent overfitting. It discourages complex models by shrinking coefficients toward zero.

🟢 Ridge (L2)

Loss = RSS + λ · Σ βⱼ²

Adds the sum of squared coefficients. Shrinks toward zero but never exactly to zero.

Keeps all features Smooth shrinkage

🌿 Lasso (L1)

Loss = RSS + λ · Σ |βⱼ|

Adds the sum of absolute coefficients. Can shrink coefficients exactly to zero — automatic feature selection!

Feature selection Sparse models

💡 Key Insight

Lasso produces sparse models (some coefficients become exactly zero). Ridge keeps all features with smaller values. This comes from the geometry of their constraint regions — explore the Geometry tab to see why!
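A sketch of the sparsity difference on synthetic data (8 features, only two truly relevant; the α values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
# only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.sum(np.abs(ridge.coef_) < 1e-8))   # Ridge: no exact zeros — all 8 kept
print(np.sum(np.abs(lasso.coef_) < 1e-8))   # Lasso: irrelevant features zeroed out
```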

🔧 CNC Manufacturing Example (Bruchsal)

The Bruchsal subset has ~44 training samples and ~48 features after one-hot encoding (p ≈ n). This causes severe overfitting with plain linear regression:

| Model | R² Train | R² Test | Non-zero Coefs |
|---|---|---|---|
| Linear (no reg.) | 1.00 | 0.78 | 48/48 |
| Ridge (α=0.1) | 0.98 | 0.84 | 48/48 |
| Lasso (α=0.1) | 0.97 | 0.95 | 25/48 |
| ElasticNetCV | 0.98 | 0.93 | 31/48 |

🔷 Geometric Interpretation

Regularization is a constrained optimization: minimize RSS subject to coefficients inside a constraint region. The solution is where the RSS contour ellipses first touch the constraint region.

How to read: Amber ellipses = RSS contours (center = OLS). Green circle = Ridge constraint. Orange diamond = Lasso constraint. Red dot = regularized solution.
🔑 The diamond's sharp corners sit on the axes (where one β = 0). Ellipses are far more likely to hit a corner → Lasso produces zeros!

🟢 Ridge — Circle

🌿 Lasso — Diamond

🧠 Why does the shape matter?

Try rotating the RSS ellipses with the slider. The diamond's corners "catch" the ellipse across many angles, giving a solution where one coefficient is exactly zero. The circle is smooth, so contact almost always has both coefficients nonzero. This is why Lasso does automatic feature selection!

📊 Coefficient Shrinkage — 8 Features

Drag λ to see how increasing regularization shrinks coefficients differently. Lasso drives small coefficients to exactly zero, while Ridge only approaches zero.

🟢 Ridge Coefficients

🌿 Lasso Coefficients

🔢 Summary

Ridge
8/8 features active — all kept
Lasso
8/8 features active

📈 Regularization Paths

How each coefficient changes as λ increases (left → right). Ridge decays smoothly. Lasso coefficients hit zero at specific λ values — showing clear feature elimination.

🟢 Ridge Path

🌿 Lasso Path
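scikit-learn computes the full Lasso path directly via `lasso_path`; a sketch on synthetic data (feature count and coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 5))
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.3, 80)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
# coefs has shape (n_features, n_alphas); alphas run from largest to smallest
n_active = np.sum(np.abs(coefs) > 1e-8, axis=0)
# at the largest alpha everything is zeroed; features enter one by one as alpha shrinks
print(n_active[0], n_active[-1])
```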

🔗 ElasticNet — Best of Both Worlds

ElasticNet was introduced by Zou & Hastie (2005) to address the limitations of both Ridge and Lasso. It combines L1 and L2 penalties in a single loss function.

📐 Cost Function

Loss = RSS + α · ρ · Σ|βⱼ| + α · (1−ρ)/2 · Σβⱼ²

where:
α (alpha) = overall regularization strength
ρ (l1_ratio) = mix between L1 and L2, ρ ∈ [0, 1]
ρ = 1.0 → pure Lasso | ρ = 0.0 → pure Ridge | 0 < ρ < 1 → blend

Why Not Just Lasso?

When features are highly correlated (e.g., the three vibration axes in our CNC dataset), Lasso has a fundamental problem:

Lasso's Problem
Lasso arbitrarily picks one correlated feature and zeros out the others. On a different random split, it might pick a different one. This makes the model unstable.

Example: If vibration_x ≈ vibration_y, Lasso might give (0.5, 0) or (0, 0.5) depending on the data split.
Ridge's Solution (partial)
Ridge distributes weight evenly among correlated features: (0.25, 0.25). This is stable, but Ridge never eliminates features — you still keep all 48 coefficients.
ElasticNet's Solution
ElasticNet either keeps or drops correlated features as a group (the "grouping effect"). It won't arbitrarily pick one over another. You get sparsity AND stability.

🔷 Geometric Interpretation

The constraint region of ElasticNet is a rounded diamond — a blend between the L1 diamond and the L2 circle:

Ridge
Circle — no corners
→ no zeros
ElasticNet
Rounded diamond
→ some zeros, stable
Lasso
Sharp diamond
→ many zeros, unstable

The rounded corners still allow sparsity (coefficients can be zero), but the smoothing from L2 makes the solution less sensitive to small changes in the data.

🧪 CNC Dataset Results (Bruchsal)

ElasticNetCV automatically searches over both α and l1_ratio via 5-fold cross-validation:

Best Parameters
α = 0.0095 | l1_ratio = 0.60
→ 60% Lasso / 40% Ridge blend
Performance
Train R² = 0.976 | Test R² = 0.932
→ Best generalization of all models
Interpretation: l1_ratio = 0.60 means the model prefers sparsity (feature selection) but still benefits from L2 stability for handling correlated features like the vibration axes. The small α means only mild regularization is needed.

🐍 Scikit-Learn Implementation

from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNetCV

pipe = Pipeline([
    ('preprocessor', mergedtransf(cnc_br)),  # lecture's custom preprocessing transformer
    ('model', ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
        alphas=None,   # let sklearn compute the alpha path automatically
        cv=5, n_jobs=-1
    ))
])
pipe.fit(X_train, y_train)

# Best parameters found:
pipe['model'].alpha_      # → 0.0095
pipe['model'].l1_ratio_   # → 0.60

⚔️ Side-by-Side Comparison

| Aspect | Ridge (L2) | Lasso (L1) | ElasticNet (L1+L2) |
|---|---|---|---|
| Penalty | α · Σ βⱼ² | α · Σ \|βⱼ\| | α·ρ·Σ\|βⱼ\| + α·(1−ρ)/2·Σβⱼ² |
| Geometry | 🟢 Circle / Sphere | 🌿 Diamond / Polytope | 🔗 Rounded Diamond |
| Feature Selection | ❌ No — keeps all | ✅ Yes — zeros out some | ✅ Yes — controlled by ρ |
| Solution Form | Closed form | Iterative | Iterative |
| Correlated Features | Shares weight evenly ✅ | Picks one, zeros others ⚠️ | Groups them together ✅ |
| Hyperparameters | α | α | α + ρ (l1_ratio) |
| Best When | Many small effects | Few important features | Mixed + correlated features |

🤔 When to use which?

Use Ridge when most features contribute and you want to shrink all evenly. Great with correlated features, but no feature selection.

Use Lasso when many features are irrelevant and you want automatic feature selection. Caution: unstable with correlated features.

Use ElasticNet when you want sparsity (like Lasso) AND stability with correlated features (like Ridge). Use ElasticNetCV to automatically tune both α and ρ.

🧪 Quick Quiz

K-Fold Cross-Validation

Every data point gets to be in the validation set exactly once — robust model evaluation.

🎮 Interactive K-Fold Visualization

Each row shows one fold iteration. The red block is the validation fold; green blocks are training. After K iterations, every data point has been validated exactly once.

K-Fold Cross-Validation

📋 Why Cross-Validation?

Problem with single split
• What if the validation set doesn't represent the full distribution?
• With limited data, a single split wastes valuable training samples
• Results depend on which data ended up in which set
K-Fold solves this
• Every sample is validated exactly once
• Final score = average of K scores → more reliable
• Reduces variance of the performance estimate
• Standard: K=5 or K=10 in practice
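A minimal K-fold sketch with `cross_val_score` on synthetic data (K=5 and the data itself are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.2, 100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring='r2')
# final score = average of the K fold scores; the std shows its variability
print(scores.mean(), scores.std())
```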

Grid Search & Hyperparameter Tuning

Systematically find the best hyperparameters using cross-validation.

🔍 Parameters vs Hyperparameters

Parameters (learned from data)
Determined during training automatically.
Examples: weights β₀, β₁, …, βₘ
You don't choose these — the algorithm finds them.
Hyperparameters (set by you)
Set before training — control the learning process.
Examples: λ (regularization strength), polynomial degree, K in KNN, learning rate
How to choose? → Grid Search + Cross-Validation!

🎮 Interactive Grid Search — Find Best λ for Ridge Regression

This simulates grid search with 5-fold CV for 8 different λ values. Each bar shows the average CV R² across 5 folds. The best λ is highlighted. Click any bar to see that λ's fold-by-fold scores.

Grid Search: Average 5-Fold CV R² per λ

📋 Full Pipeline: Grid Search + K-Fold CV — Step by Step

This diagram shows the complete workflow from raw data to final deployed model. Follow the numbered steps and the arrows. The test set stays sealed until the very last step.

Complete Hyperparameter Tuning Pipeline

⚠️ Why Not Use the Test Set for Tuning?

If you evaluate hyperparameters on the test set, you're indirectly training on it — choosing hyperparameters that happen to work well on those specific test points. The test set becomes "seen" data, and your final performance estimate is overly optimistic.

The test set must remain completely sealed until the final evaluation. This is why we need a separate validation mechanism — and K-fold CV is the most robust approach.
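`GridSearchCV` implements exactly this loop — each candidate λ (called `alpha` in scikit-learn) is scored with K-fold CV on the training data only, and the winner is refit; a sketch on synthetic data with an arbitrary alpha grid:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(0, 0.3, 100)

grid = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5, scoring='r2')   # 5 alphas × 5 folds = 25 fits, plus one final refit
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```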

💰 Computational Cost

Grid search trains K × G models, where G = number of grid points and K = folds.

// Example: 8 λ values × 5 folds = 40 models!
// With 2 hyperparameters (e.g., λ and degree):
// 8 × 5 × 5 folds = 200 models

// Alternatives:
Random Search: sample randomly from param space
Bayesian Optimization: smart sequential search
Evolutionary Algorithms: population-based search