Simple Linear Regression
One feature (x), one target (y): the model finds the best straight line ŷ = β₀ + β₁·x
🎮 Interactive Line Fitting — Heating Load Dataset
The dataset shows how Wall Area (m²) affects Heating Load (W/m²). Drag the sliders to manually fit a line, then click Best Fit to see the optimal OLS solution. Watch how residuals and error metrics respond in real time.
residual eᵢ = yᵢ − ŷᵢ // actual minus predicted
SSE = Σ eᵢ² // sum of squared residuals
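These two quantities can be computed directly; a minimal NumPy sketch with made-up wall-area/heating-load numbers:

```python
import numpy as np

# Toy data: wall area (x) vs heating load (y) -- made-up values for illustration
x = np.array([250.0, 300.0, 350.0, 400.0])
y = np.array([15.0, 20.0, 22.0, 30.0])

# Candidate line: y_hat = b0 + b1 * x
b0, b1 = -10.0, 0.1
y_hat = b0 + b1 * x

residuals = y - y_hat          # e_i = y_i - y_hat_i (actual minus predicted)
sse = np.sum(residuals ** 2)   # SSE = sum of squared residuals
```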
Least Squares & Loss Function
Why minimize the sum of squared errors instead of just the sum of errors?
⚖️ Why Squared? — Interactive Demonstration
Adjust the prediction line (ŷ = β₀ + β₁·x) using the sliders. Watch how Sum of Errors can be near zero even with a terrible fit — positive and negative residuals cancel out! SSE prevents this by squaring each error first.
🏔️ SSE Loss Surface — Contour Map
The SSE forms a convex bowl over the parameter space (β₀, β₁). Every point on this map represents a different line; the color shows the SSE for that line. The minimum (darkest green) is where OLS finds the optimal parameters.
// Set partial derivatives to zero:
∂L/∂β₀ = −2 Σ(yᵢ − β₀ − β₁·xᵢ) = 0
∂L/∂β₁ = −2 Σ xᵢ(yᵢ − β₀ − β₁·xᵢ) = 0
// Matrix form (Normal Equation):
β = (XᵀX)⁻¹ · Xᵀy
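The normal equation can be solved directly with NumPy (toy data; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

# Toy data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)    # solve (X^T X) beta = X^T y

b0, b1 = beta
```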
Train / Test Split
Before training, divide data into subsets. The model must generalize to unseen data.
✂️ Interactive Split — Watch Generalization
Adjust the split ratio and observe: the model is trained only on green points (training set) but evaluated on blue points (test set). A good model performs well on both.
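A sketch of the split-then-scale workflow with scikit-learn (synthetic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # learns mean/std from training data
X_test_s = scaler.transform(X_test)         # reuses those statistics -- no leakage
```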
Fit any preprocessing (e.g., a scaler) on the training data only (fit_transform), then apply it to the test data (transform). This prevents information from the test set leaking into training.
Evaluation Metrics
Quantify model performance with MAE, MSE, RMSE, and R².
🧮 Interactive Metric Calculator — Step by Step
Enter actual and predicted values (comma-separated). Every metric is computed with full working shown.
📐 Formulas
MSE = (1/n) Σ(yᵢ − ŷᵢ)² // avg squared error
RMSE = √MSE // same unit as y
R² = 1 − SSE/SST // variance explained
SSE = Σ(yᵢ − ŷᵢ)² // sum of sq. residuals
SST = Σ(yᵢ − ȳ)² // total sum of squares
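The formulas above map one-to-one onto NumPy; a hand-rolled sketch on a small example:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_true - y_pred
mae  = np.mean(np.abs(errors))            # average absolute error
mse  = np.mean(errors ** 2)               # average squared error
rmse = np.sqrt(mse)                       # back to the units of y

sse = np.sum(errors ** 2)                 # sum of squared residuals
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2  = 1 - sse / sst                       # fraction of variance explained
```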
🎯 Interpretation Guide
| Metric | Range | Perfect | Meaning |
|---|---|---|---|
| MAE | [0, ∞) | 0 | Average error in original units. Robust to outliers. |
| MSE | [0, ∞) | 0 | Penalizes large errors quadratically. Sensitive to outliers. |
| RMSE | [0, ∞) | 0 | MSE in original units. Most commonly reported. |
| R² | (−∞, 1] | 1 | 1 = perfect. 0 = no better than mean. <0 = worse than mean. |
Outlier Impact
A single outlier can dramatically distort the regression line and all metrics.
💥 Click to Place an Outlier
The dashed green line is the clean fit (no outlier). Click anywhere on the plot to add an outlier — watch the red line (new fit) get pulled toward it. Compare R² before and after.
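The pull of a single outlier can be reproduced numerically (illustrative data; `np.polyfit` does the line fit):

```python
import numpy as np

# Clean data lying exactly on the line y = 2x + 1
x = np.arange(10, dtype=float)
y = 2 * x + 1

def fit_and_r2(x, y):
    b1, b0 = np.polyfit(x, y, deg=1)      # least-squares line
    resid = y - (b0 + b1 * x)
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

r2_clean = fit_and_r2(x, y)               # essentially 1.0 for noiseless data

# Add one extreme outlier and refit
x_out = np.append(x, 5.0)
y_out = np.append(y, 100.0)
r2_outlier = fit_and_r2(x_out, y_out)     # R^2 collapses
```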
Multiple Linear Regression
Extend to m features: ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₘxₘ
📐 Matrix Notation & Normal Equation
X = [[1, x₁⁽¹⁾, x₂⁽¹⁾, …, xₘ⁽¹⁾],
[1, x₁⁽²⁾, x₂⁽²⁾, …, xₘ⁽²⁾],
…
[1, x₁⁽ⁿ⁾, x₂⁽ⁿ⁾, …, xₘ⁽ⁿ⁾]]
// Coefficient vector:
β = [β₀, β₁, β₂, …, βₘ]ᵀ
// Prediction:
ŷ = X · β
// Normal Equation (closed-form solution):
β = (XᵀX)⁻¹ · Xᵀy
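For several features the same closed form applies; a sketch using `np.linalg.lstsq` (the numerically preferred route) on noiseless synthetic data:

```python
import numpy as np

# Two-feature example generated from y = 4 + 2*x1 - 3*x2 (no noise)
rng = np.random.default_rng(0)
X_feat = rng.uniform(0, 10, size=(50, 2))
y = 4 + 2 * X_feat[:, 0] - 3 * X_feat[:, 1]

# Prepend the intercept column of ones
X = np.hstack([np.ones((50, 1)), X_feat])

# Least-squares solution of X beta = y (equivalent to the normal equation)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```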
🧪 2-Feature Prediction Surface
Adjust weights to see how the prediction surface changes. Color = predicted ŷ value.
Polynomial Features & Overfitting
Capture nonlinear patterns — but beware of model complexity.
🎮 Interactive: Polynomial Degree vs Fit Quality
The true function (dashed) is a quadratic: y = 1 + 2x + 3x² with Gaussian noise (σ=10). Increase the polynomial degree and watch the model transition from underfitting → good fit → overfitting. Pay attention to train vs test error.
import numpy as np
from sklearn.model_selection import train_test_split

x = np.linspace(-3, 3, 100)
y_true = 1 + 2*x + 3*x**2
y = y_true + np.random.normal(scale=10, size=x.shape)
# train_test_split(test_size=0.3, random_state=42)
# → 70 train / 30 test samples
🔧 How Polynomial Features Work
[x₁, x₂]
// Degree-2 (full quadratic):
[1, x₁, x₂, x₁², x₁·x₂, x₂²]
// Degree-3 adds:
[…, x₁³, x₁²·x₂, x₁·x₂², x₂³]
// Still linear in parameters!
// ŷ = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + …
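scikit-learn's PolynomialFeatures generates exactly this expansion:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])             # one sample with x1=2, x2=3

poly = PolynomialFeatures(degree=2)    # includes the bias column by default
X_poly = poly.fit_transform(X)
# Columns: [1, x1, x2, x1^2, x1*x2, x2^2]
```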
💡 Kinetic Energy Example
KE = ½mv² — physics tells us V² matters. Adding this single domain-informed feature is better than brute-force polynomial expansion:
| Features | R² Train | R² Test | MSE Test |
|---|---|---|---|
| M, V (linear) | 0.94 | 0.93 | 22,450,540 |
| M, V, V² | 1.00 | 1.00 | 756,358 |
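The effect can be reproduced in a simplified setting; the sketch below assumes a single object of fixed mass, so KE is exactly linear in the engineered feature v²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical setup: a 2 kg object, KE = 0.5 * m * v^2 with m fixed
v = np.linspace(1, 30, 60)
ke = 0.5 * 2.0 * v ** 2

# Linear in v only: underfits the quadratic relationship
X1 = v.reshape(-1, 1)
r2_linear = r2_score(ke, LinearRegression().fit(X1, ke).predict(X1))

# Add the domain-informed feature v^2: the relationship becomes exactly linear
X2 = np.column_stack([v, v ** 2])
r2_domain = r2_score(ke, LinearRegression().fit(X2, ke).predict(X2))
```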
Bias-Variance Tradeoff
The fundamental tension in machine learning: too simple vs too complex.
📈 Complexity vs Error — The U-Shaped Curve
Training error always decreases with complexity. But test error follows a U-shape: it decreases initially (model learns the pattern) then increases (model memorizes noise). The sweet spot is where test error is minimized.
🔴 Underfitting (High Bias)
Model too simple — cannot capture the pattern.
• Increase polynomial degree
• Use a more complex model
✅ Good Fit (Balance)
Model captures the true pattern without memorizing noise.
🔵 Overfitting (High Variance)
Model too complex — memorizes training noise.
• More training data
• Reduce features / degree
• Cross-validation
📋 Bias-Variance Decomposition
// Expected test error = Bias² + Variance + Irreducible noise
// Bias: error from wrong assumptions (too simple)
// Variance: error from sensitivity to training data (too complex)
// Noise: inherent randomness in data (cannot reduce)
Regularization
Add a penalty term to the loss function to prevent overfitting by shrinking coefficients.
🎯 What is Regularization?
Regularization adds a penalty term to the loss function to prevent overfitting. It discourages complex models by shrinking coefficients toward zero.
🟢 Ridge (L2)
Adds the sum of squared coefficients. Shrinks toward zero but never exactly to zero.
Keeps all features · Smooth shrinkage
🌿 Lasso (L1)
Adds the sum of absolute coefficients. Can shrink coefficients exactly to zero — automatic feature selection!
Feature selection · Sparse models
💡 Key Insight
Lasso produces sparse models (some coefficients become exactly zero). Ridge keeps all features with smaller values. This comes from the geometry of their constraint regions — explore the Geometry tab to see why!
🔧 CNC Manufacturing Example (Bruchsal)
The Bruchsal subset has ~44 training samples and ~48 features after one-hot encoding (p ≈ n). This causes severe overfitting with plain linear regression:
| Model | R² Train | R² Test | Non-zero Coefs |
|---|---|---|---|
| Linear (no reg.) | 1.00 | 0.78 | 48/48 |
| Ridge (α=0.1) | 0.98 | 0.84 | 48/48 |
| Lasso (α=0.1) | 0.97 | 0.95 | 25/48 |
| ElasticNetCV | 0.98 | 0.93 | 31/48 |
🔷 Geometric Interpretation
Regularization is a constrained optimization: minimize RSS subject to coefficients inside a constraint region. The solution is where the RSS contour ellipses first touch the constraint region.
🔑 The diamond's sharp corners sit on the axes (where one β = 0). Ellipses are far more likely to hit a corner → Lasso produces zeros!
🟢 Ridge — Circle
🌿 Lasso — Diamond
🧠 Why does the shape matter?
Try rotating the RSS ellipses with the slider. The diamond's corners "catch" the ellipse across many angles, giving a solution where one coefficient is exactly zero. The circle is smooth, so contact almost always has both coefficients nonzero. This is why Lasso does automatic feature selection!
📊 Coefficient Shrinkage — 8 Features
Drag λ to see how increasing regularization shrinks coefficients differently. Lasso drives small coefficients to exactly zero, while Ridge only approaches zero.
🟢 Ridge Coefficients
🌿 Lasso Coefficients
🔢 Summary
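The shrinkage contrast can be reproduced with scikit-learn; the synthetic data and α values below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
# Only the first two features matter; the other six are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))   # Ridge shrinks but never zeroes
n_zero_lasso = int(np.sum(lasso.coef_ == 0))   # Lasso eliminates irrelevant features
```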
📈 Regularization Paths
How each coefficient changes as λ increases (left → right). Ridge decays smoothly. Lasso coefficients hit zero at specific λ values — showing clear feature elimination.
🟢 Ridge Path
🌿 Lasso Path
🔗 ElasticNet — Best of Both Worlds
ElasticNet was introduced by Zou & Hastie (2005) to address the limitations of both Ridge and Lasso. It combines L1 and L2 penalties in a single loss function.
📐 Cost Function
L(β) = SSE + α·ρ·Σ|βⱼ| + α·(1−ρ)/2 · Σβⱼ²
where:
α (alpha) = overall regularization strength
ρ (l1_ratio) = mix between L1 and L2, ρ ∈ [0, 1]
ρ = 1.0 → pure Lasso | ρ = 0.0 → pure Ridge | 0 < ρ < 1 → blend
❓ Why Not Just Lasso?
When features are highly correlated (e.g., the three vibration axes in our CNC dataset), Lasso has a fundamental problem: it arbitrarily keeps one feature from the correlated group and zeros out the rest, and which one survives is unstable.
Example: if vibration_x ≈ vibration_y, Lasso might give (0.5, 0) or (0, 0.5) depending on the data split.
🔷 Geometric Interpretation
The constraint region of ElasticNet is a rounded diamond — a blend between the L1 diamond and the L2 circle:
Circle — no corners → no zeros
Rounded diamond → some zeros, stable
Sharp diamond → many zeros, unstable
The rounded corners still allow sparsity (coefficients can be zero), but the smoothing from L2 makes the solution less sensitive to small changes in the data.
🧪 CNC Dataset Results (Bruchsal)
ElasticNetCV automatically searches over both α and l1_ratio via 5-fold cross-validation:
→ 60% Lasso / 40% Ridge blend
→ Best generalization of all models
🐍 Scikit-Learn Implementation
pipe = Pipeline([
('preprocessor', mergedtransf(cnc_br)),
('model', ElasticNetCV(
l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
alphas=None, # auto alpha path
cv=5, n_jobs=-1
))
])
pipe.fit(X_train, y_train)
# Best parameters found:
pipe['model'].alpha_ # → 0.0095
pipe['model'].l1_ratio_ # → 0.60
⚔️ Side-by-Side Comparison
| Aspect | Ridge (L2) | Lasso (L1) | ElasticNet (L1+L2) |
|---|---|---|---|
| Penalty | α · Σ βⱼ² | α · Σ \|βⱼ\| | α·ρ·Σ\|βⱼ\| + α·(1−ρ)/2·Σβⱼ² |
| Geometry | 🟢 Circle / Sphere | 🌿 Diamond / Polytope | 🔗 Rounded Diamond |
| Feature Selection | ❌ No — keeps all | ✅ Yes — zeros out some | ✅ Yes — controlled by ρ |
| Solution Form | Closed form | Iterative | Iterative |
| Correlated Features | Shares weight evenly ✅ | Picks one, zeros others ⚠️ | Groups them together ✅ |
| Hyperparameters | α | α | α + ρ (l1_ratio) |
| Best When | Many small effects | Few important features | Mixed + correlated features |
🤔 When to use which?
Use Ridge when most features contribute and you want to shrink all evenly. Great with correlated features, but no feature selection.
Use Lasso when many features are irrelevant and you want automatic feature selection. Caution: unstable with correlated features.
Use ElasticNet when you want sparsity (like Lasso) AND stability with correlated features (like Ridge). Use ElasticNetCV to automatically tune both α and ρ.
🧪 Quick Quiz
K-Fold Cross-Validation
Every data point gets to be in the validation set exactly once — robust model evaluation.
🎮 Interactive K-Fold Visualization
Each row shows one fold iteration. The red block is the validation fold; green blocks are training. After K iterations, every data point has been validated exactly once.
📋 Why Cross-Validation?
• With limited data, a single split wastes valuable training samples
• Results depend on which data ended up in which set
• Final score = average of K scores → more reliable
• Reduces variance of the performance estimate
• Standard: K=5 or K=10 in practice
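A minimal 5-fold CV run with scikit-learn (synthetic data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=1.0, size=50)

# 5-fold CV: each sample lands in the validation fold exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

mean_r2 = scores.mean()   # final score = average over the 5 folds
```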
Grid Search & Hyperparameter Tuning
Systematically find the best hyperparameters using cross-validation.
🔍 Parameters vs Hyperparameters
Parameters are learned from the data during training. Examples: weights β₀, β₁, …, βₘ
You don't choose these — the algorithm finds them.
Hyperparameters are set before training and control the learning process. Examples: λ (regularization strength), polynomial degree, K in KNN, learning rate
How to choose them? → Grid Search + Cross-Validation!
🎮 Interactive Grid Search — Find Best λ for Ridge Regression
This simulates grid search with 5-fold CV for 8 different λ values. Each bar shows the average CV R² across 5 folds. The best λ is highlighted. Click any bar to see that λ's fold-by-fold scores.
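The same search can be run for real with GridSearchCV; the data and the α grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=80)

# 8 candidate alphas x 5 folds = 40 model fits
param_grid = {"alpha": [0.001, 0.01, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]   # alpha with highest mean CV score
```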
📋 Full Pipeline: Grid Search + K-Fold CV — Step by Step
This diagram shows the complete workflow from raw data to final deployed model. Follow the numbered steps and the arrows. The test set stays sealed until the very last step.
⚠️ Why Not Use the Test Set for Tuning?
If you evaluate hyperparameters on the test set, you're indirectly training on it — choosing hyperparameters that happen to work well on those specific test points. The test set becomes "seen" data, and your final performance estimate is overly optimistic.
💰 Computational Cost
Grid search trains K × G models, where G = number of grid points and K = folds.
// With 2 hyperparameters (e.g., 8 λ values and 5 degrees):
// 8 × 5 grid points × 5 folds = 200 model fits
// Alternatives:
Random Search: sample randomly from param space
Bayesian Optimization: smart sequential search
Evolutionary Algorithms: population-based search