๐Ÿ”ง Data Preparation & Feature Engineering

Interactive learning companion โ€” Hochschule Karlsruhe

ML Project Workflow

A typical ML project follows interconnected steps. Click each phase to learn more.

ML Workflow

Diagram modified from: successfactory management coaching gmbh

๐Ÿง  Quick check: Which step comes right before model training?

Data Types

Understanding data types is the first step in any data analysis.

๐ŸŒณ Data Types Hierarchy

๐Ÿ“Š Quantitative (Numerical)

Can be expressed as numerical values, suitable for statistical analysis.

Continuous: Can take any value within a range.

Temperature 🌡️ · Voltage ⚡ · Height 📏 · Speed 🚗 · Blood Pressure 💉 · Stock Price 📈

Discrete: Can take only distinct, countable values.

Dice Roll 🎲 · Students 🧑‍🎓 · Cars in Lot 🅿️ · Complaints/Day 📝 · Website Clicks 🖱️

๐Ÿท๏ธ Qualitative (Categorical)

Can't be measured numerically โ€” sorted by category, not by number.

Nominal: Categories with no specific order.

Blood Type 🩸 · Color 🎨 · Car Brand 🚘 · Job Title 👔 · Marital Status 💍 · Eye Color 👁️

Ordinal: Has a natural order or ranking.

Education 🎓 · Rating ⭐ · Satisfaction 📊 · T-Shirt Size 👕 · Pain Level 🏥 · Military Rank 🎖️
๐Ÿ’ก Why does this matter? Data type determines which visualizations, statistics, and encoding methods you can use. Treating ordinal data as nominal (or vice versa) leads to incorrect models.

๐ŸŽฏ Classify these features

Exploratory Data Analysis (EDA)

EDA means getting to know your data before modelling. It guides cleaning, feature engineering, and gives you intuition about patterns and problems in the dataset. We use the Heart Failure Prediction Dataset (50 patients, 11 features) as our running example.

๐Ÿ“Š Descriptive Statistics โ€” df.describe()

This is always your first look at numerical features. Select a feature to see its full summary statistics โ€” count, mean, standard deviation, min, quartiles, and max. The histogram below shows the shape of the distribution, with a red line marking the mean:
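The same first look can be sketched in pandas, using a tiny made-up sample in place of the full dataset:

```python
import pandas as pd

# Hypothetical mini-sample in the spirit of the heart failure dataset
df = pd.DataFrame({
    "Age":   [54, 61, 40, 67, 49, 58],
    "MaxHR": [150, 128, 172, 108, 160, 140],
})

# count, mean, std, min, quartiles (25%/50%/75%), max per numerical column
summary = df.describe()
print(summary)
```

`summary.loc["mean", "Age"]` then gives the mean of Age directly; the quartile rows are exactly the values a box plot is built from.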

Distribution Histogram
๐Ÿ“ฆ Interactive Box Plot with Outlier Detection

A box plot visualizes the five-number summary (min, Q1, median, Q3, max) and identifies outliers using the IQR rule. Any point below Q1 โˆ’ 1.5ร—IQR or above Q3 + 1.5ร—IQR is flagged as an outlier.

IQR = Q3 โˆ’ Q1
Lower fence = Q1 โˆ’ 1.5 ร— IQR
Upper fence = Q3 + 1.5 ร— IQR // points outside โ†’ outliers
๐Ÿ’ก
Why it matters: Outliers can heavily skew mean-based imputation and distance-based models (KNN, SVM). Box plots help you decide: is this a data error (remove), a rare but valid case (keep), or an extreme value to cap (winsorize)?
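The IQR rule above takes only a few lines of NumPy; here is a minimal sketch on made-up cholesterol-like values with one extreme point:

```python
import numpy as np

# Illustrative values (made up); 600 is the planted extreme point
x = np.array([180, 200, 210, 220, 240, 260, 600], dtype=float)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

outliers = x[(x < lower) | (x > upper)]
print(outliers)  # only the 600 value is flagged
```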
๐Ÿ”ต Scatter Plot โ€” Correlation Explorer

Scatter plots show the relationship between two numerical features. Color points by the target variable to reveal patterns. A Pearson correlation close to ยฑ1 means a strong linear relationship; near 0 means no linear relationship.

๐Ÿ“
Pearson correlation formula: r = ฮฃ(xแตข โˆ’ ฮผโ‚“)(yแตข โˆ’ ฮผแตง) / (n ร— ฯƒโ‚“ ร— ฯƒแตง). Values: +1 = perfect positive, 0 = none, โˆ’1 = perfect negative. Try MaxHR vs Age โ€” you should see a moderate negative correlation.
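The formula can be checked directly in NumPy on made-up Age/MaxHR pairs (`np.std` defaults to the population form used above, so the hand-rolled value matches `np.corrcoef` exactly):

```python
import numpy as np

# Hypothetical Age / MaxHR pairs; constructed to show a negative relationship
age   = np.array([30, 40, 50, 60, 70], dtype=float)
maxhr = np.array([185, 176, 168, 155, 148], dtype=float)

# r = Σ(xᵢ − μₓ)(yᵢ − μᵧ) / (n · σₓ · σᵧ)
r = np.sum((age - age.mean()) * (maxhr - maxhr.mean())) \
    / (len(age) * age.std() * maxhr.std())
print(round(r, 3))  # → -0.996, a strong negative linear relationship
```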
๐Ÿ“Š Segmentation โ€” Categorical ร— Numerical

How does a numerical feature differ across categories? Grouped box plots are ideal for this โ€” they show the median, quartiles, and outliers for each group side by side. If the boxes barely overlap, the categorical feature is likely predictive.

Grouped Box Plots โ€” compare distributions across categories
๐Ÿ’ก
Key insight: Try "MaxHR grouped by HeartDisease" โ€” the boxes should have minimal overlap, showing patients with heart disease tend to have lower maximum heart rates. Compare with "Cholesterol grouped by HeartDisease" โ€” heavily overlapping boxes mean cholesterol alone is a weaker predictor.
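The same side-by-side comparison can be made numerically with a pandas `groupby`, sketched here on a made-up sample:

```python
import pandas as pd

# Made-up sample: MaxHR grouped by HeartDisease (0 = no, 1 = yes)
df = pd.DataFrame({
    "MaxHR":        [170, 165, 158, 150, 120, 112, 130, 105],
    "HeartDisease": [0,   0,   0,   0,   1,   1,   1,   1],
})

# Per-group median — the line in the middle of each box
medians = df.groupby("HeartDisease")["MaxHR"].median()
print(medians)  # the disease group sits well below the healthy group
```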
๐Ÿ“‹ Categorical Feature Summary

For categorical features, we look at value counts and frequencies instead of mean/std. An imbalanced category distribution (e.g., 90% one class) can cause models to be biased.
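In pandas this is a single call, sketched here on a hypothetical ChestPainType column:

```python
import pandas as pd

# Hypothetical ChestPainType values
s = pd.Series(["ASY", "ASY", "ATA", "NAP", "ASY", "ATA"])

counts = s.value_counts()                 # absolute counts per category
freqs  = s.value_counts(normalize=True)   # relative frequencies
print(counts["ASY"], round(freqs["ASY"], 2))  # ASY appears 3 times (50%)
```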

Value Counts
๐Ÿ’พ The Heart Failure Dataset at a Glance

50 patients (sample), 11 features, binary target. Here are the first rows:

Features
Numerical: Age, RestingBP, Cholesterol, MaxHR, Oldpeak  |  Categorical: Gender, ChestPainType, FastingBS, RestingECG, ExerciseAngina  |  Target: HeartDisease (0/1)

Data Quality Assessment

Five dimensions of data quality.

๐Ÿ” Spot the issues

Data Cleaning

Fix quality problems step by step.

๐Ÿงน Interactive Cleaning Simulator

Missing Value Imputation

Replace missing values. Method affects distributions.

๐Ÿงช Try strategies

๐Ÿ“ˆ Distribution: Before vs After

โš ๏ธ Mean reduces variance. Zero introduces bias. Remove loses data.

Feature Engineering

The art of transforming raw data into features that help models learn better.

๐Ÿ’ก What is Feature Engineering?

Feature engineering transforms raw data into better inputs for ML models. Think of it as translating domain knowledge into numbers a model can understand. A doctor doesn't just look at raw blood pressure โ€” they compute pulse pressure, MAP, and risk categories. Feature engineering teaches models to do the same.

โ“ Why It Matters
๐ŸŽฏ Selection
๐Ÿงฎ Create Features
๐Ÿ”„ Transforms
๐Ÿ“Š Correlation
๐Ÿ”— Logic Combos

๐ŸŽฏ The Goal

Better features = better models. Even the best algorithm can't learn from bad features.

Consider predicting heart disease with raw data: age, blood pressure, cholesterol. A doctor would also compute:

  • Pulse Pressure = Systolic โˆ’ Diastolic โ†’ arterial stiffness
  • HR % of Max = MaxHR / (220 โˆ’ Age) โ†’ cardiac fitness
  • BMI = weight / heightยฒ โ†’ body composition

These derived features encode expert knowledge that raw numbers alone can't express.
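A pandas sketch of these derived features, on hypothetical raw measurements (the column names are illustrative, not the dataset's):

```python
import pandas as pd

# Hypothetical raw measurements for two patients
df = pd.DataFrame({
    "Age": [50, 65], "MaxHR": [150, 120],
    "Systolic": [130, 150], "Diastolic": [80, 95],
    "WeightKg": [80.0, 90.0], "HeightM": [1.75, 1.70],
})

# Derived features encoding the doctor's domain knowledge
df["PulsePressure"] = df["Systolic"] - df["Diastolic"]   # arterial stiffness
df["HRPctOfMax"]    = df["MaxHR"] / (220 - df["Age"])    # cardiac fitness
df["BMI"]           = df["WeightKg"] / df["HeightM"] ** 2  # body composition
print(df[["PulsePressure", "HRPctOfMax", "BMI"]].round(2))
```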

๐Ÿ“Š Impact on Accuracy

Adding well-chosen features improves accuracy, but piling on too many irrelevant ones hurts — the model starts overfitting to noise.
๐Ÿ”‘ Four Pillars of Feature Engineering
1. Selection
Remove irrelevant/redundant features โ†’ less noise, less overfitting
2. Creation
Combine features (A+B, Aร—B, A/B) โ†’ capture hidden patterns
3. Transformation
Apply math (log, โˆšx) โ†’ fix skewed distributions
4. Domain Knowledge
Use expert insight (BMI, MAP) โ†’ meaningful derived features
๐Ÿง  Quiz: Which is a created feature (not raw data)?
๐ŸŽฏ Feature Selection โ€” Choosing the Right Inputs

Not all features help. Some are irrelevant (patient ID), some redundant (height in cm AND inches), some add noise.

| Approach | How | Example |
|---|---|---|
| Filter | Rank by statistics | Correlation, chi-squared |
| Wrapper | Try subsets, evaluate | Forward/backward selection |
| Embedded | Model learns importance | Lasso, tree feature importance |
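The filter approach can be sketched in pandas by ranking features by absolute correlation with the target; the tiny dataset below is made up so that an irrelevant ID column scores near zero:

```python
import pandas as pd

# Made-up dataset: MaxHR is informative, PatientID is irrelevant
df = pd.DataFrame({
    "MaxHR":        [170, 160, 150, 120, 110, 100],
    "Cholesterol":  [210, 250, 190, 230, 220, 240],
    "PatientID":    [4, 1, 6, 3, 5, 2],
    "HeartDisease": [0, 0, 0, 1, 1, 1],
})

# Filter: rank each feature by |correlation| with the target
scores = df.drop(columns="HeartDisease").corrwith(df["HeartDisease"]).abs()
print(scores.sort_values(ascending=False))  # MaxHR ranks highest
```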

๐Ÿงช Interactive: Build your feature set for Heart Disease prediction

Click features to toggle. Watch how accuracy and complexity change. Try to find the optimal subset!

๐Ÿง  Quiz: What happens if you use ALL features including irrelevant ones?
๐Ÿงฎ Creating New Features โ€” Combining Existing Data

You can build new features using arithmetic operations on existing ones. This helps models see relationships hidden in individual features.

| Operation | Formula | Use Case |
|---|---|---|
| Sum | A + B | Total score, combined effect |
| Difference | A − B | Pulse pressure, profit margin |
| Ratio | A / B | BMI, price-per-unit |
| Product | A × B | Interaction effect |
| Polynomial | A² | Non-linear relationships |

๐Ÿฉบ Live Example: Blood Pressure

Move the sliders and watch derived medical features update in real time.

Example (A = systolic 120, B = diastolic 80):
A + B (sum) = 200 · A − B (pulse pressure) = 40 · A / B (ratio) = 1.50 · MAP ≈ 93

๐Ÿ“ˆ Feature Values Visualized

๐Ÿฅ What Do These Mean?

Pulse Pressure (Aโˆ’B): Normal 30โ€“50 mmHg. High โ†’ stiff arteries.
MAP (Mean Arterial Pressure = B + โ…“(Aโˆ’B)): Normal 70โ€“100. Organs need MAP > 60.
Ratio (A/B): ~1.5 is normal. Higher โ†’ isolated systolic hypertension.

๐Ÿ”„ Feature Transformations โ€” Changing the Shape of Data

Transformations change a feature's distribution shape. This matters because many algorithms assume roughly symmetric distributions.

| Transform | Effect | When to Use |
|---|---|---|
| log(x) | Compresses large values | Right-skewed data (income, population) |
| √x | Mild compression | Count data, moderate skew |
| x² | Amplifies differences | Need to emphasize large values |
| 1/x | Inverts scale | Rate/frequency → period |
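A small NumPy sketch of the log transform on made-up, income-like right-skewed values, using a simple moment-based skewness measure:

```python
import numpy as np

# Right-skewed values (made up): a long tail to the right
x = np.array([20, 25, 30, 35, 40, 200], dtype=float)
logged = np.log(x)

def skew(a):
    # Moment-based skewness (population form); > 0 means right-skewed
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

# The log compresses the large values, pulling the tail in
print(round(skew(x), 2), round(skew(logged), 2))
```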

๐Ÿงช Transform Explorer

๐Ÿ“Š Before vs After

๐Ÿ“Š Feature Correlation โ€” Finding Redundancy

Correlation (r) measures how two features move together: r = +1 (same direction), r = โˆ’1 (opposite), r = 0 (no relationship). Highly correlated features are redundant โ€” keeping both wastes model capacity.
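Redundancy shows up directly in `df.corr()`; here is a sketch with a deliberately redundant pair (height in cm and in inches, made-up values):

```python
import pandas as pd

# HeightCm and HeightIn carry the same information in different units
df = pd.DataFrame({
    "HeightCm": [160, 170, 180, 190],
    "HeightIn": [63.0, 66.9, 70.9, 74.8],
    "Age":      [25, 40, 33, 55],
})

corr = df.corr()  # pairwise Pearson correlation matrix
print(corr.round(2))  # r ≈ 1 between the two height columns → drop one
```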

๐Ÿ”ฅ Correlation Heatmap

Click any cell to see the scatter plot for that pair.

๐Ÿ“ˆ Click a cell to explore

Click any cell in the heatmap.
๐Ÿ”— Logical Feature Combinations โ€” Interaction Effects

For binary features (0/1), you can create new features using logical operations. This captures interaction effects โ€” e.g., "High BP AND Smoker" is a much stronger heart disease predictor than either alone.

Why? A linear model computes wโ‚xโ‚ + wโ‚‚xโ‚‚. It cannot learn that xโ‚=1 AND xโ‚‚=1 has a special combined effect. Adding xโ‚ยทxโ‚‚ as a new feature solves this.
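For 0/1 features, the AND interaction is simply the elementwise product, as a quick NumPy sketch shows:

```python
import numpy as np

# Binary risk factors (made up): high blood pressure, smoker
high_bp = np.array([0, 0, 1, 1])
smoker  = np.array([0, 1, 0, 1])

# For 0/1 features, x₁·x₂ is exactly the AND truth table
both = high_bp * smoker
print(both)  # [0 0 0 1] — fires only when both risks are present
```

A linear model given `both` as a third feature can now assign the combined case its own weight.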

๐ŸŽฎ Toggle Risk Factors



Full Truth Table

๐Ÿง  Quiz: Can a linear model (wโ‚xโ‚ + wโ‚‚xโ‚‚) represent AND logic?

Feature Scaling

Different scales distort distance-based algorithms.

๐Ÿ“ Why Scale?
๐Ÿ”„ Calculator
๐Ÿ“Š Compare
๐ŸŽฏ KNN Demo

Distance Without Scaling

Pressure (~1013) dominates temperature (~20).
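A NumPy sketch of this effect with made-up (temperature, pressure) readings — the nearest neighbour changes once each column is min-max scaled:

```python
import numpy as np

# Made-up (temperature °C, pressure hPa) readings; pts[0] is the query point
pts = np.array([[20.0, 1000.0],
                [29.0, 1002.0],   # similar pressure, much warmer
                [21.0, 1030.0],   # similar temperature, much higher pressure
                [30.0,  980.0]])  # extra reading that sets the column ranges

# Raw Euclidean distances from the query to the two candidates
d_raw = np.linalg.norm(pts[1:3] - pts[0], axis=1)

# Min-max scale each column to [0, 1], then re-measure the same distances
scaled = (pts - pts.min(axis=0)) / (pts.max(axis=0) - pts.min(axis=0))
d_scaled = np.linalg.norm(scaled[1:3] - scaled[0], axis=1)

# Raw: pressure decides the winner. Scaled: the nearest neighbour flips.
print(d_raw.round(2), d_scaled.round(2))
```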

Which algorithms need scaling?

| Algorithm | Scaling? | Why |
|---|---|---|
| KNN | ✅ Yes | Distance-based |
| SVM | ✅ Yes | Margin depends on scale |
| Linear Reg. | ✅ Yes | Gradient descent |
| Neural Nets | ✅ Yes | Activations |
| Decision Tree | ❌ No | Threshold splits |
| Random Forest | ❌ No | Tree-based |

๐Ÿ”„ Interactive Scaler Calculator

Min-Max → [0, 1]: (x − min)/(max − min)
Z-Score → mean 0, std 1: (x − μ)/σ

Visual Position

๐Ÿ“Š Before vs After โ€” All Features on Same Scale

Key observation: Before scaling, features have vastly different ranges (Age: ~30โ€“77, RestBP: ~90โ€“180, Chol: ~100โ€“400). After scaling, all features share the same range, so no single feature dominates distance calculations. Compare how the dots spread differently on the left vs. right!

๐ŸŽฏ KNN: Scaling Impact

See how scaling changes which neighbors are closest.

Without Scaling

With Min-Max Scaling

Categorical Feature Encoding

ML models need numbers. How you encode matters.

๐Ÿ”ข One-Hot
๐Ÿ“ Ordinal
๐Ÿท๏ธ Label
โš–๏ธ Compare
๐Ÿง  Quiz

๐Ÿ”ข One-Hot Encoding

Each category โ†’ binary column. High cardinality โ†’ many sparse columns.

Visual: Equidistant

All categories equidistant (โˆš2). No ordering.

๐Ÿ“ Ordinal Encoding

Number Line

โš ๏ธ Don't use for nominal data!

๐Ÿท๏ธ Label Encoding

โš ๏ธ Arbitrary numbers โ†’ false ordering!

When to use what?

| Method | Best For | Risk |
|---|---|---|
| One-Hot | Nominal | High dimensionality |
| Ordinal | Ordinal | Assumes equal spacing |
| Label | Trees / binary | False order |
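A pandas sketch of the two main encodings, on hypothetical size and color columns:

```python
import pandas as pd

sizes  = pd.Series(["S", "M", "L", "M"])         # ordinal: has an order
colors = pd.Series(["red", "blue", "red"])       # nominal: no order

# One-hot for nominal data: one binary column per category
onehot = pd.get_dummies(colors, prefix="color")

# Explicit mapping for ordinal data preserves the natural order
size_order = {"S": 0, "M": 1, "L": 2}
ordinal = sizes.map(size_order)

print(onehot.columns.tolist(), ordinal.tolist())
```

Note the ordinal mapping is written by hand: the model should never be left to guess that S < M < L.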

โš–๏ธ Distance Comparison

๐Ÿง  Encoding Quiz