๐Ÿ”ง Data Preparation & Feature Engineering

Interactive learning companion โ€” Hochschule Karlsruhe

ML Project Workflow

A typical ML project follows interconnected steps. Click each phase to learn more.

ML Workflow

Diagram modified from: successfactory management coaching gmbh

๐Ÿง  Quick check: Which step comes right before model training?

Data Types

Understanding data types is the first step in any data analysis.

๐ŸŒณ Data Types Hierarchy

๐Ÿ“Š Quantitative (Numerical)

Can be expressed as numerical values, suitable for statistical analysis.

Continuous: Can take any value within a range.

Temperature 🌡️ · Voltage ⚡ · Height 📏 · Speed 🚗 · Blood Pressure 💉 · Stock Price 📈

Discrete: Can take only distinct, countable values.

Dice Roll 🎲 · Students 🧑‍🎓 · Cars in Lot 🅿️ · Complaints/Day 📝 · Website Clicks 🖱️

๐Ÿท๏ธ Qualitative (Categorical)

Can't be measured numerically โ€” sorted by category, not by number.

Nominal: Categories with no specific order.

Blood Type 🩸 · Color 🎨 · Car Brand 🚘 · Job Title 👔 · Marital Status 💍 · Eye Color 👁️

Ordinal: Has a natural order or ranking.

Education 🎓 · Rating ⭐ · Satisfaction 📊 · T-Shirt Size 👕 · Pain Level 🏥 · Military Rank 🎖️
๐Ÿ’ก Why does this matter? Data type determines which visualizations, statistics, and encoding methods you can use. Treating ordinal data as nominal (or vice versa) leads to incorrect models.

๐ŸŽฏ Classify these features

Exploratory Data Analysis (EDA)

EDA means getting to know your data before modelling. It guides cleaning, feature engineering, and gives you intuition about patterns and problems in the dataset. We use the Heart Failure Prediction Dataset (50 patients, 11 features) as our running example.

๐Ÿ“Š Descriptive Statistics โ€” df.describe()

This is always your first look at numerical features. Select a feature to see its full summary statistics โ€” count, mean, standard deviation, min, quartiles, and max. The histogram below shows the shape of the distribution, with a red line marking the mean:
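The same first look can be sketched in pandas, using a tiny made-up sample in place of the full dataset:

```python
import pandas as pd

# Hypothetical mini-sample in the spirit of the heart failure dataset
df = pd.DataFrame({
    "Age":   [54, 61, 40, 67, 49, 58],
    "MaxHR": [150, 128, 172, 108, 160, 140],
})

# count, mean, std, min, quartiles (25%/50%/75%), max per numerical column
summary = df.describe()
print(summary)
```

`summary.loc["mean", "Age"]` then gives the mean of Age directly; the quartile rows are exactly the values a box plot is built from.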

Distribution Histogram
๐Ÿ“ฆ Interactive Box Plot with Outlier Detection

A box plot visualizes the five-number summary (min, Q1, median, Q3, max) and identifies outliers using the IQR rule. Any point below Q1 โˆ’ 1.5ร—IQR or above Q3 + 1.5ร—IQR is flagged as an outlier.

IQR = Q3 โˆ’ Q1
Lower fence = Q1 โˆ’ 1.5 ร— IQR
Upper fence = Q3 + 1.5 ร— IQR // points outside โ†’ outliers
๐Ÿ’ก
Why it matters: Outliers can heavily skew mean-based imputation and distance-based models (KNN, SVM). Box plots help you decide: is this a data error (remove), a rare but valid case (keep), or an extreme value to cap (winsorize)?
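The IQR rule above takes only a few lines of NumPy; here is a minimal sketch on made-up cholesterol-like values with one extreme point:

```python
import numpy as np

# Illustrative values (made up); 600 is the planted extreme point
x = np.array([180, 200, 210, 220, 240, 260, 600], dtype=float)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

outliers = x[(x < lower) | (x > upper)]
print(outliers)  # only the 600 value is flagged
```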
๐Ÿ”ต Scatter Plot โ€” Correlation Explorer

Scatter plots show the relationship between two numerical features. Color points by the target variable to reveal patterns. A Pearson correlation close to ยฑ1 means a strong linear relationship; near 0 means no linear relationship.

๐Ÿ“
Pearson correlation formula: r = ฮฃ(xแตข โˆ’ ฮผโ‚“)(yแตข โˆ’ ฮผแตง) / (n ร— ฯƒโ‚“ ร— ฯƒแตง). Values: +1 = perfect positive, 0 = none, โˆ’1 = perfect negative. Try MaxHR vs Age โ€” you should see a moderate negative correlation.
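The formula can be checked directly in NumPy on made-up Age/MaxHR pairs (`np.std` defaults to the population form used above, so the hand-rolled value matches `np.corrcoef` exactly):

```python
import numpy as np

# Hypothetical Age / MaxHR pairs; constructed to show a negative relationship
age   = np.array([30, 40, 50, 60, 70], dtype=float)
maxhr = np.array([185, 176, 168, 155, 148], dtype=float)

# r = Σ(xᵢ − μₓ)(yᵢ − μᵧ) / (n · σₓ · σᵧ)
r = np.sum((age - age.mean()) * (maxhr - maxhr.mean())) \
    / (len(age) * age.std() * maxhr.std())
print(round(r, 3))  # → -0.996, a strong negative linear relationship
```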
๐Ÿ“Š Segmentation โ€” Categorical ร— Numerical

How does a numerical feature differ across categories? Grouped box plots are ideal for this โ€” they show the median, quartiles, and outliers for each group side by side. If the boxes barely overlap, the categorical feature is likely predictive.

Grouped Box Plots โ€” compare distributions across categories
๐Ÿ’ก
Key insight: Try "MaxHR grouped by HeartDisease" โ€” the boxes should have minimal overlap, showing patients with heart disease tend to have lower maximum heart rates. Compare with "Cholesterol grouped by HeartDisease" โ€” heavily overlapping boxes mean cholesterol alone is a weaker predictor.
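The same side-by-side comparison can be made numerically with a pandas `groupby`, sketched here on a made-up sample:

```python
import pandas as pd

# Made-up sample: MaxHR grouped by HeartDisease (0 = no, 1 = yes)
df = pd.DataFrame({
    "MaxHR":        [170, 165, 158, 150, 120, 112, 130, 105],
    "HeartDisease": [0,   0,   0,   0,   1,   1,   1,   1],
})

# Per-group median — the line in the middle of each box
medians = df.groupby("HeartDisease")["MaxHR"].median()
print(medians)  # the disease group sits well below the healthy group
```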
๐Ÿ“‹ Categorical Feature Summary

For categorical features, we look at value counts and frequencies instead of mean/std. An imbalanced category distribution (e.g., 90% one class) can cause models to be biased.
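In pandas this is a single call, sketched here on a hypothetical ChestPainType column:

```python
import pandas as pd

# Hypothetical ChestPainType values
s = pd.Series(["ASY", "ASY", "ATA", "NAP", "ASY", "ATA"])

counts = s.value_counts()                 # absolute counts per category
freqs  = s.value_counts(normalize=True)   # relative frequencies
print(counts["ASY"], round(freqs["ASY"], 2))  # ASY appears 3 times (50%)
```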

Value Counts
๐Ÿ’พ The Heart Failure Dataset at a Glance

50 patients (sample), 11 features, binary target. Here are the first rows:

Features
Numerical: Age, RestingBP, Cholesterol, MaxHR, Oldpeak  |  Categorical: Gender, ChestPainType, FastingBS, RestingECG, ExerciseAngina  |  Target: HeartDisease (0/1)

Data Quality Assessment

Five dimensions of data quality.

๐Ÿ” Spot the issues

Data Cleaning

Fix quality problems step by step.

๐Ÿงน Interactive Cleaning Simulator

Missing Value Imputation

Replace missing values. Method affects distributions.

๐Ÿงช Try strategies

๐Ÿ“ˆ Distribution: Before vs After

โš ๏ธ Mean reduces variance. Zero introduces bias. Remove loses data.

Feature Engineering

The art of transforming raw data into features that help models learn better.

๐Ÿ’ก What is Feature Engineering?

Feature engineering transforms raw data into better inputs for ML models. Think of it as translating domain knowledge into numbers a model can understand. A doctor doesn't just look at raw blood pressure โ€” they compute pulse pressure, MAP, and risk categories. Feature engineering teaches models to do the same.

โ“ Why It Matters
๐ŸŽฏ Selection
๐Ÿงฎ Create Features
๐Ÿ”„ Transforms
๐Ÿ“Š Correlation
๐Ÿ”— Logic Combos

๐ŸŽฏ The Goal

Better features = better models. Even the best algorithm can't learn from bad features.

Consider predicting heart disease with raw data: age, blood pressure, cholesterol. A doctor would also compute:

  • Pulse Pressure = Systolic โˆ’ Diastolic โ†’ arterial stiffness
  • HR % of Max = MaxHR / (220 โˆ’ Age) โ†’ cardiac fitness
  • BMI = weight / heightยฒ โ†’ body composition

These derived features encode expert knowledge that raw numbers alone can't express.
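A pandas sketch of these derived features, on hypothetical raw measurements (the column names are illustrative, not the dataset's):

```python
import pandas as pd

# Hypothetical raw measurements for two patients
df = pd.DataFrame({
    "Age": [50, 65], "MaxHR": [150, 120],
    "Systolic": [130, 150], "Diastolic": [80, 95],
    "WeightKg": [80.0, 90.0], "HeightM": [1.75, 1.70],
})

# Derived features encoding the doctor's domain knowledge
df["PulsePressure"] = df["Systolic"] - df["Diastolic"]   # arterial stiffness
df["HRPctOfMax"]    = df["MaxHR"] / (220 - df["Age"])    # cardiac fitness
df["BMI"]           = df["WeightKg"] / df["HeightM"] ** 2  # body composition
print(df[["PulsePressure", "HRPctOfMax", "BMI"]].round(2))
```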

๐Ÿ“Š Impact on Accuracy

Adding well-chosen features improves accuracy, but piling on too many irrelevant ones hurts — the model starts overfitting to noise.
๐Ÿ”‘ Four Pillars of Feature Engineering
1. Selection
Remove irrelevant/redundant features โ†’ less noise, less overfitting
2. Creation
Combine features (A+B, Aร—B, A/B) โ†’ capture hidden patterns
3. Transformation
Apply math (log, โˆšx) โ†’ fix skewed distributions
4. Domain Knowledge
Use expert insight (BMI, MAP) โ†’ meaningful derived features
๐Ÿง  Quiz: Which is a created feature (not raw data)?
๐ŸŽฏ Feature Selection โ€” Choosing the Right Inputs

Not all features help. Some are irrelevant (patient ID), some redundant (height in cm AND inches), some add noise.

| Approach | How | Example |
|---|---|---|
| Filter | Rank by statistics | Correlation, chi-squared |
| Wrapper | Try subsets, evaluate | Forward/backward selection |
| Embedded | Model learns importance | Lasso, tree feature importance |
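The filter approach can be sketched in pandas by ranking features by absolute correlation with the target; the tiny dataset below is made up so that an irrelevant ID column scores near zero:

```python
import pandas as pd

# Made-up dataset: MaxHR is informative, PatientID is irrelevant
df = pd.DataFrame({
    "MaxHR":        [170, 160, 150, 120, 110, 100],
    "Cholesterol":  [210, 250, 190, 230, 220, 240],
    "PatientID":    [4, 1, 6, 3, 5, 2],
    "HeartDisease": [0, 0, 0, 1, 1, 1],
})

# Filter: rank each feature by |correlation| with the target
scores = df.drop(columns="HeartDisease").corrwith(df["HeartDisease"]).abs()
print(scores.sort_values(ascending=False))  # MaxHR ranks highest
```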

๐Ÿงช Interactive: Build your feature set for Heart Disease prediction

Click features to toggle. Watch how accuracy and complexity change. Try to find the optimal subset!

๐Ÿง  Quiz: What happens if you use ALL features including irrelevant ones?
๐Ÿงฎ Creating New Features โ€” Combining Existing Data

You can build new features using arithmetic operations on existing ones. This helps models see relationships hidden in individual features.

| Operation | Formula | Use Case |
|---|---|---|
| Sum | A + B | Total score, combined effect |
| Difference | A − B | Pulse pressure, profit margin |
| Ratio | A / B | BMI, price-per-unit |
| Product | A × B | Interaction effect |
| Polynomial | A² | Non-linear relationships |

๐Ÿฉบ Live Example: Blood Pressure

Move the sliders and watch derived medical features update in real time.

Example (A = systolic 120, B = diastolic 80):
A + B (sum) = 200 · A − B (pulse pressure) = 40 · A / B (ratio) = 1.50 · MAP ≈ 93

๐Ÿ“ˆ Feature Values Visualized

๐Ÿฅ What Do These Mean?

Pulse Pressure (Aโˆ’B): Normal 30โ€“50 mmHg. High โ†’ stiff arteries.
MAP (Mean Arterial Pressure = B + โ…“(Aโˆ’B)): Normal 70โ€“100. Organs need MAP > 60.
Ratio (A/B): ~1.5 is normal. Higher โ†’ isolated systolic hypertension.

๐Ÿ”„ Feature Transformations โ€” Changing the Shape of Data

Transformations change a feature's distribution shape. This matters because many algorithms assume roughly symmetric distributions.

| Transform | Effect | When to Use |
|---|---|---|
| log(x) | Compresses large values | Right-skewed data (income, population) |
| √x | Mild compression | Count data, moderate skew |
| x² | Amplifies differences | Need to emphasize large values |
| 1/x | Inverts scale | Rate/frequency → period |
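A small NumPy sketch of the log transform on made-up, income-like right-skewed values, using a simple moment-based skewness measure:

```python
import numpy as np

# Right-skewed values (made up): a long tail to the right
x = np.array([20, 25, 30, 35, 40, 200], dtype=float)
logged = np.log(x)

def skew(a):
    # Moment-based skewness (population form); > 0 means right-skewed
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

# The log compresses the large values, pulling the tail in
print(round(skew(x), 2), round(skew(logged), 2))
```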

๐Ÿงช Transform Explorer

๐Ÿ“Š Before vs After

๐Ÿ“Š Feature Correlation โ€” Finding Redundancy

Correlation (r) measures how two features move together: r = +1 (same direction), r = โˆ’1 (opposite), r = 0 (no relationship). Highly correlated features are redundant โ€” keeping both wastes model capacity.
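Redundancy shows up directly in `df.corr()`; here is a sketch with a deliberately redundant pair (height in cm and in inches, made-up values):

```python
import pandas as pd

# HeightCm and HeightIn carry the same information in different units
df = pd.DataFrame({
    "HeightCm": [160, 170, 180, 190],
    "HeightIn": [63.0, 66.9, 70.9, 74.8],
    "Age":      [25, 40, 33, 55],
})

corr = df.corr()  # pairwise Pearson correlation matrix
print(corr.round(2))  # r ≈ 1 between the two height columns → drop one
```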

๐Ÿ”ฅ Correlation Heatmap

Click any cell to see the scatter plot for that pair.

๐Ÿ“ˆ Click a cell to explore

Click any cell in the heatmap.
๐Ÿ”— Logical Feature Combinations โ€” Interaction Effects

For binary features (0/1), you can create new features using logical operations. This captures interaction effects โ€” e.g., "High BP AND Smoker" is a much stronger heart disease predictor than either alone.

Why? A linear model computes wโ‚xโ‚ + wโ‚‚xโ‚‚. It cannot learn that xโ‚=1 AND xโ‚‚=1 has a special combined effect. Adding xโ‚ยทxโ‚‚ as a new feature solves this.
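For 0/1 features, the AND interaction is simply the elementwise product, as a quick NumPy sketch shows:

```python
import numpy as np

# Binary risk factors (made up): high blood pressure, smoker
high_bp = np.array([0, 0, 1, 1])
smoker  = np.array([0, 1, 0, 1])

# For 0/1 features, x₁·x₂ is exactly the AND truth table
both = high_bp * smoker
print(both)  # [0 0 0 1] — fires only when both risks are present
```

A linear model given `both` as a third feature can now assign the combined case its own weight.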

๐ŸŽฎ Toggle Risk Factors



Full Truth Table

๐Ÿง  Quiz: Can a linear model (wโ‚xโ‚ + wโ‚‚xโ‚‚) represent AND logic?

Feature Scaling

Different scales distort distance-based algorithms.

๐Ÿ“ Why Scale?
๐Ÿ”„ Calculator
๐Ÿ“Š Compare
๐ŸŽฏ KNN Demo

Distance Without Scaling

Pressure (~1013) dominates temperature (~20).
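A NumPy sketch of this effect with made-up (temperature, pressure) readings — the nearest neighbour changes once each column is min-max scaled:

```python
import numpy as np

# Made-up (temperature °C, pressure hPa) readings; pts[0] is the query point
pts = np.array([[20.0, 1000.0],
                [29.0, 1002.0],   # similar pressure, much warmer
                [21.0, 1030.0],   # similar temperature, much higher pressure
                [30.0,  980.0]])  # extra reading that sets the column ranges

# Raw Euclidean distances from the query to the two candidates
d_raw = np.linalg.norm(pts[1:3] - pts[0], axis=1)

# Min-max scale each column to [0, 1], then re-measure the same distances
scaled = (pts - pts.min(axis=0)) / (pts.max(axis=0) - pts.min(axis=0))
d_scaled = np.linalg.norm(scaled[1:3] - scaled[0], axis=1)

# Raw: pressure decides the winner. Scaled: the nearest neighbour flips.
print(d_raw.round(2), d_scaled.round(2))
```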

Which algorithms need scaling?

| Algorithm | Scaling? | Why |
|---|---|---|
| KNN | ✅ Yes | Distance-based |
| SVM | ✅ Yes | Margin depends on scale |
| Linear Reg. | ✅ Yes | Gradient descent |
| Neural Nets | ✅ Yes | Activations |
| Decision Tree | ❌ No | Threshold splits |
| Random Forest | ❌ No | Tree-based |

๐Ÿ”„ Interactive Scaler Calculator

Min-Max → [0, 1]: (x − min)/(max − min)
Z-Score → mean 0, std 1: (x − μ)/σ

Visual Position

๐Ÿ“Š Before vs After โ€” All Features on Same Scale

Key observation: Before scaling, features have vastly different ranges (Age: ~30โ€“77, RestBP: ~90โ€“180, Chol: ~100โ€“400). After scaling, all features share the same range, so no single feature dominates distance calculations. Compare how the dots spread differently on the left vs. right!

๐ŸŽฏ KNN: Scaling Impact

See how scaling changes which neighbors are closest.

Without Scaling

With Min-Max Scaling

Categorical Feature Encoding

ML models need numbers. How you encode matters.

๐Ÿ”ข One-Hot
๐Ÿ“ Ordinal
๐Ÿท๏ธ Label
โš–๏ธ Compare
๐Ÿง  Quiz

๐Ÿ”ข One-Hot Encoding

Each category โ†’ binary column. High cardinality โ†’ many sparse columns.

Visual: Equidistant

All categories equidistant (โˆš2). No ordering.

๐Ÿ“ Ordinal Encoding

Number Line

โš ๏ธ Don't use for nominal data!

๐Ÿท๏ธ Label Encoding

โš ๏ธ Arbitrary numbers โ†’ false ordering!

When to use what?

| Method | Best For | Risk |
|---|---|---|
| One-Hot | Nominal | High dimensionality |
| Ordinal | Ordinal | Assumes equal spacing |
| Label | Trees / binary | False order |
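A pandas sketch of the two main encodings, on hypothetical size and color columns:

```python
import pandas as pd

sizes  = pd.Series(["S", "M", "L", "M"])         # ordinal: has an order
colors = pd.Series(["red", "blue", "red"])       # nominal: no order

# One-hot for nominal data: one binary column per category
onehot = pd.get_dummies(colors, prefix="color")

# Explicit mapping for ordinal data preserves the natural order
size_order = {"S": 0, "M": 1, "L": 2}
ordinal = sizes.map(size_order)

print(onehot.columns.tolist(), ordinal.tolist())
```

Note the ordinal mapping is written by hand: the model should never be left to guess that S < M < L.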

โš–๏ธ Distance Comparison

๐Ÿง  Encoding Quiz