ML Project Workflow
A typical ML project follows interconnected steps. Click each phase to learn more.
Source: successfactory management coaching gmbh (modified)
Data Types
Understanding data types is the first step in any data analysis.
🌳 Data Types Hierarchy
📊 Quantitative (Numerical)
Can be expressed in numerical values, suitable for statistical analysis.
Continuous: Can take any value within a range.
Discrete: Can take only finite, countable values.
🏷️ Qualitative (Categorical)
Can't be measured numerically; sorted by category, not by number.
Nominal: Categories with no specific order.
Ordinal: Has a natural order or ranking.
🎯 Classify these features
Exploratory Data Analysis (EDA)
EDA means getting to know your data before modelling. It guides cleaning, feature engineering, and gives you intuition about patterns and problems in the dataset. We use the Heart Failure Prediction Dataset (50 patients, 11 features) as our running example.
This is always your first look at numerical features. Select a feature to see its full summary statistics: count, mean, standard deviation, min, quartiles, and max. The histogram below shows the shape of the distribution, with a red line marking the mean:
A box plot visualizes the five-number summary (min, Q1, median, Q3, max) and identifies outliers using the IQR rule. Any point below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is flagged as an outlier.
Lower fence = Q1 − 1.5 × IQR
Upper fence = Q3 + 1.5 × IQR // points outside → outliers
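The fences above can be sketched directly in code; this is a minimal stdlib-only version, and the example ages are made up (note that different quartile conventions can shift the fences slightly):

```python
# IQR outlier rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
from statistics import quantiles

def iqr_fences(values):
    """Return (lower_fence, upper_fence) per the 1.5*IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(values):
    lo, hi = iqr_fences(values)
    return [v for v in values if v < lo or v > hi]

ages = [40, 42, 45, 47, 50, 52, 55, 58, 60, 95]  # 95 looks suspicious
print(outliers(ages))  # [95]
```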
Scatter plots show the relationship between two numerical features. Color points by the target variable to reveal patterns. A Pearson correlation close to ±1 means a strong linear relationship; near 0 means no linear relationship.
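Pearson's r can be computed straight from its definition; a stdlib-only sketch with made-up height/weight data:

```python
# Pearson correlation: covariance divided by the product of standard deviations.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

heights = [150, 160, 170, 180, 190]
weights = [55, 62, 70, 78, 85]   # nearly linear in height
print(round(pearson_r(heights, weights), 3))
```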
How does a numerical feature differ across categories? Grouped box plots are ideal for this: they show the median, quartiles, and outliers for each group side by side. If the boxes barely overlap, the categorical feature is likely predictive.
For categorical features, we look at value counts and frequencies instead of mean/std. An imbalanced category distribution (e.g., 90% one class) can cause models to be biased.
50 patients (sample), 11 features, binary target. Here are the first rows:
Data Quality Assessment
Five dimensions of data quality.
🔍 Spot the issues
Data Cleaning
Fix quality problems step by step.
🧹 Interactive Cleaning Simulator
Missing Value Imputation
Replace missing values. Method affects distributions.
🧪 Try strategies
📊 Distribution: Before vs After
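The before/after effect can be seen in a few lines; this stdlib sketch uses made-up cholesterol values (not the actual dataset) and shows how mean imputation shrinks the spread:

```python
# Mean vs. median imputation; None marks a missing value.
from statistics import mean, median, stdev

def impute(values, strategy="mean"):
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

cholesterol = [180, 200, None, 220, None, 300, 210]
filled = impute(cholesterol, "mean")

observed = [v for v in cholesterol if v is not None]
# Filling with the mean adds points with zero deviation, so stdev drops.
print(stdev(observed), stdev(filled))
```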
Feature Engineering
The art of transforming raw data into features that help models learn better.
💡 What is Feature Engineering?
Feature engineering transforms raw data into better inputs for ML models. Think of it as translating domain knowledge into numbers a model can understand. A doctor doesn't just look at raw blood pressure: they compute pulse pressure, MAP, and risk categories. Feature engineering teaches models to do the same.
🎯 The Goal
Better features = better models. Even the best algorithm can't learn from bad features.
Consider predicting heart disease with raw data: age, blood pressure, cholesterol. A doctor would also compute:
- Pulse Pressure = Systolic − Diastolic → arterial stiffness
- HR % of Max = MaxHR / (220 − Age) → cardiac fitness
- BMI = weight / height² → body composition
These derived features encode expert knowledge that raw numbers alone can't express.
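The three derived features above can be sketched as plain functions; the formulas are the standard clinical ones quoted in the text, and the example inputs are made up:

```python
# Derived medical features from raw measurements.
def pulse_pressure(systolic, diastolic):
    return systolic - diastolic          # mmHg; arterial stiffness proxy

def hr_pct_of_max(max_hr, age):
    return max_hr / (220 - age)          # fraction of age-predicted max HR

def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2     # body mass index

print(pulse_pressure(120, 80))           # 40 mmHg, in the normal range
print(round(hr_pct_of_max(150, 60), 2))
print(round(bmi(70, 1.75), 1))
```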
📈 Impact on Accuracy
📚 Four Pillars of Feature Engineering
Remove irrelevant/redundant features → less noise, less overfitting
Combine features (A+B, A×B, A/B) → capture hidden patterns
Apply math (log, √x) → fix skewed distributions
Use expert insight (BMI, MAP) → meaningful derived features
🎯 Feature Selection – Choosing the Right Inputs
Not all features help. Some are irrelevant (patient ID), some redundant (height in cm AND inches), some add noise.
| Approach | How | Example |
|---|---|---|
| Filter | Rank by statistics | Correlation, chi-squared |
| Wrapper | Try subsets, evaluate | Forward/backward selection |
| Embedded | Model learns importance | Lasso, tree feature importance |
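The filter approach from the table can be sketched as "rank by absolute correlation with the target, keep the top k"; the tiny feature set here is made up for illustration:

```python
# Filter-style feature selection: score each feature by |Pearson r| vs. target.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_top_k(features, target, k):
    ranked = sorted(features,
                    key=lambda name: abs(pearson_r(features[name], target)),
                    reverse=True)
    return ranked[:k]

target = [0, 0, 1, 1, 1, 0]
features = {
    "age":        [40, 45, 60, 65, 70, 42],   # tracks the target
    "patient_id": [1, 2, 3, 4, 5, 6],         # irrelevant identifier
}
print(select_top_k(features, target, 1))  # ['age']
```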
🧪 Interactive: Build your feature set for Heart Disease prediction
Click features to toggle. Watch how accuracy and complexity change. Try to find the optimal subset!
🧮 Creating New Features – Combining Existing Data
You can build new features using arithmetic operations on existing ones. This helps models see relationships hidden in individual features.
| Operation | Formula | Use Case |
|---|---|---|
| Sum | A + B | Total score, combined effect |
| Difference | A − B | Pulse pressure, profit margin |
| Ratio | A / B | BMI, price-per-unit |
| Product | A × B | Interaction effect |
| Polynomial | A² | Non-linear relationships |
🩺 Live Example: Blood Pressure
Move the sliders and watch derived medical features update in real time.
📊 Feature Values Visualized
🏥 What Do These Mean?
Pulse Pressure (A−B): Normal 30–50 mmHg. High → stiff arteries.
MAP (Mean Arterial Pressure = B + ⅓(A−B)): Normal 70–100. Organs need MAP > 60.
Ratio (A/B): ~1.5 is normal. Higher → isolated systolic hypertension.
🔄 Feature Transformations – Changing the Shape of Data
Transformations change a feature's distribution shape. This matters because many algorithms assume roughly symmetric distributions.
| Transform | Effect | When to Use |
|---|---|---|
| log(x) | Compresses large values | Right-skewed data (income, population) |
| √x | Mild compression | Count data, moderate skew |
| xยฒ | Amplifies differences | Need to emphasize large values |
| 1/x | Inverts scale | Rate/frequency → period |
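The log row from the table can be demonstrated in a few lines; this sketch uses made-up income values and a crude skew proxy (mean − median, positive when right-skewed):

```python
# Log transform compresses large values, pulling mean and median together.
from math import log
from statistics import mean, median

incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 500_000]  # one extreme value

raw_skew = mean(incomes) - median(incomes)      # large and positive
logged = [log(x) for x in incomes]
log_skew = mean(logged) - median(logged)        # much closer to zero

print(raw_skew, log_skew)
```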
🧪 Transform Explorer
📊 Before vs After
🔗 Feature Correlation – Finding Redundancy
Correlation (r) measures how two features move together: r = +1 (same direction), r = −1 (opposite), r = 0 (no relationship). Highly correlated features are redundant; keeping both wastes model capacity.
🔥 Correlation Heatmap
Click any cell to see the scatter plot for that pair.
👆 Click a cell to explore
🔀 Logical Feature Combinations – Interaction Effects
For binary features (0/1), you can create new features using logical operations. This captures interaction effects: e.g., "High BP AND Smoker" is a much stronger heart disease predictor than either alone.
Why? A linear model computes w₁x₁ + w₂x₂. It cannot learn that x₁=1 AND x₂=1 has a special combined effect. Adding x₁·x₂ as a new feature solves this.
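The limitation can be shown numerically; the risk points and hand-picked weights here are made up for illustration:

```python
# A super-additive effect no purely additive score can fit:
# each factor alone adds 10 risk points, but both together add 50.
targets = {(0, 0): 0, (1, 0): 10, (0, 1): 10, (1, 1): 50}

def linear(x1, x2):
    return 10 * x1 + 10 * x2                       # best additive weights

def with_interaction(x1, x2):
    return 10 * x1 + 10 * x2 + 30 * (x1 * x2)      # x1*x2 carries the extra risk

linear_errors = [abs(linear(*x) - y) for x, y in targets.items()]
inter_errors = [abs(with_interaction(*x) - y) for x, y in targets.items()]
print(max(linear_errors), max(inter_errors))       # 30 vs 0
```

The additive model is off by 30 exactly at (1,1); the product term absorbs that gap.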
🎮 Toggle Risk Factors
Full Truth Table
Feature Scaling
Different scales distort distance-based algorithms.
Distance Without Scaling
Which algorithms need scaling?
| Algorithm | Scaling? | Why |
|---|---|---|
| KNN | ✅ Yes | Distance-based |
| SVM | ✅ Yes | Margin depends on scale |
| Linear Reg. | ✅ Yes | Gradient descent |
| Neural Nets | ✅ Yes | Activations |
| Decision Tree | ❌ No | Threshold splits |
| Random Forest | ❌ No | Tree-based |
📏 Interactive Scaler Calculator
Z-Score: (x−μ)/σ
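The calculator's two formulas side by side, a stdlib sketch with made-up ages:

```python
# Z-score: center on the mean, divide by std. Min-max: squash into [0, 1].
from statistics import mean, pstdev

def z_score(x, values):
    return (x - mean(values)) / pstdev(values)

def min_max(x, values):
    lo, hi = min(values), max(values)
    return (x - lo) / (hi - lo)

ages = [29, 40, 50, 55, 60, 66]          # mean is exactly 50
print(round(z_score(50, ages), 2), round(min_max(50, ages), 2))
```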
Visual Position
📊 Before vs After – All Features on Same Scale
🎯 KNN: Scaling Impact
See how scaling changes which neighbors are closest.
Without Scaling
With Min-Max Scaling
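The flip can be reproduced in a few lines; the three patients below are made up, with features (age in years, cholesterol in mg/dL) on wildly different scales:

```python
# Without scaling, cholesterol dominates the Euclidean distance;
# after min-max scaling, the nearest neighbor changes.
from math import dist

query = (50, 200)
a = (52, 240)   # close in age, farther in cholesterol
b = (75, 210)   # far in age, close in cholesterol

print(dist(query, a), dist(query, b))   # b looks nearer unscaled

def min_max_scale(points):
    cols = list(zip(*points))
    ranges = [(min(c), max(c)) for c in cols]
    return [tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(p, ranges))
            for p in points]

sq, sa, sb = min_max_scale([query, a, b])
print(dist(sq, sa), dist(sq, sb))       # a is nearer after scaling
```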
Categorical Feature Encoding
ML models need numbers. How you encode matters.
🔢 One-Hot Encoding
Visual: Equidistant
📏 Ordinal Encoding
Number Line
🏷️ Label Encoding
When to use what?
| Method | Best For | Risk |
|---|---|---|
| One-Hot | Nominal | High dimensionality |
| Ordinal | Ordinal | Implies equal spacing |
| Label | Trees / binary | Implies false order |
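The two main encodings can be sketched without any library; the category lists are made-up examples, not the dataset's actual values:

```python
# One-hot: one indicator column per category (no implied order).
# Ordinal: a single integer that preserves the ranking.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

def ordinal(value, ordered):
    return ordered.index(value)

chest_pain = ["typical", "atypical", "non-anginal"]   # nominal
severity = ["mild", "moderate", "severe"]             # ordinal

print(one_hot("atypical", chest_pain))   # [0, 1, 0]
print(ordinal("severe", severity))       # 2
```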