What is a Sklearn Pipeline?
Chaining preprocessing steps and a model into a single, reproducible workflow
The Problem
In a typical ML workflow, you need to apply multiple transformations before training a model:
Without a pipeline, you must manually call fit_transform() on each step, track fitted transformers, and reapply them identically at prediction time. This is error-prone and verbose.
The Solution: sklearn.pipeline.Pipeline
A Pipeline bundles a sequence of transforms and a final estimator into a single object that behaves like an estimator itself:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso())
])

pipe.fit(X_train, y_train)   # fits scaler, then model
pipe.predict(X_test)         # transforms, then predicts
Why Use Pipelines?
Five compelling reasons to adopt pipelines in every ML project
1. Prevent Data Leakage
Without a pipeline, it's easy to accidentally fit a scaler on the full dataset (including test data) before splitting. A pipeline guarantees transformers are fit only on training data during cross-validation.
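A minimal sketch of the difference, on synthetic data: the leaky version fits the scaler on all rows before cross-validation, while the pipeline version refits the scaler inside each fold.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

# Leaky: the scaler sees every row, including future validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(Lasso(alpha=0.01), X_leaky, y, cv=5)

# Safe: cross_val_score refits the whole pipeline (scaler included) per fold
pipe = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso(alpha=0.01))])
safe_scores = cross_val_score(pipe, X, y, cv=5)
```

On a small, well-behaved dataset the two scores may look similar; the point is that only the pipeline version generalizes this guarantee to any transformer and any split.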
2. Cleaner Code
Replace dozens of manual fit_transform() / transform() calls with a single pipe.fit() and pipe.predict().
# Without pipeline (messy)
X_train_imp = imputer.fit_transform(X_train)
X_train_sc = scaler.fit_transform(X_train_imp)
model.fit(X_train_sc, y_train)
X_test_imp = imputer.transform(X_test)
X_test_sc = scaler.transform(X_test_imp)
preds = model.predict(X_test_sc)

# With pipeline (clean)
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
3. Easy Cross-Validation
Pass the entire pipeline to cross_val_score(): each fold correctly fits transformers only on its training portion.
4. Grid Search over Everything
Tune hyperparameters of any step using GridSearchCV with the stepname__param syntax:
params = {'scaler__with_mean': [True, False],
'lasso__alpha': [0.01, 0.1, 1.0]}
5. Deployment-Ready
Serialize the entire pipeline with joblib.dump(pipe, 'model.pkl'): one file contains all preprocessing + model logic.
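A small end-to-end sketch of this workflow (the file path and data here are illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)

pipe = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso(alpha=0.1))])
pipe.fit(X, y)

# One file captures the fitted scaler AND the fitted model
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
joblib.dump(pipe, path)

restored = joblib.load(path)       # e.g. in a serving process
preds = restored.predict(X[:5])    # scaling is applied automatically
```

Because the restored object is the whole pipeline, the serving code never has to know which transformations were used at training time.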
Anatomy of a Pipeline
Understanding the building blocks
Two Types of Components
Transformer
Has fit() + transform() methods.
All intermediate steps must be transformers.
Examples: StandardScaler, OneHotEncoder, PCA, SimpleImputer
Estimator
Has fit() + predict() methods.
Only the last step can be an estimator (or another transformer).
Examples: Lasso, RandomForestRegressor, SVC
The steps Parameter
A pipeline is defined by a list of (name, estimator) tuples:
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # step 0: transformer
    ('scaler', StandardScaler()),                   # step 1: transformer
    ('pca', PCA(n_components=5)),                   # step 2: transformer
    ('lasso', Lasso())                              # step 3: final estimator
])
Or use the shorthand make_pipeline() which auto-generates names:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    PCA(n_components=5),
    Lasso()
)
Accessing Steps
pipe.steps          # list of (name, estimator) tuples
pipe.named_steps    # dict-like access
pipe[0]             # first step by index
pipe['scaler']      # step by name
pipe[:2]            # slicing returns a sub-Pipeline
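Step access is most useful after fitting, for inspecting learned parameters. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X, y = rng.normal(size=(80, 4)), rng.normal(size=80)

pipe = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso(alpha=0.1))])
pipe.fit(X, y)

means = pipe['scaler'].mean_     # per-feature means learned during fit
coefs = pipe['lasso'].coef_      # fitted model coefficients
scaled = pipe[:1].transform(X)   # sub-pipeline: apply only the scaler
```

Slicing returns a new Pipeline, so `pipe[:-1].transform(X)` is a handy way to see exactly what the final estimator receives.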
All intermediate steps must implement fit() and transform(). The last step can be any estimator (classifier, regressor) or a transformer.
How fit / transform / predict Work
Visualizing the data flow through the pipeline
Pipeline Flow: Reference Diagram
The following diagram shows the complete pipeline flow for both Step 1 (Training) and Step 2 (Prediction):
Step 1 (Training): each intermediate step calls fit_transform(X, y) and passes the result to the next step. On the final step, only fit(X, y) is called.
Step 2 (Prediction): each intermediate step calls transform(X) (using the parameters learned during fit). On the final step, predict(X) is called.
pipe.fit_predict(X, y) is equivalent to pipe.fit(X, y).predict(X) but may be more efficient if the final estimator implements fit_predict.
Similarly, pipe.fit_transform(X, y) calls fit_transform on all steps (requires last step to be a transformer).
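The flow above can be verified by hand: chaining fit_transform / fit manually produces the same predictions as the pipeline (sketch on synthetic data).

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X, y = rng.normal(size=(60, 3)), rng.normal(size=60)
X_new = rng.normal(size=(10, 3))

pipe = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso(alpha=0.1))])
pipe.fit(X, y)

# The manual equivalent of pipe.fit / pipe.predict:
scaler, lasso = StandardScaler(), Lasso(alpha=0.1)
Xt = scaler.fit_transform(X)   # intermediate step: fit_transform
lasso.fit(Xt, y)               # final step: fit only
manual_preds = lasso.predict(scaler.transform(X_new))

assert np.allclose(pipe.predict(X_new), manual_preds)
```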
ColumnTransformer Example
Handling heterogeneous data (numerical and categorical features) in a single pipeline
Why do we need ColumnTransformer?
Most real-world datasets contain a mix of feature types. For example, a dataset might have numerical columns (like Age and Fare) and categorical columns (like Sex and Embarked). You cannot apply a StandardScaler to categorical text, and you shouldn't apply OneHotEncoder to continuous numbers.
ColumnTransformer allows you to apply different preprocessing pipelines to different subsets of features, and then it automatically concatenates the results back into a single feature matrix.
Diagram: the input columns ['Age', 'Fare', 'Sex', 'Embarked'] are split by ColumnTransformer; the numeric pipeline receives ['Age', 'Fare'], the categorical pipeline receives ['Sex', 'Embarked'], and the transformed columns are concatenated back together.
The Code
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import Lasso

# 1. Define the sub-pipelines
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# 2. Combine them using ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipe, ['Age', 'Fare']),
    ('cat', cat_pipe, ['Sex', 'Embarked'])
])

# 3. Create the final pipeline
full_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('lasso', Lasso(alpha=0.1))
])

# 4. Fit and predict just like a normal pipeline!
full_pipe.fit(X_train, y_train)
preds = full_pipe.predict(X_test)
Code Examples
Common pipeline patterns you'll use in practice
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])

pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(f"R^2 Score: {score:.3f}")
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso())
])

# Each fold: fit scaler on train fold, transform train+val, fit model
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"Mean R^2: {scores.mean():.3f} ± {scores.std():.3f}")
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso())
])

# Use stepname__param syntax
param_grid = {
    'scaler__with_mean': [True, False],
    'lasso__alpha': [0.001, 0.01, 0.1, 1.0, 10.0]
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)
print(grid.best_params_)
print(f"Best CV R^2: {grid.best_score_:.3f}")
Advanced Topics
Custom transformers, visualization, and caching
Custom Transformers with FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
import numpy as np

log_transform = FunctionTransformer(np.log1p, validate=True)

pipe = Pipeline([
    ('log', log_transform),
    ('scaler', StandardScaler()),
    ('lasso', Lasso())
])
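When a transformer needs to learn parameters during fit (which FunctionTransformer cannot do), the usual pattern is to subclass BaseEstimator and TransformerMixin. The class below is a hypothetical example, not a scikit-learn built-in:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips each feature to percentiles learned during fit."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Learned state is stored with a trailing underscore, per sklearn convention
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X), self.low_, self.high_)

rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(size=(99, 2)), [[100.0, -100.0]]])
clipped = ClipOutliers().fit_transform(X_demo)  # fit_transform comes free from TransformerMixin
```

Because it implements fit() and transform(), it can be dropped into any intermediate Pipeline slot and will participate in grid search via `stepname__lower` / `stepname__upper`.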
Visualizing a Pipeline
In recent versions of scikit-learn, simply evaluating the pipeline object in a Jupyter Notebook cell automatically renders an interactive HTML diagram:
pipe # Type the variable name as the last line in a cell to display it
Note: If you are using an older version and it prints text instead, you can manually enable it using sklearn.set_config(display='diagram').
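Outside a notebook, the same diagram can be exported as standalone HTML with sklearn.utils.estimator_html_repr (a sketch; you would typically write the string to a file and open it in a browser):

```python
from sklearn import set_config
from sklearn.utils import estimator_html_repr
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

set_config(display='diagram')  # explicit, in case your version defaults to text

pipe = Pipeline([('scaler', StandardScaler()), ('lasso', Lasso())])

# Returns the diagram as an HTML string, e.g. for saving to a report
html = estimator_html_repr(pipe)
```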
Pipeline Memory (Caching)
Use the memory parameter to cache fitted transformers, which is useful when transformations are expensive:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso())
], memory='cache_dir')
Test Your Understanding
Check what you've learned about sklearn pipelines
1. What methods must all intermediate steps in a Pipeline implement?
2. When pipe.fit(X, y) is called, what happens at intermediate steps?
3. How do you reference the hyperparameter alpha of a step named 'lasso' in GridSearchCV?
4. What is the main advantage of using a Pipeline with cross-validation?
5. Which component allows applying different transformations to different column groups?
6. When pipe.predict(X_new) is called on new data, what happens at intermediate steps?