Lasso Model Selection & K-Fold CV

Discover how GridSearchCV computes the mean and standard deviation of the validation-fold scores to find the best α, and how refit=True retrains the final model on all of your training data.
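For reference, the quantities involved can be written out explicitly. The first line is the objective that scikit-learn's Lasso minimizes for n samples (intercept omitted for brevity), where α weights the L1 penalty. The second line gives the per-α mean and standard deviation over K validation folds. The symbols s_k, K, and the bar and σ notation are introduced here for illustration and do not appear in scikit-learn itself; the standard deviation is the population form (dividing by K).

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1

\bar{s}(\alpha) = \frac{1}{K}\sum_{k=1}^{K} s_k(\alpha),
\qquad
\sigma(\alpha) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl(s_k(\alpha) - \bar{s}(\alpha)\bigr)^2}
```

Here s_k(α) is the R² score on validation fold k; a larger α applies stronger shrinkage and drives more coefficients to exactly zero.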

📋 The 5-Step Workflow

1. Hold Out Test Data

Reserve 20% of the dataset as a sealed test set so that no information from it leaks into model selection.

2. Define the Alpha Grid

List candidate values of α to evaluate: [0.01, 0.1, 1.0, 10.0].

3. Run K-Fold Validation

For each α, run 5-fold cross-validation on the training data. Record the R² score on each validation fold, then compute the mean and standard deviation of those five scores.

4. Select Best α

The candidate α with the highest mean_test_score is selected as the winner.

5. refit=True & Final Test

Scikit-learn discards the per-fold CV models and retrains a brand new model on 100% of the Training Data using the winning α. You then score that refitted model against the Test Data.
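The five steps above can be sketched by hand, without GridSearchCV, to make the inner loop visible. This is a minimal illustration rather than scikit-learn's internal implementation: the synthetic data from make_regression and the names summary, fold_scores, and final_model are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)

# Step 1: hold out 20% as the sealed test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: candidate alphas.
alphas = [0.01, 0.1, 1.0, 10.0]

# Step 3: 5-fold CV on the training data only.
# For a regressor, cv=5 corresponds to KFold(n_splits=5) without shuffling.
kfold = KFold(n_splits=5)
summary = {}
for alpha in alphas:
    fold_scores = []
    for fit_idx, val_idx in kfold.split(X_train):
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Lasso.score returns R-squared on the held-out fold.
        fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    summary[alpha] = (np.mean(fold_scores), np.std(fold_scores))

# Step 4: pick the alpha with the highest mean validation score.
best_alpha = max(summary, key=lambda a: summary[a][0])

# Step 5: refit on all of the training data, then score once on the test set.
final_model = Lasso(alpha=best_alpha, max_iter=10000).fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)

for alpha, (mean, std) in summary.items():
    print(f"alpha={alpha:<5} mean R2={mean:.3f} std={std:.3f}")
print("best alpha:", best_alpha, "| test R2:", round(test_r2, 3))
```

Note that the test set is touched exactly once, after the winning α has been fixed; every fit inside the loop sees only part of the training data.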

Interactive Grid Search Simulator

[Interactive simulator: the entire dataset is split into Training Data (80%), used for all CV steps and the final refit, and Test Data (20%). The simulator walks through the alpha candidates (Step 2), the inner cross-validation loop repeated for each candidate to find the best α (Step 3), the resulting cv_results_ DataFrame with columns param_alpha, mean_test_score, and std_test_score (Step 4), and the final evaluation on the Test Data (Step 5).]

💻 Scikit-Learn Implementation

lasso_grid_search.py
```python
# STEP 1: Hold out test data
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso

# X (features) and y (target) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# STEP 2: Define the alpha grid
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}

# STEP 3: Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=Lasso(max_iter=10000),
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    refit=True,  # CRITICAL: automatically retrain on 100% of X_train once the best alpha is found
)

# STEP 4: Execute search and view cv_results_
grid_search.fit(X_train, y_train)

# Extract the mean and std deviation computed for each alpha
cv_df = pd.DataFrame(grid_search.cv_results_)
print(cv_df[['param_alpha', 'mean_test_score', 'std_test_score']])

print("\nSelected Best Alpha:", grid_search.best_params_['alpha'])

# STEP 5: Final evaluation on the sealed test set
# Because refit=True, grid_search has ALREADY built a new model using the
# entire training set (80% of our original data) and the best alpha.
# We just call score().
final_r2 = grid_search.score(X_test, y_test)
print("Final Test R2:", round(final_r2, 3))
```
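As a sanity check on the two claims in the listing, the sketch below confirms that mean_test_score and std_test_score are the mean and population standard deviation of the per-fold split columns, and that the refitted model matches a Lasso trained by hand on the full training set with the winning α. The synthetic data and the variable names (fold_scores, manual, and the *_matches flags) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grid_search = GridSearchCV(
    estimator=Lasso(max_iter=10000),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
    refit=True,
)
grid_search.fit(X_train, y_train)

# cv_results_ stores one column per fold: split0_test_score ... split4_test_score.
cv_df = pd.DataFrame(grid_search.cv_results_)
split_cols = [f"split{i}_test_score" for i in range(5)]
fold_scores = cv_df[split_cols].to_numpy()

# Row-wise mean and population std (ddof=0) should reproduce the summary columns.
mean_matches = np.allclose(fold_scores.mean(axis=1), cv_df["mean_test_score"])
std_matches = np.allclose(fold_scores.std(axis=1), cv_df["std_test_score"])

# Refit by hand on the whole training set with the winning alpha.
best_alpha = grid_search.best_params_["alpha"]
manual = Lasso(alpha=best_alpha, max_iter=10000).fit(X_train, y_train)

# best_estimator_ is the model that refit=True trained after the search.
coef_matches = np.allclose(manual.coef_, grid_search.best_estimator_.coef_)
score_matches = np.isclose(manual.score(X_test, y_test),
                           grid_search.score(X_test, y_test))

print("mean matches:", mean_matches, "| std matches:", std_matches)
print("coefficients match:", coef_matches, "| test score matches:", score_matches)
```

If any of these checks failed on your own data, the usual suspects would be a custom scorer or a stochastic estimator setting rather than GridSearchCV itself.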