Lasso Model Selection & K-Fold CV

Discover how GridSearchCV calculates the mean and standard deviation of validation folds to find the best α, and how refit=True fully leverages your training data.

📋 The 5-Step Workflow

Hold Out Test Data

Reserve 20% of the dataset as a sealed test set so that no information from it leaks into model selection.

Define the Alpha Grid

List candidate values of α to evaluate: [0.01, 0.1, 1.0, 10.0].
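Candidate values of α are commonly spaced on a logarithmic scale, because useful regularization strengths can span several orders of magnitude. As a minimal sketch, the same four-value grid could be generated with NumPy instead of typed by hand (the variable names are illustrative, not part of the original script):

```python
import numpy as np

# Four values evenly spaced on a log10 scale, from 10^-2 up to 10^1.
alphas = np.logspace(-2, 1, num=4)

# GridSearchCV expects a dict that maps each parameter name to its candidates.
param_grid = {"alpha": alphas.tolist()}
print(param_grid)
```

Raising `num` gives a finer grid over the same range, at the cost of one extra round of cross-validation per added candidate.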

Run K-Fold Validation

For each α, run 5-fold cross-validation on the training data. Record the R-squared score on each validation fold, then compute the mean and standard deviation across the 5 folds.
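The loop described in this step can be written out by hand. The sketch below is a rough illustration of the idea, not scikit-learn's internal implementation. It assumes synthetic data from `make_regression` as a stand-in for the training split, and uses an unshuffled `KFold(n_splits=5)`, which is the splitter scikit-learn applies when `cv=5` is passed with a regressor.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Synthetic stand-in for the training portion of the data (an assumption).
X_train, y_train = make_regression(
    n_samples=200, n_features=10, n_informative=5, noise=10.0, random_state=0
)

alphas = [0.01, 0.1, 1.0, 10.0]
kfold = KFold(n_splits=5)  # unshuffled 5-fold split

summary = {}
for alpha in alphas:
    fold_scores = []
    for fit_idx, val_idx in kfold.split(X_train):
        # Fit on four folds, validate on the remaining one.
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # For regressors, score() returns R-squared.
        fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    # Mean and standard deviation across the five validation folds.
    summary[alpha] = (np.mean(fold_scores), np.std(fold_scores))

for alpha, (mean_r2, std_r2) in summary.items():
    print(f"alpha={alpha:<5} mean R2={mean_r2:.3f}  std={std_r2:.3f}")
```

A large standard deviation relative to the gap between two means suggests that the ranking of those two candidates may not be stable across different splits.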

Select Best α

The candidate α with the highest mean_test_score is selected as the winner.
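To see the selection rule concretely, the winning row can be located by hand in `cv_results_` and compared with what the fitted search reports. This is a sketch on synthetic data (an assumption); `best_index_`, `best_params_`, and `best_score_` are attributes of a fitted `GridSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split (an assumption).
X_train, y_train = make_regression(
    n_samples=200, n_features=10, n_informative=5, noise=10.0, random_state=0
)

search = GridSearchCV(
    Lasso(max_iter=10000),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

results = search.cv_results_
# Row index of the candidate with the highest mean validation R-squared.
best_row = int(np.argmax(results["mean_test_score"]))

print("argmax row:  ", best_row, results["params"][best_row])
print("best_index_: ", search.best_index_)
print("best_params_:", search.best_params_)
print("best_score_: ", search.best_score_)
```

Note that `best_score_` is the mean cross-validated score of the winner, not a score on the held-out test set.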

refit=True & Final Test

With refit=True, scikit-learn discards the five partial models fit during CV and trains a brand new model on 100% of the Training Data using the winning α. That refit model is then scored once against the Test Data by calling score().
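One way to check what the refit step does is to repeat it by hand and compare. The sketch below assumes synthetic data and fits a fresh `Lasso` on the full training split with the winning α. Under these assumptions its coefficients and test score should match those of `best_estimator_`, since Lasso's default coordinate descent is deterministic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the real dataset (an assumption).
X, y = make_regression(
    n_samples=250, n_features=10, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    Lasso(max_iter=10000),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
    refit=True,
)
search.fit(X_train, y_train)

# Manual equivalent of the refit: a fresh Lasso trained on every training row.
manual = Lasso(alpha=search.best_params_["alpha"], max_iter=10000)
manual.fit(X_train, y_train)

same_coefs = np.allclose(manual.coef_, search.best_estimator_.coef_)
manual_r2 = manual.score(X_test, y_test)
search_r2 = search.score(X_test, y_test)

print("Coefficients match:", same_coefs)
print("Test R2 (manual vs. search):", manual_r2, search_r2)
```

With `refit=False`, the search would still report `best_params_`, but `best_estimator_` and `score()` would not be available, so the manual fit above would be required.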

[Interactive Grid Search Simulator: steps through the alpha candidates (Step 2), the inner cross-validation loop repeated for each candidate (Step 3), the cv_results_ table with columns param_alpha, mean_test_score, and std_test_score (Step 4), and the final evaluation on the Test Data (Step 5). The Training Data (80%) is used for all CV steps and the final refit; the Test Data (20%) is held out.]

💻 Scikit-Learn Implementation

lasso_grid_search.py

# STEP 1: Hold out test data
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso

# X (feature matrix) and y (target vector) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# STEP 2: Define the alpha grid
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}

# STEP 3: Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=Lasso(max_iter=10000),
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    refit=True,  # the default: retrain on all of X_train once the best alpha is found
)

# STEP 4: Execute search and view cv_results_
grid_search.fit(X_train, y_train)

# Extract the mean and standard deviation computed for each alpha
cv_df = pd.DataFrame(grid_search.cv_results_)
print(cv_df[['param_alpha', 'mean_test_score', 'std_test_score']])

print("\nSelected Best Alpha:", grid_search.best_params_['alpha'])

# STEP 5: Final evaluation on the sealed test set
# Because refit=True, grid_search has already fit a new model on the full
# training split (80% of the original data) using the best alpha.
# Calling score() evaluates that refit model.
final_r2 = grid_search.score(X_test, y_test)
print("Final Test R2:", round(final_r2, 3))