LASSO Feature Selection in Practice: A Complete Regression Workflow for a Continuous Target (with Code)


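✨ Welcome to Python机器学习AI ✨

In this section: a complete, hands-on LASSO feature-selection workflow for a regression model with a continuous target. The code and figures reflect the author's own understanding of machine learning and are for reference only. The full data and code will be uploaded to the discussion group later, where members can download them; readers who need them can follow the access instructions at the end of the article, which also lists some efficient study tools.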

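✨ Background ✨

LASSO feature selection is linear regression with L1 regularization. On top of the ordinary least-squares loss it adds a penalty proportional to the sum of the absolute values of the coefficients, which shrinks them toward zero. The regularization strength α controls how strong that shrinkage is: as α grows, more coefficients are driven exactly to zero, so irrelevant or redundant features drop out of the model while the features that carry signal keep non-zero coefficients.

On high-dimensional data this reduces overfitting and improves generalization and interpretability. It is especially useful when features far outnumber samples, where it helps pick the key drivers out of a large pool of variables; screening imaging (radiomics) features is a common example. That use case more often involves binary classification models, and interested readers can refer to the article "Journal Reproduction: How to Correctly Use LASSO for Binary Classification Feature Selection? Avoid Common Pitfalls and Master Practical Techniques". Compared with the binary case, LASSO feature selection for a regression model is simpler.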
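To make the role of α concrete, here is a minimal sketch on synthetic data. The dataset from `make_regression`, the helper `count_nonzero_coefs`, and the four alpha values are assumptions chosen for illustration; they are not the article's data or settings. The sketch counts how many coefficients remain non-zero at each penalty strength.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 20 features, only 3 informative
X_demo, y_demo = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=5.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)


def count_nonzero_coefs(alpha):
    """Fit a Lasso model at the given alpha and count its non-zero coefficients."""
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    return int(np.sum(model.coef_ != 0))


# Hypothetical alpha grid, from a very weak to a very strong penalty
nonzero_by_alpha = {alpha: count_nonzero_coefs(alpha) for alpha in (0.01, 1.0, 10.0, 1e4)}
for alpha, n_nonzero in nonzero_by_alpha.items():
    print(f"alpha={alpha:g}: {n_nonzero} non-zero coefficients")
```

With a very weak penalty most coefficients stay non-zero, and with a large enough penalty every coefficient is exactly zero. The interesting range for feature selection lies in between, which is why the workflow below searches over a grid of alpha values.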

✨ Implementation ✨

  
import warnings  
  
import matplotlib.pyplot as plt  
import numpy as np  
import pandas as pd  
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
  
plt.rcParams['font.family'] = 'Times New Roman'  
plt.rcParams['axes.unicode_minus'] = False  
# Suppress all warnings to keep the output clean  
warnings.filterwarnings("ignore")  
  
path = r"2025-11-18公众号Python机器学习AI.xlsx"  
df = pd.read_excel(path)  
  
# Separate the features from the target; a 1-D Series is the shape scikit-learn expects for y  
X = df.drop(['target'], axis=1)  
y = df['target']  
  
# Split into training and test sets (30% held out for testing)  
X_train, X_test, y_train, y_test = train_test_split(  
    X,  
    y,  
    test_size=0.3,  
    random_state=42  
)  
  
# Fit the scaler on the training features only, then transform the training set  
scaler = StandardScaler()  
X_train_scaled = scaler.fit_transform(X_train)  
# Wrap the result in a DataFrame to keep the column names and the row index  
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)  
# Apply the same fitted scaler to the test set  
X_test_scaled = scaler.transform(X_test)  
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

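This block reads the dataset, separates the features X from the target y, and holds out 30% of the rows as a test set. A StandardScaler is fitted on the training features only and then applied to both the training and test sets, with the results wrapped back into DataFrames so the column names are kept. This prepares the input for the LASSO model. Standardization matters because the L1 penalty acts on the size of each coefficient, and that size depends on the unit of the feature: a feature recorded in small numbers needs a large coefficient and is penalized more heavily than the same information recorded in large numbers. Putting every feature on the same scale keeps units from distorting both convergence and the choice of features.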
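The effect of units on selection can be shown with a small sketch. Everything here is an assumption made for illustration: the toy data, the variable names, and the fixed `alpha=0.5`. Two features contribute equally to the target, but the second is stored in a unit that makes its values 1000 times smaller.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200
x1 = rng.normal(size=n_samples)
x2 = rng.normal(size=n_samples)
# Both features contribute equally to the target
y_toy = x1 + x2 + 0.1 * rng.normal(size=n_samples)

# Same information, but the second feature's raw values are 1000x smaller
X_raw = np.column_stack([x1, x2 / 1000.0])
X_std = StandardScaler().fit_transform(X_raw)

coef_raw = Lasso(alpha=0.5).fit(X_raw, y_toy).coef_
coef_std = Lasso(alpha=0.5).fit(X_std, y_toy).coef_
print("raw units:    ", coef_raw)
print("standardized: ", coef_std)
```

On the raw data the second coefficient is shrunk to exactly zero, even though the feature is as informative as the first. After standardization both features are kept.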

  
import warnings  
warnings.filterwarnings("ignore", category=UserWarning)  
  
from sklearn.linear_model import LassoCV  
from sklearn.model_selection import RepeatedKFold  
feature_names = X.columns  
  
# Candidate alpha values: 100 points, log-spaced from 1e-6 to 1e1  
alphas = np.logspace(-6, 1, 100)  
  
# LassoCV with 10-fold cross-validation repeated 3 times (30 folds in total)  
lasso_cv = LassoCV(alphas=alphas, cv=RepeatedKFold(n_splits=10, n_repeats=3, random_state=42), random_state=42)  
lasso_cv.fit(X_train_scaled, y_train)  
  
# mse_path_ has shape (n_alphas, n_folds); summarize each alpha across the folds  
mse_path = lasso_cv.mse_path_.mean(axis=1)  # mean MSE per alpha  
mse_std = lasso_cv.mse_path_.std(axis=1)    # standard deviation of the MSE per alpha (not the standard error)  
  
# Locate λ_min and the 1-SE alpha  
best_alpha_index = np.argmin(mse_path)  # index of the lowest mean MSE  
best_alpha = lasso_cv.alphas_[best_alpha_index]  # λ_min  
# lasso_cv.alphas_ is sorted from largest to smallest, so the first index that meets the  
# threshold is the largest alpha whose mean MSE is within one standard deviation of the minimum  
one_se_index = np.where(mse_path <= mse_path[best_alpha_index] + mse_std[best_alpha_index])[0][0]  
one_se_alpha = lasso_cv.alphas_[one_se_index]  # λ_1se  
  
# Print both alpha values  
print(f"Best alpha (λ_min): {best_alpha}")  
print(f"1-SE rule alpha (λ_1se): {one_se_alpha}")  
  
# Refit at each of the two alpha values and collect the features with non-zero coefficients.  
# Alpha is already fixed at this point, so a plain Lasso fit is enough; no further cross-validation is needed.  
from sklearn.linear_model import Lasso  
  
lasso_best_alpha = Lasso(alpha=best_alpha)  
lasso_best_alpha.fit(X_train_scaled, y_train)  
selected_features_best = [feature_names[i] for i in np.where(lasso_best_alpha.coef_ != 0)[0]]  # features kept at λ_min  
print(f"Selected features with λ_min: {selected_features_best}")  
  
lasso_one_se_alpha = Lasso(alpha=one_se_alpha)  
lasso_one_se_alpha.fit(X_train_scaled, y_train)  
selected_features_one_se = [feature_names[i] for i in np.where(lasso_one_se_alpha.coef_ != 0)[0]]  # features kept at λ_1se  
print(f"Selected features with λ_1se: {selected_features_one_se}")

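This block uses scikit-learn's LassoCV to cross-validate a LASSO regression on a continuous target and to select features. It imports LassoCV and RepeatedKFold (10-fold cross-validation repeated 3 times, for more stable estimates) and takes the feature names from X.columns. The alpha (regularization strength) grid is 100 log-spaced values from 10^-6 to 10^1. After LassoCV is fitted on the training data, the code computes, for each alpha, the mean MSE across folds (mse_path) and its standard deviation across folds (mse_std).

Two alpha values are then identified. λ_min is the alpha with the lowest mean MSE. λ_1se is the largest alpha whose mean MSE does not exceed the minimum MSE plus one standard deviation, which gives a more conservative, sparser model. Note that the code uses the standard deviation across the 30 folds as the band; the textbook 1-SE rule uses the standard error (the standard deviation divided by the square root of the number of folds), which gives a narrower band and therefore usually a smaller λ_1se.

Finally, the model is refitted at each of the two alpha values, and the names of the features with non-zero coefficients are extracted and printed. This screens out features that contribute little to the target, removes noise from high-dimensional data, and improves interpretability and generalization.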
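For comparison, the standard-error version of the rule can be sketched as a small helper. The function name `select_alphas` and the toy error table are hypothetical. The helper also treats the folds as independent, which is only an approximation when the folds come from repeated K-fold.

```python
import numpy as np


def select_alphas(alphas, mse_path):
    """Return (alpha_min, alpha_1se) from a cross-validation error table.

    alphas   : shape (n_alphas,), in any order
    mse_path : shape (n_alphas, n_folds), one row of fold errors per alpha
    """
    alphas = np.asarray(alphas, dtype=float)
    mse_path = np.asarray(mse_path, dtype=float)
    n_folds = mse_path.shape[1]

    mean_mse = mse_path.mean(axis=1)
    # Standard error of the mean = sample standard deviation / sqrt(number of folds)
    se_mse = mse_path.std(axis=1, ddof=1) / np.sqrt(n_folds)

    idx_min = np.argmin(mean_mse)
    threshold = mean_mse[idx_min] + se_mse[idx_min]
    # Largest alpha (sparsest model) whose mean error stays within the threshold
    alpha_1se = alphas[mean_mse <= threshold].max()
    return alphas[idx_min], alpha_1se


# Toy error table (hypothetical numbers): 4 alphas, 2 folds each
toy_alphas = [10.0, 1.0, 0.1, 0.01]
toy_mse = [[5.0, 5.2], [1.05, 1.25], [1.0, 1.2], [1.4, 1.6]]
alpha_min, alpha_1se = select_alphas(toy_alphas, toy_mse)
print(alpha_min, alpha_1se)
```

In the toy table the lowest mean error is at alpha 0.1, and alpha 1.0 is the largest value still within one standard error of it, so the helper returns 0.1 and 1.0. With a fitted LassoCV object it could be called as `select_alphas(lasso_cv.alphas_, lasso_cv.mse_path_)`.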

  
Best alpha (λ_min): 0.7390722033525775  
1-SE rule alpha (λ_1se): 1.9630406500402684  
Selected features with λ_min: ['feature_5', 'feature_6', 'feature_15', 'feature_16', 'feature_25', 'feature_35', 'feature_38', 'feature_39', 'feature_45', 'feature_47', 'feature_51', 'feature_52', 'feature_55', 'feature_66', 'feature_75', 'feature_83', 'feature_99']  
Selected features with λ_1se: ['feature_5', 'feature_15', 'feature_16', 'feature_25', 'feature_35', 'feature_45']

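In this result, λ_min (0.7391) is the alpha with the lowest cross-validated MSE. It is the weaker of the two penalties and keeps 17 features. That gives the best estimated fit, but the larger model is harder to interpret and more likely to include features that contribute only noise. The 1-SE alpha, λ_1se (1.9630), uses a more conservative threshold: it is the largest alpha whose MSE does not exceed the MSE at λ_min plus one standard deviation. It keeps only 6 features, all of which also appear in the λ_min set. The model is therefore much sparser and easier to interpret, giving up a small amount of fitting accuracy to reduce the risk of overfitting.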
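One way to check this trade-off is to score both models on the held-out test set. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_regression` in place of the article's Excel file, a hypothetical helper `evaluate_alpha`, and two made-up alpha values standing in for λ_min and λ_1se.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the article's data (assumption): 100 features, 6 informative
X_syn, y_syn = make_regression(
    n_samples=300, n_features=100, n_informative=6, noise=20.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler_syn = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_syn.transform(X_tr), scaler_syn.transform(X_te)


def evaluate_alpha(alpha):
    """Fit Lasso at a fixed alpha and report sparsity and test-set error."""
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_tr_s, y_tr)
    pred = model.predict(X_te_s)
    return {
        "n_features": int(np.sum(model.coef_ != 0)),
        "test_rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "test_r2": float(r2_score(y_te, pred)),
    }


# Hypothetical alpha values standing in for λ_min and λ_1se
candidate_alphas = {"lambda_min": 1.0, "lambda_1se": 3.0}
report = {name: evaluate_alpha(alpha) for name, alpha in candidate_alphas.items()}
for name, metrics in report.items():
    print(name, metrics)
```

On the real data, the same helper could be fed `best_alpha` and `one_se_alpha` together with the scaled training and test sets from earlier. If the sparser model scores close to the larger one on the test set, that supports preferring it.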

  
from matplotlib import font_manager  
font_properties = font_manager.FontProperties(weight='bold', size=18)  
plt.figure(figsize=(10, 6))  
plt.errorbar(lasso_cv.alphas_, mse_path, yerr=mse_std, fmt='o', color='red', ecolor='gray', capsize=3)  
plt.axvline(lasso_cv.alphas_[best_alpha_index], linestyle='--', color='black', label=r'$\lambda_{min}$')  
plt.axvline(lasso_cv.alphas_[one_se_index], linestyle='--', color='blue', label=r'$\lambda_{1se}$')  
plt.xscale('log')  # show alpha on a logarithmic axis  
plt.xlabel('Alpha (α) value', fontsize=18, fontweight='bold')  
plt.ylabel('Mean Squared Error (MSE)', fontsize=18, fontweight='bold')  
plt.title('Lasso Regression: MSE vs Alpha (α) value', fontsize=18, fontweight='bold')  
plt.xticks(fontsize=18, fontweight='bold')  
plt.yticks(fontsize=18, fontweight='bold')  
plt.legend(prop=font_properties)  # prop already sets the size; fontsize is only used when prop is not given  
plt.tight_layout()    
plt.savefig("lasso-1.pdf", format='pdf', bbox_inches='tight', dpi=1200)  
plt.show()

[Figure: cross-validated MSE versus alpha, with λ_min and λ_1se marked by dashed vertical lines]

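The figure plots the cross-validated mean squared error of the Lasso regression, with error bars, against α (regularization strength) on a logarithmic horizontal axis. Dashed vertical lines mark the optimal α (λ_min) and the 1-SE α (λ_1se), showing how model performance changes with the strength of the penalty.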

  
from sklearn.linear_model import Lasso  
  
# Fit one Lasso model per alpha and record its coefficients  
coefs = []  
  
for a in alphas:  
    lasso = Lasso(alpha=a, max_iter=10000)  
    lasso.fit(X_train_scaled, y_train)  
    coefs.append(lasso.coef_)  
  
# Font settings for the legend  
font_properties = font_manager.FontProperties(weight='bold', size=18)  
  
# Plot the coefficient paths: one line per feature against log10(alpha)  
plt.figure(figsize=(10, 6))  
ax = plt.gca()  
ax.plot(np.log10(alphas), coefs)  
  
# Dashed vertical lines mark λ_min and the 1-SE alpha  
ax.axvline(np.log10(best_alpha), linestyle='--', color='black', label=r'$\lambda_{min}$')  
ax.axvline(np.log10(one_se_alpha), linestyle='--', color='blue', label=r'$\lambda_{1se}$')  
plt.xlabel('Log Lambda', fontsize=18, fontweight='bold')  
plt.ylabel('Coefficients', fontsize=18, fontweight='bold')  
plt.title('Lasso Paths', fontsize=18, fontweight='bold')  
plt.xticks(fontsize=18, fontweight='bold')  
plt.yticks(fontsize=18, fontweight='bold')  
plt.legend(prop=font_properties)  
plt.axis('tight')  
plt.savefig("lasso-2.pdf", format='pdf', bbox_inches='tight', dpi=1200)  
plt.show()