Journal Reproduction: Stacking LightGBM, XGBoost, and RF with Bayesian Hyperparameter Optimization and a Lasso Regression Meta-Model, Combined with 10-Fold Cross-Validation


✨ Welcome to follow Python机器学习AI ✨

This section presents a stacked model of LightGBM, XGBoost, and RF with Bayesian hyperparameter tuning and a Lasso regression meta-model, combined with 10-fold cross-validation. The data are simulated and have no real-world meaning; the author implements the code and produces the figures according to a personal understanding of machine learning, for reference only. The complete data and code will be uploaded to the discussion group later, where members can download them. Readers who need them can follow the access instructions at the end of this post.

✨ Paper Information ✨


The paper uses a stacking model to predict the occurrence of sepsis-associated liver injury (SALI), emphasizing the model's potential clinical application in early intervention and personalized treatment strategies. It is a retrospective multicenter cohort study drawing on data from the MIMIC-IV and eICU-CRD databases. Nine single machine learning models were trained, including decision tree, random forest (RF), extreme gradient boosting (XGBoost), LightGBM, and support vector machine, and an ensemble model was built by stacking (LightGBM, XGBoost, and RF as base learners, Lasso regression as the meta-learner). Hyperparameters were optimized with 10-fold cross-validation, grid search, and Bayesian optimization.

The results show that the stacking model performed well on the training set, internal validation set, and external validation set, with ROC-AUCs of 0.995, 0.838, and 0.721, respectively. The main predictors were total bilirubin, lactate, prothrombin time, and mechanical ventilation status, and SHAP analysis confirmed the importance of these features. Overall, the stacked ensemble model predicts sepsis-associated liver injury accurately and shows good potential for clinical decision support, enabling early intervention and personalized treatment.

The original paper implemented the stacked ensemble in R, with LightGBM, XGBoost, and RF as base learners and Lasso regression as the meta-learner. To emulate this process, the same stacking model is implemented below in Python on a simulated dataset. While the approach follows the paper, the author has added some of his own considerations and adjustments for practical use.

✨ Basic Code ✨

  
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
plt.rcParams['font.family'] = 'Times New Roman'  
plt.rcParams['axes.unicode_minus'] = False  
import warnings  
# Ignore all warnings  
warnings.filterwarnings("ignore")  
  
path = r"2025-8-23公众号Python机器学习AI.xlsx"  
df = pd.read_excel(path)  
from sklearn.model_selection import train_test_split  
  
# Split features and target variable  
X = df.drop(['Electrical_cardioversion'], axis=1)  
y = df['Electrical_cardioversion']  
# Split into training and test sets  
X_train, X_test, y_train, y_test = train_test_split(  
    X,  
    y,  
    test_size=0.3,  
    random_state=250807,  
    stratify=df['Electrical_cardioversion']  
)  
from imblearn.over_sampling import SMOTE  
# Import SMOTE for oversampling the minority class  
smote = SMOTE(sampling_strategy=1, k_neighbors=20, random_state=1314)  
# sampling_strategy=1 raises the minority class to the size of the majority class, i.e. balances the samples  
# k_neighbors=20 means 20 nearest neighbors are used when generating each synthetic sample  
  
# Oversample the training set with SMOTE to obtain a balanced dataset  
X_train, y_train = smote.fit_resample(X_train, y_train)

SMOTE oversamples the training set by generating synthetic minority-class samples, addressing the class imbalance so that the minority class reaches the same size as the majority class.
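As a quick check that is not part of the original code, the class counts can be printed after resampling (assuming y_train is still a pandas Series, which imblearn returns when given one):

# Optional sanity check: the two classes should now be the same size  
print("Class counts after SMOTE:", y_train.value_counts().to_dict())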

  
from lightgbm import LGBMClassifier  
from skopt import BayesSearchCV  
from skopt.space import Real, Integer  
  
# Set the random seed  
random_seed = 1314  
  
# Initialize the LightGBM model with a fixed random seed  
lgbm_model = LGBMClassifier(random_state=random_seed, verbose=-1)  
  
# Define the search space for Bayesian optimization  
param_space = {  
    'n_estimators': Integer(50, 200),  # number of trees  
    'learning_rate': Real(0.01, 0.2),  # learning rate  
    'max_depth': Integer(3, 10),  # maximum tree depth  
    'num_leaves': Integer(31, 150),  # maximum number of leaves  
    'min_data_in_leaf': Integer(20, 100),  # minimum samples per leaf  
    'feature_fraction': Real(0.5, 1.0),  # fraction of features sampled  
    'bagging_fraction': Real(0.5, 1.0),  # fraction of training data sampled  
    'bagging_freq': Integer(1, 10),  # bagging frequency  
    'lambda_l1': Real(0.0, 10.0),  # L1 regularization  
    'lambda_l2': Real(0.0, 10.0),  # L2 regularization  
}  
  
# Create the BayesSearchCV object: 10-fold CV, best parameters selected by AUC  
bayes_search = BayesSearchCV(  
    estimator=lgbm_model,  
    search_spaces=param_space,  
    n_iter=50,  # number of search iterations  
    cv=10,  # 10-fold cross-validation  
    n_jobs=-1,  
    verbose=1,  
    scoring='roc_auc'  # AUC as the evaluation metric  
)  
  
# Run the Bayesian optimization  
bayes_search.fit(X_train, y_train)  
  
# Print the best parameters  
print(f"Best parameters: {bayes_search.best_params_}")  
  
# Final LightGBM model built with the best parameters  
best_lgbm_model = bayes_search.best_estimator_

Bayesian optimization tunes the LightGBM hyperparameters, selecting the best configuration by 10-fold cross-validation and yielding the final optimized LightGBM model. This essentially reproduces the single-model development process from the paper; although only LightGBM, XGBoost, and RF are built in this project, the main purpose is to provide the base learners for the subsequent stacking model.

  
Best parameters: OrderedDict([('bagging_fraction', 0.5), ('bagging_freq', 1), ('feature_fraction', 0.5), ('lambda_l1', 0.0), ('lambda_l2', 0.0), ('learning_rate', 0.2), ('max_depth', 3), ('min_data_in_leaf', 20), ('n_estimators', 200), ('num_leaves', 31)])

Above are the best parameters found by Bayesian optimization; below is the detailed record of the 50 optimization iterations and the 10-fold cross-validation.
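Both are standard attributes of the fitted search object: best_params_ holds the winning configuration, best_score_ its mean cross-validated AUC, and cv_results_ the full record. The cv_results_lgbm name defined here is reused by the per-fold extraction code further down:

# Mean 10-fold AUC of the winning configuration  
print(f"Best mean CV AUC: {bayes_search.best_score_:.3f}")  
# Full tuning record: one entry per sampled configuration  
cv_results_lgbm = bayes_search.cv_results_  
print(cv_results_lgbm)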

  
{'mean_fit_time': array([0.47141235, 0.56793735, 0.13802657, 0.12051001, 0.15416923,  
        0.1391504 , 0.12191632, 0.19325674, 0.09152734, 0.13570299,  
        0.07423675, 0.05186722, 0.04643362, 0.1920341 , 0.07052174,  
        0.04521391, 0.06697278, 0.13841763, 0.19580948, 0.13909569,  
        0.15705607, 0.03853743, 0.04885747, 0.12798331, 0.02897127,  
        0.11155849, 0.081689  , 0.09965045, 0.22744079, 0.07906342,  
        0.16844356, 0.0300112 , 0.1699877 , 0.21263621, 0.1679965 ,  
        0.11116667, 0.04526923, 0.09896352, 0.07381217, 0.16800377,  
        0.2345834 , 0.18622148, 0.12389345, 0.15850952, 0.13574526,  
        0.17105093, 0.20827687, 0.02118254, 0.11915166, 0.12696743]),  
 'std_fit_time': array([0.08166686, 0.47427748, 0.01508848, 0.01290717, 0.00993853,  
        0.01040365, 0.02048133, 0.0231387 , 0.01366311, 0.00956012,  
        0.00790969, 0.00968859, 0.00590045, 0.02122181, 0.00764867,  
        0.00993905, 0.00647517, 0.01460721, 0.01702726, 0.01622816,  
        0.01348138, 0.00610243, 0.00886168, 0.01023775, 0.00533707,  
        0.01130647, 0.01695288, 0.01190836, 0.01378781, 0.01530867,  
        0.02675768, 0.00369452, 0.02259964, 0.01845608, 0.01096206,  
        0.00961332, 0.00755672, 0.01007901, 0.01132726, 0.0277588 ,  
        0.01668455, 0.01556645, 0.00990387, 0.01308461, 0.01437952,  
        0.01469316, 0.01917468, 0.00533687, 0.01129543, 0.01208819]),  
 'mean_score_time': array([0.01005249, 0.00786049, 0.00716956, 0.00840294, 0.00790768,  
        0.0077678 , 0.01390378, 0.00740159, 0.00740175, 0.00680156,  
        0.00670152, 0.00640109, 0.00590219, 0.00756779, 0.00735221,  
        0.00717421, 0.00769515, 0.00619354, 0.00757706, 0.00807483,  
        0.00680172, 0.00740199, 0.00690165, 0.006603  , 0.00680165,  
        0.00675206, 0.00620141, 0.00706594, 0.00685205, 0.00592213,  
        0.00690167, 0.00644534, 0.00795267, 0.00836887, 0.00625224,  
        0.00740163, 0.00675213, 0.00680141, 0.00620139, 0.00745475,  
        0.00760183, 0.00770311, 0.0071027 , 0.00674474, 0.00673521,  
        0.00715559, 0.00985351, 0.00644705, 0.00660174, 0.00700145]),  
 'std_score_time': array([0.0053423 , 0.00221177, 0.00096081, 0.0014035 , 0.00142936,  
        0.0011526 , 0.00420652, 0.00185531, 0.00102007, 0.00060044,  
        0.00045855, 0.00091684, 0.00030046, 0.00105872, 0.0010499 ,  
        0.00094281, 0.00116701, 0.00092147, 0.0017813 , 0.00114397,  
        0.0006003 , 0.00076681, 0.00094368, 0.00083103, 0.0009799 ,  
        0.00081348, 0.00097995, 0.00084681, 0.00077675, 0.00057694,  
        0.00094366, 0.00068002, 0.00115087, 0.00237004, 0.00116842,  
        0.00185514, 0.00107864, 0.00098014, 0.00074839, 0.00098808,  
        0.00066331, 0.00097926, 0.00097039, 0.00073761, 0.00093041,  
        0.00114255, 0.00422707, 0.00111215, 0.00049021, 0.00109567]),  
 'param_bagging_fraction': masked_array(data=[0.7423028931466307, 0.9380063006967165,  
                    0.7757170419472157, 0.8950417422569352,  
                    0.7580519044138212, 0.5921627242188022,  
                    0.745963690463046, 0.5970396920276715,  
                    0.9578055432028029, 0.7431268552986383, 1.0, 0.5, 0.5,  
                    0.5, 0.5, 0.5, 0.5, 0.7038415235304077, 0.5,  
                    0.5936588795242723, 0.5, 0.5536201045274782, 1.0, 1.0,  
                    1.0, 1.0, 0.7581402481936261, 0.7482148100316208,  
                    0.6320782302442085, 0.597006632663656, 0.5, 0.5,  
                    0.9618966028590572, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0,  
                    0.8673311095211812, 0.8702246127923383,  
                    0.5899074067228626, 0.6497727998755518, 0.5,  
                    0.9131821030375682, 0.9069307255429133,  
                    0.8845382708811773, 0.8349915382372025, 0.5,  
                    0.6462725250962974],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=1e+20),  
 'param_bagging_freq': masked_array(data=[2, 5, 7, 3, 6, 8, 1, 7, 6, 4, 1, 1, 1, 1, 1, 10, 1, 2,  
                    1, 4, 3, 1, 4, 1, 6, 1, 1, 1, 1, 4, 1, 7, 2, 10, 8, 1,  
                    10, 1, 10, 1, 10, 1, 1, 10, 1, 1, 1, 1, 8, 4],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=999999),  
 'param_feature_fraction': masked_array(data=[0.5901778432085792, 0.7512094669726437,  
                    0.9756806366596344, 0.8098233173469511,  
                    0.5410763033636882, 0.9608410261225253,  
                    0.6532723954171916, 0.5526853983604337,  
                    0.7975213577921919, 0.9841978059631173, 0.5, 0.5, 0.5,  
                    0.5, 0.5, 1.0, 0.9250182241662434, 0.5, 0.5, 0.5, 0.5,  
                    0.5739845290872301, 0.6524409739318593, 0.5,  
                    0.9729706224267239, 0.5, 0.5, 0.9183444626142493, 0.5,  
                    0.7550272862288641, 1.0, 1.0, 0.5182632474519837, 0.5,  
                    0.5, 0.5, 0.9654073204246225, 0.6488114897274018, 0.5,  
                    0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 0.6174380363139128, 1.0,  
                    0.9291576465556091, 1.0, 1.0],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=1e+20),  
 'param_lambda_l1': masked_array(data=[4.417711632029077, 9.977689713350774,  
                    0.5279187399888919, 7.599221660899527,  
                    5.5493902753841375, 6.653517120772942,  
                    2.8353285585037997, 6.146770301032932,  
                    0.47985933344575804, 3.298950353117313, 0.0, 0.0, 0.0,  
                    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0,  
                    0.0, 0.0, 10.0, 0.36916491990609, 0.0, 9.7771153245372,  
                    0.0, 0.0, 0.4286692957029426, 0.0, 0.0, 0.0, 10.0, 0.0,  
                    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.04679285984576471,  
                    0.0, 0.8860708729198816, 0.0, 0.0],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=1e+20),  
 'param_lambda_l2': masked_array(data=[2.645483128765203, 2.661013687700577,  
                    4.127507915651929, 2.2243261316556366,  
                    1.8648287462251236, 6.882681138694171,  
                    5.585716309734665, 6.79588608050889, 4.203042813248967,  
                    8.79084749115397, 10.0, 0.0, 0.0, 4.652959526094746,  
                    10.0, 10.0, 4.162360804807008, 10.0, 0.0, 0.0, 0.0,  
                    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.591987420171765, 0.0,  
                    6.260772028089578, 0.0, 3.9001090452718263,  
                    9.510687873625045, 0.0, 6.9091288131198,  
                    5.592492853263656, 5.34261407690923, 0.0, 0.0,  
                    2.622660011934463, 6.6161879188487145,  
                    4.499426526855138, 10.0, 10.0, 10.0,  
                    1.9911710766550492, 10.0, 2.782654677480031, 10.0, 0.0],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=1e+20),  
 'param_learning_rate': masked_array(data=[0.18389356500243567, 0.1837760871343867,  
                    0.17611476688426983, 0.10542056775904365,  
                    0.12271486876021563, 0.10493759107053531,  
                    0.148620619335163, 0.19630013363169313,  
                    0.10121106560067765, 0.14292157983236384, 0.2, 0.2,  
                    0.01, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.01,  
                    0.12775995090941342, 0.01, 0.2, 0.2, 0.2,  
                    0.12938278774875223, 0.01, 0.18234379067839676, 0.2,  
                    0.2, 0.017801315218918375, 0.2, 0.2, 0.2, 0.01, 0.2,  
                    0.2, 0.2, 0.01, 0.2, 0.2, 0.2, 0.2,  
                    0.19599867154877326, 0.2, 0.15071862855135104, 0.2,  
                    0.01],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=1e+20),  
 'param_max_depth': masked_array(data=[9, 7, 5, 6, 10, 4, 5, 8, 4, 7, 3, 3, 10, 10, 5, 10, 4,  
                    6, 3, 3, 10, 5, 10, 3, 4, 10, 10, 4, 3, 3, 3, 3, 8, 3,  
                    5, 3, 6, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 10, 9],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=999999),  
 'param_min_data_in_leaf': masked_array(data=[21, 22, 34, 84, 89, 59, 60, 35, 64, 62, 20, 20, 20, 20,  
                    20, 20, 100, 74, 20, 45, 27, 45, 50, 40, 73, 76, 67,  
                    46, 20, 30, 20, 42, 83, 20, 31, 50, 77, 32, 100, 41,  
                    20, 20, 20, 20, 55, 25, 20, 99, 24, 77],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=999999),  
 'param_n_estimators': masked_array(data=[165, 66, 111, 116, 186, 133, 82, 140, 71, 84, 50, 50,  
                    50, 200, 69, 50, 98, 189, 200, 200, 200, 50, 50, 143,  
                    50, 200, 172, 148, 200, 178, 200, 50, 193, 200, 161,  
                    149, 116, 79, 200, 200, 200, 200, 118, 200, 200, 127,  
                    200, 52, 200, 200],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=999999),  
 'param_num_leaves': masked_array(data=[87, 68, 69, 53, 137, 107, 114, 49, 36, 135, 31, 31, 31,  
                    31, 150, 31, 31, 60, 31, 150, 110, 116, 54, 96, 135,  
                    143, 64, 77, 31, 41, 31, 31, 147, 150, 31, 150, 150,  
                    115, 31, 31, 31, 31, 31, 31, 150, 130, 31, 145, 148,  
                    113],  
              mask=[False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False, False, False, False, False, False, False,  
                    False, False],  
        fill_value=999999),  
 'params': [OrderedDict([('bagging_fraction', 0.7423028931466307),  
               ('bagging_freq', 2),  
               ('feature_fraction', 0.5901778432085792),  
               ('lambda_l1', 4.417711632029077),  
               ('lambda_l2', 2.645483128765203),  
               ('learning_rate', 0.18389356500243567),  
               ('max_depth', 9),  
               ('min_data_in_leaf', 21),  
               ('n_estimators', 165),  
               ('num_leaves', 87)]),  
  ......  
  OrderedDict([('bagging_fraction', 0.6462725250962974),  
               ('bagging_freq', 4),  
               ('feature_fraction', 1.0),  
               ('lambda_l1', 0.0),  
               ('lambda_l2', 0.0),  
               ('learning_rate', 0.01),  
               ('max_depth', 9),  
               ('min_data_in_leaf', 77),  
               ('n_estimators', 200),  
               ('num_leaves', 113)])],  
 'split0_test_score': array([0.84615385, 0.8974359 , 0.82692308, 0.82051282, 0.82371795,  
        0.84615385, 0.80128205, 0.82692308, 0.80769231, 0.81410256,  
        0.86538462, 0.84615385, 0.87820513, 0.79487179, 0.77564103,  
        0.79487179, 0.5       , 0.69871795, 0.87820513, 0.80769231,  
        0.74358974, 0.94230769, 0.79487179, 0.89102564, 0.76282051,  
        0.73076923, 0.88461538, 0.78846154, 0.87820513, 0.94230769,  
        0.79487179, 0.74358974, 0.83974359, 0.79487179, 0.82051282,  
        0.78205128, 0.86858974, 0.74358974, 0.76923077, 0.80128205,  
        0.85897436, 0.76923077, 0.80769231, 0.8525641 , 0.76923077,  
        0.83974359, 0.78846154, 0.5       , 0.81410256, 0.80448718]),  
 ......  
 'split9_test_score': array([0.875     , 0.85416667, 0.93055556, 0.85416667, 0.82986111,  
        0.81597222, 0.84722222, 0.85416667, 0.86111111, 0.83333333,  
        0.9375    , 0.93055556, 0.91666667, 0.94444444, 0.92361111,  
        0.95138889, 0.5       , 0.90277778, 0.95833333, 0.93055556,  
        0.93055556, 0.85763889, 0.88888889, 0.89583333, 0.86805556,  
        0.88194444, 0.86805556, 0.90277778, 0.92361111, 0.84027778,  
        0.9375    , 0.85416667, 0.85069444, 0.90972222, 0.9375    ,  
        0.90277778, 0.86111111, 0.84722222, 0.86805556, 0.93055556,  
        0.88888889, 0.94444444, 0.94444444, 0.92361111, 0.93055556,  
        0.95138889, 0.91666667, 0.5       , 0.93055556, 0.83680556]),  
 'mean_test_score': array([0.84129274, 0.83084936, 0.8528312 , 0.81605235, 0.80491453,  
        0.80790598, 0.83018162, 0.83135684, 0.83723291, 0.82676282,  
        0.85902778, 0.87451923, 0.85229701, 0.87451923, 0.86255342,  
        0.8491453 , 0.5       , 0.81650641, 0.88071581, 0.85774573,  
        0.84845085, 0.8278312 , 0.84337607, 0.83066239, 0.84139957,  
        0.83979701, 0.82393162, 0.85598291, 0.86720085, 0.83261218,  
        0.86367521, 0.82804487, 0.8178953 , 0.84620726, 0.85422009,  
        0.84535256, 0.79513889, 0.85288462, 0.81068376, 0.85689103,  
        0.84770299, 0.87366453, 0.86009615, 0.86853632, 0.85747863,  
        0.86116453, 0.84391026, 0.5       , 0.86372863, 0.79909188]),  
 'std_test_score': array([0.08750367, 0.07546453, 0.09216044, 0.07704383, 0.07383111,  
        0.07474289, 0.07184162, 0.08144329, 0.07742191, 0.07805255,  
        0.09560397, 0.05879003, 0.07462446, 0.08397139, 0.07797208,  
        0.07589878, 0.        , 0.08654942, 0.08115231, 0.08016027,  
        0.08754168, 0.07436563, 0.0841738 , 0.087052  , 0.06630022,  
        0.0739091 , 0.0809315 , 0.08322615, 0.07442955, 0.07927268,  
        0.08552709, 0.06945326, 0.07752712, 0.08794703, 0.10279878,  
        0.07906136, 0.07457856, 0.08119138, 0.06868138, 0.09512041,  
        0.07786052, 0.07988576, 0.08554799, 0.07778389, 0.08261794,  
        0.09295684, 0.10166394, 0.        , 0.07831002, 0.08603959]),  
 'rank_test_score': array([29, 34, 19, 43, 46, 45, 36, 33, 31, 39, 12,  3, 20,  2,  9, 21, 49,  
        42,  1, 13, 22, 38, 27, 35, 28, 30, 40, 16,  6, 32,  8, 37, 41, 24,  
        17, 25, 48, 18, 44, 15, 23,  4, 11,  5, 14, 10, 26, 49,  7, 47])}

This is the hyperparameter-tuning record produced during cross-validation. Because it is very long, the ellipses stand for the repeated per-fold entries (see the full code output). It includes the test score of each split (splitX_test_score), each hyperparameter's sampled values (the param_* entries, e.g. param_bagging_fraction, param_max_depth), the mean fit and score times (mean_fit_time and mean_score_time) with their standard deviations (std_fit_time and std_score_time), and the final test scores and rankings (mean_test_score and rank_test_score). Each entry shows how the model performed under a given hyperparameter configuration and is used to select the best parameters and model configuration.
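For easier browsing, the same dictionary converts cleanly into a DataFrame, a pattern scikit-learn's own documentation suggests; a minimal sketch:

# Tabular view of the tuning record, best configurations first  
cv_df = pd.DataFrame(cv_results_lgbm).sort_values('rank_test_score')  
print(cv_df[['mean_test_score', 'std_test_score', 'rank_test_score']].head())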

  
# Locate the best configuration (rank_test_score == 1) in the tuning record  
best_rank_index = np.argmin(cv_results_lgbm['rank_test_score'])  
# Extract the best model's per-fold scores  
max_scores_lgbm = [cv_results_lgbm[f'split{i}_test_score'][best_rank_index] for i in range(10)]  
print("Per-fold AUC of the best parameters:", max_scores_lgbm)  
mean_score_lgbm = np.mean(max_scores_lgbm)  # mean across folds  
std_score_lgbm = np.std(max_scores_lgbm)  # standard deviation across folds  
print("Mean of the best scores (across folds):", mean_score_lgbm)  
print("Standard deviation of the best scores (across folds):", std_score_lgbm)

Extract the AUC score of the best parameter configuration on each cross-validation fold, then compute and print the mean and standard deviation of these scores.

  
Per-fold AUC of the best parameters: [0.8782051282051282, 0.8269230769230769, 0.8012820512820513, 0.7243589743589743, 0.8333333333333334, 0.888888888888889, 1.0, 0.9652777777777778, 0.9305555555555556, 0.9583333333333334]  
Mean of the best scores (across folds): 0.880715811965812  
Standard deviation of the best scores (across folds): 0.08115230532038907


In the 10-fold cross-validation, the best configuration's AUC is computed on every fold; a higher AUC means better performance. The mean of these per-fold AUCs is 0.8807, summarizing the model's overall performance across folds, and it is this maximum mean AUC that the hyperparameter search uses to pick the best parameters. The standard deviation is 0.0812, reflecting the stability of performance: a low standard deviation means the model behaves consistently across folds, while a high one indicates larger fold-to-fold fluctuations. Notably, the reference paper also reports this standard deviation when evaluating models.
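These two numbers are simply the mean_test_score and std_test_score stored for the winning configuration, which can be verified directly (an optional consistency check, not in the original code):

# The fold-wise mean/std recomputed above must match the stored values  
assert np.isclose(mean_score_lgbm, cv_results_lgbm['mean_test_score'][best_rank_index])  
assert np.isclose(std_score_lgbm, cv_results_lgbm['std_test_score'][best_rank_index])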

Next, the same workflow is applied to build and optimize the XGBoost and RF models, as sketched below.
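The following is a condensed sketch of that step rather than code from the original article: the search spaces are illustrative choices, not values from the paper, and the helper tune_model is hypothetical. It exists only to define the names the later stacking and plotting code assumes (best_xgb_model, best_rf_model, and the corresponding mean/std scores).

from xgboost import XGBClassifier  
from sklearn.ensemble import RandomForestClassifier  
  
def tune_model(model, space):  
    # Same settings as the LightGBM search: 50 iterations, 10-fold CV, AUC  
    search = BayesSearchCV(model, space, n_iter=50, cv=10, n_jobs=-1,  
                           scoring='roc_auc', random_state=random_seed)  
    search.fit(X_train, y_train)  
    cv = search.cv_results_  
    idx = np.argmin(cv['rank_test_score'])  # rank 1 marks the best configuration  
    scores = [cv[f'split{i}_test_score'][idx] for i in range(10)]  
    return search.best_estimator_, np.mean(scores), np.std(scores)  
  
# Illustrative search spaces (assumptions, not taken from the paper)  
xgb_space = {  
    'n_estimators': Integer(50, 200),  
    'learning_rate': Real(0.01, 0.2),  
    'max_depth': Integer(3, 10),  
    'subsample': Real(0.5, 1.0),  
    'colsample_bytree': Real(0.5, 1.0),  
    'reg_alpha': Real(0.0, 10.0),  
    'reg_lambda': Real(0.0, 10.0),  
}  
rf_space = {  
    'n_estimators': Integer(50, 200),  
    'max_depth': Integer(3, 10),  
    'min_samples_split': Integer(2, 20),  
    'min_samples_leaf': Integer(1, 20),  
    'max_features': Real(0.1, 1.0),  
}  
  
best_xgb_model, mean_score_xgb, std_score_xgb = tune_model(  
    XGBClassifier(random_state=random_seed, eval_metric='logloss'), xgb_space)  
best_rf_model, mean_score_rf, std_score_rf = tune_model(  
    RandomForestClassifier(random_state=random_seed), rf_space)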

  
from sklearn.model_selection import cross_val_score  
from sklearn.linear_model import LogisticRegression  
from sklearn.ensemble import StackingClassifier  
  
# Initialize the base learners  
base_learners = [  
    ('lgbm', best_lgbm_model),  
    ('xgb', best_xgb_model),  
    ('rf', best_rf_model)  
]  
  
# Use LogisticRegression with L1 regularization (Lasso-style) as the meta-model  
meta_model = LogisticRegression(penalty='l1', solver='liblinear', random_state=random_seed)  
  
# Create the StackingClassifier  
stacking_model = StackingClassifier(  
    estimators=base_learners,  
    final_estimator=meta_model,  
    stack_method='predict_proba',  # feed predicted probabilities to the meta-model  
    n_jobs=-1  
)  
  
# Fit the stacking model  
print("Training StackingClassifier...")  
stacking_model.fit(X_train, y_train)

LightGBM, XGBoost, and RF serve as base learners, and logistic regression with L1 regularization (Lasso-style) is used as the meta-model to build and train the stacking classifier. The paper uses Lasso regression as the meta-model, but passing Lasso directly to Python's StackingClassifier raises an error, because Lasso is a regression model for continuous targets. To work around this, logistic regression is used as the binary-classification meta-model, with L1 regularization giving it a Lasso-like effect, thereby realizing the stacking model.
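As an optional inspection step not in the original workflow, the fitted meta-model is exposed through the standard final_estimator_ attribute. With stack_method='predict_proba' and a binary target, scikit-learn keeps one probability column per base learner, so the L1-regularized coefficients show how much weight each base model receives:

# One Lasso-style weight per base learner's predicted probability  
meta_coefs = stacking_model.final_estimator_.coef_.ravel()  
for (name, _), w in zip(base_learners, meta_coefs):  
    print(f"Meta-model weight for {name}: {w:.4f}")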

  
from sklearn.model_selection import KFold  
# Create a KFold splitter with an explicit random seed  
kf = KFold(n_splits=10, shuffle=True, random_state=random_seed)  
  
# Run 10-fold cross-validation and compute AUC  
print("Performing 10-fold cross-validation with AUC...")  
cv_auc_scores = cross_val_score(  
    stacking_model,  
    X_train,  
    y_train,  
    cv=kf,  # use the custom KFold for cross-validation  
    scoring='roc_auc',  # AUC as the evaluation metric  
    n_jobs=-1  # parallel computation  
)  
  
# Compute the mean and standard deviation  
mean_score_stacking = np.mean(cv_auc_scores)  
std_score_stacking = np.std(cv_auc_scores)  
  
# Print the stacking model's mean and standard deviation  
print("Mean of the best AUC scores (across folds) for stacking model:", mean_score_stacking)  
print("Standard deviation of the best AUC scores (across folds) for stacking model:", std_score_stacking)

Evaluate the stacking model with 10-fold cross-validation, computing and printing the mean and standard deviation of the per-fold AUC scores to gauge its performance and stability across folds; the interpretation is the same as for the single models above.

  
Performing 10-fold cross-validation with AUC...  
Mean of the best AUC scores (across folds) for stacking model: 0.917703513806455  
Standard deviation of the best AUC scores (across folds) for stacking model: 0.0732624895779605
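One design note: because SMOTE has already balanced the training labels, a plain shuffled KFold is acceptable here. If you prefer to preserve class ratios within every fold anyway, StratifiedKFold is a drop-in replacement; the variant below is a sketch of that alternative, not something from the paper:

from sklearn.model_selection import StratifiedKFold  
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_seed)  
cv_auc_scores_strat = cross_val_score(stacking_model, X_train, y_train,  
                                      cv=skf, scoring='roc_auc', n_jobs=-1)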
  
models_name = ['LGBM', 'XGB', 'RF', 'Stacking']  
mean_scores = [mean_score_lgbm, mean_score_xgb, mean_score_rf, mean_score_stacking]  
std_scores = [std_score_lgbm, std_score_xgb, std_score_rf, std_score_stacking]  
fig, ax1 = plt.subplots(figsize=(8, 6))  
ax1.set_xlabel('Models', fontsize=18, weight='bold')  
ax1.set_ylabel('Mean Score', fontsize=18, weight='bold', color='b')  
ax1.bar(models_name, mean_scores, color='b', alpha=0.6, label='Mean Score')  
ax1.tick_params(axis='y', labelsize=18, labelcolor='b')  
# Set the left y-axis range  
ax1.set_ylim(0.7, 0.9)    
# Set the left y-axis tick positions and interval  
ax1.set_yticks(np.arange(0.7, 1.0, 0.05))    
ax2 = ax1.twinx()  
ax2.set_ylabel('Standard Deviation', fontsize=18, weight='bold', color='r')  
ax2.plot(models_name, std_scores, color='r', marker='o', label='Std Score')  
ax2.tick_params(axis='y', labelsize=18, labelcolor='r')  
plt.title('ROC 10-fold', fontsize=18, weight='bold')  
plt.savefig("ROC 10-fold.pdf", format='pdf', bbox_inches='tight', dpi=1200)  
plt.show()

[Figure: mean AUC (blue bars, left axis) and standard deviation (red line, right axis) of each model under 10-fold cross-validation]

The figure shows each model's mean AUC under 10-fold cross-validation (blue bars) and the standard deviation (red line) for LGBM, XGB, RF, and Stacking. XGB ranks third on mean AUC but has the lowest standard deviation, i.e. the steadiest performance, while the Stacking model has the highest AUC with a fairly small standard deviation. These scores are, of course, K-fold results on the training set. Next we look at concrete performance on the training, test, and external validation sets; the external validation set here is just a simulated dataset with no real-world meaning, used only to demonstrate how external validation metrics are computed.

  
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, brier_score_loss, f1_score,  
                             jaccard_score, cohen_kappa_score, matthews_corrcoef, precision_score,  
                             recall_score, roc_auc_score, confusion_matrix)  
  
# Function computing the evaluation metrics  
def calculate_metrics(model, X, y):  
    # Get predictions and predicted probabilities  
    y_pred = model.predict(X)  
    y_pred_proba = model.predict_proba(X)[:, 1]  # predicted probabilities (for AUC)  
  
    # Confusion matrix  
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()  
  
    # Compute the individual metrics  
    accuracy = accuracy_score(y, y_pred)  
    balanced_accuracy = balanced_accuracy_score(y, y_pred)  
    brier_score = brier_score_loss(y, y_pred_proba)  
    f1 = f1_score(y, y_pred)  
    jaccard = jaccard_score(y, y_pred)  
    kappa = cohen_kappa_score(y, y_pred)  
    matthews = matthews_corrcoef(y, y_pred)  
    ppv = precision_score(y, y_pred)  # positive predictive value (precision)  
    npv = tn / (tn + fn)  # negative predictive value  
    precision = ppv  
    recall = recall_score(y, y_pred)  
    auc = roc_auc_score(y, y_pred_proba)  
  
    # Return the metrics (prevalence not included)  
    return {  
        'Accuracy': accuracy,  
        'Balanced Accuracy': balanced_accuracy,  
        'Brier Score': brier_score,  
        'F1 Score': f1,  
        'Jaccard Index': jaccard,  
        'Cohen Kappa': kappa,  
        'Matthews Corr Coeff': matthews,  
        'PPV': ppv,  
        'NPV': npv,  
        'Precision': precision,  
        'Recall': recall,  
        'AUC': auc  
    }  
  
# Collect the models to evaluate  
models = {  
    'LGBM': best_lgbm_model,  
    'XGB': best_xgb_model,  
    'RF': best_rf_model,  
    'Stacking': stacking_model  
}  
  
results_test = {}  
  
for model_name, model in models.items():  
    print(f"Evaluating {model_name} on test set...")  
    metrics_test = calculate_metrics(model, X_test, y_test)  
    results_test[model_name] = metrics_test  
  
metrics_df_test = pd.DataFrame(results_test)  
# Load the external validation data  
path = r"外部验证-2025-8-23公众号Python机器学习AI.xlsx"  
external_validation_data = pd.read_excel(path)  
X_external_validation = external_validation_data.drop(columns=['Electrical_cardioversion'])  
y_external_validation = external_validation_data['Electrical_cardioversion']  
# Store each model's results on the external validation set  
results_external_validation = {}  
  
for model_name, model in models.items():  
    print(f"Evaluating {model_name} on external validation set...")  
    metrics_external_validation = calculate_metrics(model, X_external_validation, y_external_validation)  
    results_external_validation[model_name] = metrics_external_validation  
  
metrics_df_external_validation = pd.DataFrame(results_external_validation)  
# Store each model's results on the training set  
results_train = {}  
  
for model_name, model in models.items():  
    print(f"Evaluating {model_name} on training set...")  
    metrics_train = calculate_metrics(model, X_train, y_train)  
    results_train[model_name] = metrics_train  
  
metrics_df_train = pd.DataFrame(results_train)

This computes a set of evaluation metrics (accuracy, AUC, F1 score, etc.) for the four models (LGBM, XGB, RF, Stacking) on the test, external validation, and training sets, storing each model's results per dataset in a DataFrame.
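To compare one model across the three datasets at a glance, the columns of these DataFrames can be lined up side by side; a small optional sketch using the objects built above:

# Side-by-side view of the Stacking model's metrics on each dataset  
summary = pd.concat({'Train': metrics_df_train['Stacking'],  
                     'Test': metrics_df_test['Stacking'],  
                     'External': metrics_df_external_validation['Stacking']},  
                    axis=1)  
print(summary.round(3))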

  
import seaborn as sns  
  
plt.rcParams.update({'font.size': 18, 'font.weight': 'bold'})  
plt.figure(figsize=(10, 8))  
sns.heatmap(metrics_df_test, annot=True, cmap='viridis', fmt='.3f', cbar=True)  
plt.title("Model Performance Metrics on Test Set", fontsize=18, weight='bold')  
plt.savefig("Model Performance Metrics on Test Set.pdf", format='pdf', bbox_inches='tight', dpi=1200)  
plt.rcParams.update({'font.size': 18, 'font.weight': 'bold'})  
plt.figure(figsize=(10, 8))  
sns.heatmap(metrics_df_external_validation, annot=True, cmap='viridis', fmt='.3f', cbar=True)  
plt.title("Model Performance Metrics on External Validation Set", fontsize=18, weight='bold')  
plt.savefig("Model Performance Metrics on External Validation Set.pdf", format='pdf', bbox_inches='tight', dpi=1200)  
plt.rcParams.update({'font.size': 18, 'font.weight': 'bold'})  
plt.figure(figsize=(10, 8))  
sns.heatmap(metrics_df_train, annot=True, cmap='viridis', fmt='.3f', cbar=True)  
plt.title("Model Performance Metrics on Training Set", fontsize=18, weight='bold')  
plt.savefig("Model Performance Metrics on Training Set.pdf", format='pdf', bbox_inches='tight', dpi=1200)  
plt.show()

[Figure: heatmaps of model performance metrics on the test, external validation, and training sets]

Finally, three heatmaps show the models' performance metrics (accuracy, AUC, etc.) on the training, test, and external validation sets. The results come from simulated data, so performance differs across datasets; the charts illustrate the performance differences of the stacking model across datasets discussed in the paper.

It is worth emphasizing that a stacking model is not necessarily better than a single model. What matters is the characteristics of each dataset: the best model for a given dataset may well be a single model rather than a stack. In many cases stacking does improve performance, but it also adds complexity, and not every dataset benefits from it. Choosing the model that best fits the data, rather than blindly pursuing stacking, is the key to better performance. The next installment will cover the paper's ROC curves and model interpretation in detail.

The official account also holds many more hands-on journal reproduction techniques for machine learning, which you can find by searching the historical articles: follow the account, then tap "Send Message" > "History" to search all of its posts.


✨ Case Files for This Article ✨


As with previous posts, the files uploaded to the discussion group analyze the case step by step so that readers achieve the best learning outcome. All content comes with detailed explanations to help readers understand the modeling process and the data analysis steps and get the most out of the material.

Combined with the free AI aggregation site we provide, readers can move between theory and practice and master the core concepts more comprehensively.

✨ Introduction ✨

That concludes this section. If you want to learn data analysis and Python machine learning, you are welcome to visit the Taobao shop Python机器学习AI; the shop's QR code below gives access to the author's collected posts. The collection currently contains some 300 articles, and purchasing it also includes free, stable access to large AI models.

Updates include data, code, comments, and reference materials. The author shares case projects only and does not offer additional Q&A support. Projects come with detailed code comments and thorough walkthroughs to help you understand each step. Please ask questions before purchasing to avoid unnecessary issues.

✨ Reader Feedback ✨

[Image: feedback from group members]

✨ Taobao Shop ✨

[Image: Taobao shop QR code]

Scan the QR code above in Taobao to enter the shop and find more Python machine learning and AI content. We hope it helps your learning journey!

Previous Recommendations

Journal reproduction: combined SHAP scatter plots and box plots for data mixing continuous and categorical features

Journal reproduction: splitting a multiclass task into binary tasks and stacking predictions to improve performance

Journal figures: a walkthrough of SHAP feature-importance bar chart visualizations

Journal figures: advanced combinations of bar charts and beeswarm plots for SHAP-based interpretability

Journal figures: stacked mean |SHAP| visualization of different intervals' contributions to the model

Journal reproduction: using UMAP to visualize how a deep model's discriminative power evolves over training

Journal figures: comparing PCA, t-SNE, and UMAP for simplifying high-dimensional data

Science journal reproduction: classification, regression, and SHAP analysis revealing variable effects on the target from multiple angles

Multi-model SHAP + PDP interpretation of a stacking ensemble: interpretability and tuning from base learners to meta-learner

If you are interested in articles like this one,

feel free to follow, like, and share~

Personal opinions, for reference only
