Improving Machine Learning Accuracy: Optimizing Feature Selection with SHAP Values and Monte Carlo Simulation

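Background

Feature selection and feature combination are key steps in improving model performance during machine learning development. This article looks at how feature contributions can be used to improve model accuracy: it combines an XGBoost model, SHAP value analysis, and Monte Carlo simulation to show how the model performs with different feature subsets.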

Code Implementation

Reading and Preparing the Data


          
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv('Chabuhou.csv')

# Separate the features from the target variable
X = df.drop(['Electrical_cardioversion'], axis=1)
y = df['Electrical_cardioversion']

# Split into training and test sets (80/20), stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

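The example uses an electrical cardioversion dataset; the goal is to predict whether a patient needs electrical cardioversion under given conditions. The feature variables (such as age, heart rate, and blood pressure) have already been cleaned and preprocessed, and the data is split 80/20 into a training set and a test set, stratified on the target so that both sets keep the same class proportions.

Model Selection and Hyperparameter Tuning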


          
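import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Base XGBoost parameters
params_xgb = {
    'learning_rate': 0.02,            # Step-size shrinkage; smaller values help against overfitting. Typical range: 0.01 - 0.1
    'booster': 'gbtree',              # Boosting method: gradient boosted trees
    'objective': 'binary:logistic',   # Logistic loss for binary classification
    'max_leaves': 127,                # Maximum number of leaves per tree; larger values allow more complex trees but may overfit
    'verbosity': 1,                   # Verbosity of XGBoost messages: 0 = silent, 1 = warnings
    'seed': 42,                       # Random seed, for reproducible results
    'nthread': -1,                    # Number of parallel threads; -1 uses all available CPU cores
    'colsample_bytree': 0.6,          # Fraction of features sampled for each tree, to improve generalization
    'subsample': 0.7,                 # Fraction of training rows sampled at each boosting iteration, to improve generalization
    'eval_metric': 'logloss'          # Evaluation metric: log loss
}

# Initialize the XGBoost classifier
model_xgb = xgb.XGBClassifier(**params_xgb)

# Parameter grid for the search (learning_rate here overrides the base value above)
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],  # Number of trees
    'max_depth': [3, 4, 5, 6, 7],               # Maximum tree depth
    'learning_rate': [0.01, 0.02, 0.05, 0.1],   # Learning rate
}

# Grid search with k-fold cross-validation
grid_search = GridSearchCV(
    estimator=model_xgb,
    param_grid=param_grid,
    scoring='neg_log_loss',  # Negative log loss (higher is better)
    cv=5,                    # 5-fold cross-validation
    n_jobs=-1,               # Run fits in parallel
    verbose=1                # Print progress information
)

# Run the grid search on the training set
grid_search.fit(X_train, y_train)

# Report the best parameters and the corresponding log loss
print("Best parameters found: ", grid_search.best_params_)
print("Best Log Loss score: ", -grid_search.best_score_)

# best_estimator_ has already been refit on the full training set
# with the best parameters (refit=True is the GridSearchCV default)
best_model = grid_search.best_estimator_

pred = best_model.predict(X_test)           # Predict on the test set
print(classification_report(y_test, pred))  # Precision, recall, F1 and support per class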

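The model is an XGBoost classifier. A grid search with 5-fold cross-validation tunes the learning rate, the tree depth, and the number of trees, and the code then prints the detailed evaluation metrics of the resulting best model on the test set.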
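To get a feel for the cost of this search, the following sketch simply enumerates the grid from the article and counts the model fits it implies. It uses only the standard library; the variable names (`combinations`, `n_cv_fits`, and so on) are introduced here for illustration and are not part of the original code.

```python
from itertools import product

# Same grid as in the article
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7],
    'learning_rate': [0.01, 0.02, 0.05, 0.1],
}
cv_folds = 5

# Every combination of values that an exhaustive grid search would try
combinations = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

n_candidates = len(combinations)
n_cv_fits = n_candidates * cv_folds   # one fit per candidate per fold
n_total_fits = n_cv_fits + 1          # plus the final refit on the full training set

print(n_candidates, n_cv_fits, n_total_fits)
```

With 5 x 5 x 4 values and 5 folds, that is 100 candidates and 500 cross-validation fits, so extending the grid with another parameter multiplies the running time accordingly.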

Analyzing Feature Contributions with SHAP Values

Visualizing the Feature Contribution Ranking on the Test Set


          
import shap

# Build the SHAP explainer for the tuned tree model
explainer = shap.TreeExplainer(best_model)

# Compute SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Feature labels
labels = X_train.columns

plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = 'Times New Roman'
plt.rcParams['font.size'] = 13

# Bar chart of the mean absolute SHAP value of each feature
plt.figure(figsize=(15, 5))
shap.summary_plot(shap_values, X_test, plot_type="bar", show=False)
plt.title("X_test")
plt.xlabel('')
plt.tight_layout()
plt.show()

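SHAP values are a way of explaining model predictions: they attribute each prediction to the individual features. Computing each feature's contribution shows which features influence the model most, and ranking the features by that contribution lets us pick out the important ones.

Ranking the Features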
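To make the idea of a per-feature contribution concrete, here is a minimal brute-force sketch of the Shapley value definition on a tiny hypothetical model. It is not what `shap.TreeExplainer` does internally (that uses a far more efficient tree-specific algorithm and handles absent features differently); `toy_model`, `shapley_values` and the all-zero baseline are assumptions made purely for illustration.

```python
from itertools import combinations
from math import factorial

def toy_model(x):
    # Hypothetical model: one additive term and one interaction term
    return 2.0 * x[0] + x[1] * x[2]

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets (exponential cost)."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real value; the rest use the baseline
        mixed = [x[k] if k in subset else baseline[k] for k in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        contribution = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                contribution += weight * (with_i - without_i)
        phi.append(contribution)
    return phi

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions add up to the gap between the prediction and the baseline prediction, and the interaction term is shared equally between the two features involved in it. Averaging the absolute contributions over many samples gives the kind of ranking shown in the bar chart.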


          
# Mean absolute SHAP value of each feature
mean_shap_values = np.abs(shap_values).mean(axis=0)

# Feature labels
labels = X_train.columns

# Sort the features by mean absolute SHAP value
sorted_indices = np.argsort(mean_shap_values)[::-1]  # Descending order
sorted_labels = labels[sorted_indices]
sorted_shap_values = mean_shap_values[sorted_indices]

# Reorder the feature columns by their contribution to the model
X = df.drop(['Electrical_cardioversion'], axis=1).iloc[:, sorted_indices]

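sorted_indices is used to reorder the dataset's columns so that the most important features come first. With this ordering, the first i columns are always the i highest-ranked features, which makes it easy to evaluate the model on growing feature subsets: the stepwise feature addition and the Monte Carlo simulation below both add the features with the largest contributions first.

Monte Carlo Simulation and Feature Combinations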
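The ranking step can be checked by hand on a small example. The sketch below uses an invented 4 x 3 SHAP matrix and hypothetical column names (`age`, `heart_rate`, `blood_pressure`); only the `np.abs(...).mean(axis=0)` and `np.argsort(...)[::-1]` pattern is taken from the article.

```python
import numpy as np

# Hypothetical SHAP matrix: 4 samples x 3 features (values invented for illustration)
feature_names = np.array(['age', 'heart_rate', 'blood_pressure'])
toy_shap = np.array([
    [ 0.10, -0.40,  0.05],
    [-0.20,  0.30,  0.00],
    [ 0.10, -0.50, -0.05],
    [ 0.00,  0.40,  0.10],
])

# One importance score per feature: the mean absolute SHAP value over samples
mean_abs = np.abs(toy_shap).mean(axis=0)

# argsort is ascending, so reverse it to go from most to least important
order = np.argsort(mean_abs)[::-1]

print(feature_names[order])
print(mean_abs[order])
```

Here `heart_rate` comes first although its SHAP values have mixed signs: taking the absolute value before averaging keeps positive and negative contributions from cancelling out.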


          
from sklearn.model_selection import cross_val_score

# Features (ordered by SHAP importance) and target variable
X = df.drop(['Electrical_cardioversion'], axis=1).iloc[:, sorted_indices]
y = df['Electrical_cardioversion']

# Set the random seed
np.random.seed(42)
n_features = X.shape[1]
mc_no = 20  # Number of Monte Carlo runs
cv_scores = np.zeros(n_features)  # Accumulated cross-validation scores

# All parameters of the best model
best_params = best_model.get_params()

# Keep only the parameters of interest: the base settings plus the grid-search results
param_names = [
    'learning_rate', 'booster', 'objective', 'max_leaves', 'verbosity',
    'seed', 'nthread', 'colsample_bytree', 'subsample', 'eval_metric',
    'n_estimators', 'max_depth',
]
params_xgb = {name: best_params[name] for name in param_names}

model = xgb.XGBClassifier(**params_xgb)

# Monte Carlo simulation
for j in np.arange(mc_no):
    # Draw a new random split in every run (only the training part is used below)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, train_size=0.8, random_state=j
    )

    # Add one feature at a time, in order of importance, and cross-validate
    for i in range(1, n_features + 1):
        X_train_subset = X_train.iloc[:, :i]
        scores = cross_val_score(model, X_train_subset, y_train, cv=5, scoring='accuracy', n_jobs=-1)
        cv_scores[i - 1] += scores.mean()

# Average the cross-validation scores over all runs
cv_scores /= mc_no

# Plot the mean score against the number of features
plt.figure(figsize=(10, 6))
plt.plot(np.arange(1, n_features + 1), cv_scores)
plt.xlabel('Number of features selected')
plt.ylabel('Mean cross-validation accuracy')
plt.title('Feature Selection Impact on Model Performance (with Cross Validation)')
plt.grid(True)
plt.tight_layout()
plt.show()

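With the feature contributions established, we use Monte Carlo simulation to add features one at a time and record the model's cross-validation score for each feature subset. Plotting these scores shows directly how the number of features affects model performance. In this example, more features are not always better: the accuracy reaches a peak at a certain number of features. For background, see the article on the curse of dimensionality and the peaking-phenomenon experiment.

Monte Carlo simulation is used here because random sampling and repeated experiments help account for uncertainty in the data. Concretely, the experiment is repeated over 20 different random train/test splits and the scores are averaged, which gives a more robust estimate of how the model performs with each feature subset than a single split would, and so supports better feature selection and better overall model performance.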


          
# Find the best number of features
optimal_feature_count = np.argmax(cv_scores) + 1  # +1 because indices start at 0
optimal_features = X.columns[:optimal_feature_count]  # Column names of the selected features

# Report the best number of features and their names
print("Optimal number of features:", optimal_feature_count)
print("Optimal features:", optimal_features.tolist())  # Names of the selected features
print("Best CV score:", cv_scores[optimal_feature_count - 1])  # Best mean cross-validation score
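The simulation loop keeps only a running sum, so the spread between runs is lost. As a possible variation, the per-run scores could be stored in a `(mc_no, n_features)` array and summarized afterwards. The sketch below shows that summary on invented numbers; `summarize_runs` and `toy_scores` are hypothetical names, and the scores are made up for illustration, not results from the dataset.

```python
import numpy as np

def summarize_runs(score_matrix):
    """score_matrix[j, i] is the CV score of run j using the top i + 1 features."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    mean = score_matrix.mean(axis=0)
    std = score_matrix.std(axis=0, ddof=1)   # sample standard deviation across runs
    best_count = int(np.argmax(mean)) + 1    # +1 because indices start at 0
    return mean, std, best_count

# Invented scores for 3 runs x 4 feature counts, for illustration only
toy_scores = [
    [0.70, 0.78, 0.80, 0.79],
    [0.72, 0.80, 0.84, 0.81],
    [0.68, 0.76, 0.82, 0.80],
]

mean, std, best_count = summarize_runs(toy_scores)
print("Mean per feature count:", mean)
print("Std per feature count: ", std)
print("Best number of features:", best_count)
```

Knowing the standard deviation helps judge whether the peak is a real effect: if neighbouring feature counts differ by less than the run-to-run spread, the smaller feature set may be the safer choice.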