Journal Figures: A Bidirectional SHAP Bar Chart Reveals Per-Class Feature Contributions


Background


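In machine learning, the interpretability of "black-box" models receives a great deal of attention. A model may predict impressively well, but if the decision logic behind its predictions cannot be explained, it is hard to put into practice. This is especially true in fields such as agriculture and medicine, which demand transparent decisions and therefore need a more intuitive way to understand how each feature contributes to the model's predictions.

This article uses a bidirectional bar chart of SHAP values to present feature importance in a complex model in a simple, intuitive way. Using simulated data, it reproduces a visualization similar to Figure A, shows the feature importance for each class, and discusses how this form of visualization helps interpret the model's decision mechanism. Hopefully it offers some inspiration for SHAP-based visualization of model interpretability.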

Code Implementation

Model Construction


          
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False

df = pd.read_excel('2025-1-14公众号Python机器学习AI.xlsx')

# Split features and target variable
X = df.drop(['y'], axis=1)
y = df['y']

# Split into training and test sets (70/30, stratified by class)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42, stratify=y)

# Define the XGBoost binary classifier
# (use_label_encoder is deprecated in recent XGBoost releases and can be dropped there)
model_xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=8)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Define stratified K-fold cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

# Use grid search to find the best parameters
grid_search = GridSearchCV(estimator=model_xgb, param_grid=param_grid, scoring='accuracy',
                           cv=kfold, verbose=1, n_jobs=-1)

# Run the search on the training set
grid_search.fit(X_train, y_train)

# Retrieve the best model; GridSearchCV has already refit it on the full
# training set with the best parameters (refit=True is the default)
xgboost = grid_search.best_estimator_

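Grid search with cross-validation tunes the hyperparameters of the XGBoost classifier, and the model is then fitted on the training set with the best parameters, ready for the prediction and analysis steps that follow.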
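Before moving on to SHAP, it can help to check what the search selected and how the tuned model does on the held-out split. The block below is only a minimal sketch of that check: it assumes synthetic data from `make_classification` in place of the article's Excel file, and it swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it runs without extra dependencies. The names `X_demo`, `search`, `best_model` and `test_accuracy` are illustrative and do not appear in the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic binary data standing in for the article's Excel file (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Deliberately small grid to keep the demo fast;
# GradientBoostingClassifier is a stand-in for XGBClassifier (assumption)
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=8),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=8),
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr (refit=True is the default)
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

print("best params:", search.best_params_)
print(f"mean CV accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy: {test_accuracy:.3f}")
```

With the article's own objects, the same three quantities would come from `grid_search.best_params_`, `grid_search.best_score_`, and `accuracy_score(y_test, xgboost.predict(X_test))`.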

Computing and Organizing SHAP Values


          
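import shap

explainer = shap.TreeExplainer(xgboost)
# Compute SHAP values as a numpy array
# (for this binary model the shape is n_samples x n_features)
shap_values_numpy = explainer.shap_values(X)
# Reuse X's index so the rows stay aligned with df['y'] in the groupby below
shap_values_df = pd.DataFrame(shap_values_numpy, columns=X.columns, index=X.index)

# Take the absolute value of the SHAP values
shap_values_abs = shap_values_df.abs()
# Group by the original labels df['y'] and average each feature's absolute contribution
mean_abs_contributions = shap_values_abs.groupby(df['y']).mean()
# Transpose so that rows are features and columns are classes
mean_abs_contributions_transposed = mean_abs_contributions.T
mean_abs_contributions_transposed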
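To make the aggregation step concrete, here is a small worked example on a hand-made matrix. The numbers, the feature names `f1` and `f2`, and the variables `toy_shap`, `toy_y` and `toy_summary` are made up for illustration and are not taken from the article's data.

```python
import numpy as np
import pandas as pd

# Toy SHAP matrix: 4 samples x 2 features (made-up numbers)
toy_shap = pd.DataFrame(
    [[0.2, -0.1], [-0.4, 0.3], [0.6, -0.5], [-0.8, 0.1]],
    columns=["f1", "f2"],
)
# Class label of each sample: the first two belong to class 0, the last two to class 1
toy_y = pd.Series([0, 0, 1, 1], name="y")

# Same steps as above: absolute value, group rows by class label, average, transpose
toy_summary = toy_shap.abs().groupby(toy_y).mean().T
print(toy_summary)
```

For class 0, `f1` averages (0.2 + 0.4) / 2 = 0.3 and `f2` averages (0.1 + 0.3) / 2 = 0.2; for class 1 the corresponding values are 0.7 and 0.3. The resulting table has one row per feature and one column per class.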

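A features-by-classes table like this one can be turned into a bidirectional bar chart by drawing one class to the left of zero and the other to the right. The block below is one possible sketch of that idea, not necessarily the layout used for the article's figure. The table values, the feature names, and the variable names (`contrib`, `left_bars`, `right_bars`) are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Hypothetical mean |SHAP| table: rows are features, columns are classes 0 and 1
contrib = pd.DataFrame(
    {0: [0.30, 0.20, 0.05], 1: [0.70, 0.30, 0.10]},
    index=["feature_a", "feature_b", "feature_c"],
)

# Sort ascending by total contribution so the largest feature is drawn at the top
contrib = contrib.loc[contrib.sum(axis=1).sort_values().index]

fig, ax = plt.subplots(figsize=(6, 3))
# Class 0 is negated so its bars extend to the left; class 1 extends to the right
left_bars = ax.barh(contrib.index, -contrib[0], color="tab:blue", label="class 0")
right_bars = ax.barh(contrib.index, contrib[1], color="tab:red", label="class 1")
ax.axvline(0, color="black", linewidth=0.8)

# Label the x-axis with magnitudes, since the sign only encodes the class
ax.xaxis.set_major_formatter(FuncFormatter(lambda v, _pos: f"{abs(v):.2f}"))
ax.set_xlabel("mean |SHAP value|")
ax.legend()
fig.tight_layout()
# fig.savefig("shap_bidirectional_bar.png", dpi=300)  # or plt.show()
```

To use it with the real results, `contrib` would be replaced by `mean_abs_contributions_transposed`, with the two column labels adjusted to match the class labels in `df['y']`.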