Random Forests in Biomedicine: Using Machine Learning to Screen Key Disease Genes with Interpretable Analysis


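This section covers random forests in biomedicine: how to use machine learning to screen key disease genes and make the analysis interpretable. The data are simulated and carry no real-world meaning. The code and figures reflect the author's own understanding of machine learning and are for reference only. The complete data and code are to be shared later in the author's discussion group.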

✨ Paper information ✨

picture.image

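This part of the paper applies a random forest model to aging-related differentially expressed genes (ARDEGs) and ranks them by feature importance. The mean decrease in Gini (MDG) measures how much each gene contributes to the classification, and the top-ranked genes are selected as hub genes for follow-up functional studies. This earlier paper uses random forest only briefly: it ranks aging-related genes by importance to help identify key candidate genes.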
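To make the MDG idea concrete, here is a minimal sketch of the quantity a single split contributes: the Gini impurity of the parent node minus the size-weighted impurity of its children. MDG accumulates these decreases over every split that uses a given gene and averages them across trees. The node contents below are hypothetical, and the function names are illustrative rather than part of any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

# Hypothetical node: 4 cases (1) and 4 controls (0), split on one gene's expression
parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = gini_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])
useless_split = gini_decrease(parent, [1, 1, 0, 0], [1, 1, 0, 0])
print(perfect_split, useless_split)
```

A gene whose splits separate the classes cleanly collects large decreases, while a gene whose splits leave the class mix unchanged collects none.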

picture.image

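This is a more recent paper. It applies three machine learning algorithms (LASSO, SVM-RFE and random forest) to select features among PET-MPs-related UC genes, identifies four key pathogenic genes (hub genes) from the intersection of the resulting gene sets, and adds SHAP for interpretability when building a UC risk prediction model. Compared with the earlier paper, it both combines several machine learning methods for gene screening and adds SHAP-based explanation. Below, an RF model is applied to simulated gene data for importance-based gene screening and model explanation. Reproducing the full workflow of the recent paper is left for the next chapter.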
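The intersection step can be sketched roughly as follows. This is not the paper's code: it runs on synthetic data from `make_classification`, the gene names are hypothetical, and the settings (`C=0.1`, keeping 10 genes per method) are arbitrary choices for illustration. An L1-penalised logistic regression stands in for LASSO on a binary outcome.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (gene names are hypothetical)
X_arr, y_demo = make_classification(n_samples=120, n_features=40, n_informative=6,
                                    n_redundant=0, random_state=0)
genes = [f"gene_{i}" for i in range(X_arr.shape[1])]
X_demo = pd.DataFrame(StandardScaler().fit_transform(X_arr), columns=genes)

# 1) LASSO-style selection: keep genes with non-zero L1-penalised coefficients
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, random_state=0)
lasso.fit(X_demo, y_demo)
lasso_genes = set(X_demo.columns[np.abs(lasso.coef_[0]) > 0])

# 2) SVM-RFE: recursive feature elimination driven by a linear SVM
svm_rfe = RFE(SVC(kernel="linear"), n_features_to_select=10, step=1)
svm_rfe.fit(X_demo, y_demo)
svm_genes = set(X_demo.columns[svm_rfe.support_])

# 3) Random forest: top 10 genes by impurity-based importance
rf_demo = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)
top10_idx = np.argsort(rf_demo.feature_importances_)[::-1][:10]
rf_genes = set(X_demo.columns[top10_idx])

# Candidate hub genes: those selected by all three methods
hub_genes = sorted(lasso_genes & svm_genes & rf_genes)
print(hub_genes)
```

Taking the intersection is a conservative choice: a gene survives only if three methods with different inductive biases all rank it highly, so the resulting set can be small or even empty.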

✨ Simulated implementation ✨

  
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
plt.rcParams['font.family'] = 'Times New Roman'  
plt.rcParams['axes.unicode_minus'] = False  
import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")

# Expression matrix: genes as rows, samples as columns
fpkm = pd.read_csv('fpkm.txt',
                   sep='\t',        # tab-separated; use sep=',' for commas or sep=r'\s+' for whitespace
                   encoding='utf-8',
                   index_col=0)
  
clinical = pd.read_csv('clinical.txt',   
                      sep='\t',   
                      encoding='utf-8')  
                        
fpkm_t = fpkm.T  
merged = pd.merge(clinical, fpkm_t, left_on='accession', right_index=True, how='inner')  
merged

picture.image

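The code reads the gene expression matrix and the clinical table, transposes the expression matrix so that each row is a sample, and joins the two on the sample ID (accession). The result is one dataset that holds both the clinical variables and the matching gene expression values.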
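Because the join is an inner join, samples that appear in only one of the two tables are dropped without any message. The sketch below shows that behaviour on toy frames with hypothetical sample IDs and gene names. It is only an illustration of the merge step, not the article's data.

```python
import pandas as pd

# Hypothetical toy data: genes as rows, samples as columns (same layout as the FPKM matrix)
fpkm_demo = pd.DataFrame(
    {"S1": [1.2, 0.4], "S2": [3.1, 0.9], "S3": [2.2, 1.5]},
    index=["GENE_A", "GENE_B"],
)
clinical_demo = pd.DataFrame({"accession": ["S1", "S2", "S4"], "fustat": [0, 1, 1]})

# Transpose so each row is a sample, then inner-join on the sample ID
merged_demo = pd.merge(clinical_demo, fpkm_demo.T,
                       left_on="accession", right_index=True, how="inner")

# Samples present in only one table are dropped by the inner join
dropped_clinical = set(clinical_demo["accession"]) - set(fpkm_demo.columns)
dropped_expression = set(fpkm_demo.columns) - set(clinical_demo["accession"])

print(merged_demo)
print("clinical only:", dropped_clinical, "| expression only:", dropped_expression)
```

Checking the two "dropped" sets before modelling is a cheap way to catch mismatched sample IDs.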

  
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Separate the features from the target variable
X = merged.drop(['fustat', 'accession'], axis=1)  # feature matrix X
y = merged['fustat']  # target variable

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # 20% of the samples go to the test set
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y        # keep the class ratio the same in both sets
)

rf = RandomForestClassifier(random_state=42, n_estimators=500)
rf.fit(X_train, y_train)

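The data are split 8:2 into training and test sets, and a random forest is trained with the same setting as the first paper (n_estimators=500). Note that in practice the importance ranking usually comes from a model with default parameters, and it serves mainly for feature screening and dimensionality reduction rather than directly for model interpretation.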
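The `stratify` argument keeps the class ratio the same in the training and test sets, which matters when cases and controls are imbalanced. A small sketch with hypothetical labels (70 controls, 30 cases) and a placeholder feature:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 controls (0) and 30 cases (1)
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train size:", len(y_tr), "case fraction:", y_tr.mean())
print("test size:", len(y_te), "case fraction:", y_te.mean())
```

Without stratification, a small test set can by chance contain very few cases, which makes metrics such as AUC unstable.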

  
# Feature importances (scikit-learn: normalized mean decrease in Gini impurity)
feature_importances = rf.feature_importances_
# Feature names
feature_names = X_train.columns
# Sort by importance in descending order and keep the top 16 features
sorted_idx = np.argsort(feature_importances)[::-1][:16]

plt.figure(figsize=(10, 6))
# Light dashed guide line for each gene
for i in range(len(sorted_idx)):
    plt.axhline(y=i, color='lightgray', linestyle='--', linewidth=1, alpha=0.6)

plt.scatter(feature_importances[sorted_idx],
            np.arange(len(sorted_idx)),
            color='black')
plt.yticks(np.arange(len(sorted_idx)),
           feature_names[sorted_idx],
           fontsize=16,
           fontweight='bold')

plt.title('Importance of variables for top 16 genes', fontsize=16, fontweight='bold')
# Axis label follows the naming used in the R randomForest output
plt.xlabel('IncNodePurity', fontsize=16, fontweight='bold')
plt.xticks(fontsize=16, fontweight='bold')
plt.gca().invert_yaxis()  # most important gene at the top

plt.tight_layout()
plt.savefig("importance of Variables for top 16 genes.pdf",
            format='pdf', bbox_inches='tight', dpi=1200)
plt.show()

picture.image

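The random forest gives an importance score for every gene. The 16 genes with the highest scores are plotted to show their relative contributions. The x-axis is labelled IncNodePurity, although scikit-learn's `feature_importances_` is the normalized mean decrease in Gini impurity. This is what the first paper does. Today the ranking is more often used to screen key features and reduce model dimensionality.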
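A compact way to inspect the ranking is to wrap the importances in a pandas Series. The sketch below uses synthetic data and hypothetical gene names. It also shows that scikit-learn's impurity-based importances are normalized to sum to 1, so each value reads as a share of the total impurity reduction.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical gene names
X_arr, y_demo = make_classification(n_samples=200, n_features=20, n_informative=4,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"gene_{i}" for i in range(20)])

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Label each importance with its gene and rank in descending order
importance = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
top5 = importance.sort_values(ascending=False).head(5)

print(top5)
print("sum of all importances:", importance.sum())
```

Because the scores are relative, a cutoff such as "top 16" or "top 30" is a judgment call rather than a statistical threshold.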

  
from sklearn.feature_selection import RFECV

# Sort by importance and keep the top 30 features
top30_idx = np.argsort(feature_importances)[::-1][:30]
top30_features = feature_names[top30_idx]


# Build the training and test subsets restricted to the top 30 features
X_train_top30 = X_train[top30_features]
X_test_top30  = X_test[top30_features]

# RFECV: recursive feature elimination with cross-validation on the top 30 features
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

rfecv = RFECV(
    estimator=rf,
    step=1,             # remove one feature per iteration
    cv=5,               # 5-fold cross-validation
    scoring='roc_auc',
    n_jobs=-1
)

rfecv.fit(X_train_top30, y_train)
  
print("RFECV optimal number of features:", rfecv.n_features_)

selected_features = top30_features[rfecv.support_]
print("\nFinal features selected by RFECV:")
print(selected_features)

The top 30 genes by random forest importance are selected first. RFECV then runs recursive feature elimination with 5-fold cross-validation on these 30 genes and keeps the number of features that gives the best cross-validated score.


RFECV optimal number of features: 24
Final features selected by RFECV:
Index(['RGS1', 'IRS2', 'GABARAPL1', 'SAMSN1', 'NR4A2', 'NLRP3', 'NR4A3',
       'SNORD89', 'PELI1', 'PDE4B', 'LARS1', 'TNFAIP3', 'NFIL3', 'STK17B',
       'RCOR1', 'HBEGF', 'SYAP1', 'CREG1', 'NFKBIZ', 'CLUAP1', 'MAFF', 'STN1',
       'OSM', 'FTH1'],
      dtype='object')

The result shows that, of the 30 candidate genes, RFECV keeps 24 after recursive elimination with 5-fold cross-validation. These 24 genes form the final feature set.

  
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Restrict both sets to the features kept by RFECV
X_train_rfe = X_train[selected_features]
X_test_rfe  = X_test[selected_features]

# Base model
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Hyperparameter grid (adjust as needed)
param_grid = {
    'n_estimators': [200, 500, 800],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Stratified K-fold, suitable for classification
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

# Grid search with cross-validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring='roc_auc',   # same metric as the RFECV step above
    cv=cv,
    n_jobs=-1,
    verbose=2
)

# Fit on the feature subset kept by RFECV
grid_search.fit(X_train_rfe, y_train)
  
print("Best parameters:")
print(grid_search.best_params_)
print("\nBest CV ROC AUC:", grid_search.best_score_)

# Retrieve the best model
best_rf = grid_search.best_estimator_

Using the feature set selected by RFECV, a grid search with 5-fold stratified cross-validation tunes the random forest hyperparameters and returns the model with the best cross-validated AUC.


Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best parameters:
{'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
Best CV ROC AUC: 0.9773015873015872

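The result shows that, across 216 parameter combinations evaluated with 5-fold cross-validation, the grid search selects n_estimators=200, max_depth=None, max_features='sqrt', min_samples_leaf=2 and min_samples_split=2, with a mean cross-validated AUC of about 0.977.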
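The figures 216 and 1080 follow directly from the size of the grid. As a quick check, the sketch below counts the combinations with the standard library only. The grid is copied from the code above, and the variable names are illustrative.

```python
from itertools import product
from math import prod

# Same grid as in the search above
param_grid = {
    'n_estimators': [200, 500, 800],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
}

# Number of candidates = product of the number of options per parameter
n_candidates = prod(len(values) for values in param_grid.values())
n_folds = 5
total_fits = n_candidates * n_folds

# Enumerate the combinations explicitly, one dict per candidate
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

print(n_candidates, total_fits)
```

With the default `refit=True`, `GridSearchCV` also trains the winning combination once more on the whole training set, so adding one more value to any parameter grows the run time noticeably.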

  
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probability of the positive class on the test set
y_proba_test = best_rf.predict_proba(X_test_rfe)[:, 1]
# Compute FPR, TPR and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba_test)
auc = roc_auc_score(y_test, y_proba_test)
plt.figure(figsize=(8, 6))

# ROC curve of the best model
plt.plot(fpr, tpr, linewidth=2,
         label=f"Best RF (AUC = {auc:.3f})")
# Reference line for a random classifier
plt.plot([0, 1], [0, 1], 'r--', linewidth=1.5, alpha=0.8, label="Random")
# Figure details
plt.title("ROC Curve - Best RF Model (Test Set)", fontsize=20, fontweight="bold")
plt.xlabel("False Positive Rate (1-Specificity)", fontsize=18)
plt.ylabel("True Positive Rate (Sensitivity)", fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.legend(loc="lower right", fontsize=12)
# Hide the top and right spines
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_linewidth(1.5)
ax.spines['bottom'].set_linewidth(1.5)
# Turn off grid lines
plt.grid(False)
plt.tight_layout()
plt.savefig("ROC_best_RF_test.pdf", format='pdf', bbox_inches='tight', dpi=1200)
plt.show()

picture.image

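The best random forest model is evaluated on the test set by plotting its ROC curve and reporting the AUC. The test AUC of 1 here is an artifact of the simulated dataset. Real datasets generally do not allow perfect prediction.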
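An AUC of 1 has a simple reading: every positive sample receives a higher score than every negative sample. The sketch below computes AUC from that pairwise definition in plain Python, with hypothetical labels and scores. It is meant as an illustration and is not a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_demo = [0, 0, 1, 1]
separated = pairwise_auc(y_demo, [0.1, 0.2, 0.8, 0.9])     # every case outranks every control
overlapping = pairwise_auc(y_demo, [0.1, 0.4, 0.35, 0.8])  # one case ranks below a control

print(separated, overlapping)
```

On a small test set, a handful of well-separated samples is enough to reach 1.0, which is one more reason to treat a perfect test AUC with caution.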

  
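import shap
explainer = shap.TreeExplainer(best_rf)
shap_values = explainer.shap_values(X_test_rfe)
# Newer shap releases return one array of shape (n_samples, n_features, n_classes);
# older releases return a list with one (n_samples, n_features) array per class
if isinstance(shap_values, list):
    shap_values_class_0, shap_values_class_1 = shap_values
else:
    shap_values_class_0 = shap_values[:, :, 0]  # SHAP values for class 0
    shap_values_class_1 = shap_values[:, :, 1]  # SHAP values for class 1
plt.figure(figsize=(10, 5))
# Bar plot of mean |SHAP| per feature for class 1
shap.summary_plot(shap_values_class_1, X_test_rfe, plot_type="bar", show=False)
plt.tight_layout()
plt.savefig("summary_plot.pdf", format='pdf', bbox_inches='tight', dpi=1200)
plt.show()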
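The bar version of the summary plot ranks features by the mean absolute SHAP value across samples. The sketch below reproduces that ranking with NumPy on a small hypothetical SHAP matrix. The numbers and gene names are made up for illustration and are not output from the model above.

```python
import numpy as np

# Hypothetical SHAP values for class 1: 4 samples x 3 genes (not real model output)
shap_demo = np.array([
    [ 0.20, -0.05,  0.01],
    [-0.10,  0.15,  0.00],
    [ 0.30, -0.10, -0.02],
    [-0.20,  0.10,  0.01],
])
gene_names = ["gene_a", "gene_b", "gene_c"]

# Global importance of each gene = mean of |SHAP| over all samples
mean_abs = np.abs(shap_demo).mean(axis=0)

# Rank genes from most to least influential
order = np.argsort(mean_abs)[::-1]
ranking = [(gene_names[i], float(mean_abs[i])) for i in order]

for name, value in ranking:
    print(f"{name}: mean |SHAP| = {value:.3f}")
```

Taking the absolute value first means the bar plot shows how strongly a gene moves predictions, not in which direction. The direction is visible in the default (beeswarm) summary plot instead.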