Journal Figures: Using Multi-Variable Heatmaps to Visualize Sample Features Alongside Model Predictions


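This section shows how to use multi-variable heatmaps to visualize sample features together with model predictions. The data are simulated and carry no real-world meaning. The code and figures reflect the author's own understanding of machine learning and are for reference only. The complete data and code are to be shared through the author's discussion group.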

✨ Paper Information ✨


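This kind of multi-variable heatmap lays out each sample's features, model inputs, and prediction results as aligned tracks in a single figure, so that patterns across variables, their associations, and their potential contribution to the model's predictions can be read at a glance. The first type of figure emphasizes the overall distribution of clinical and biological indicators (stage, histology, tumor size, ctDNA, and so on) across a patient cohort and their relationship to key molecular events. The second type goes further and brings in machine-learning output: it combines sample features, predicted probabilities, true outcomes, and the training/test split in one display, which helps to quickly spot the feature patterns of samples the model handles well or tends to get wrong. This view lets researchers understand the structured relationships among variables from a global perspective and provides a basis for model interpretation, feature selection, and biological inference.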

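R already has mature tools for this kind of figure (for example the ComplexHeatmap package and its oncoPrint function) that can be called directly. Here an equivalent visualization is built from scratch in Python. By controlling track types, color mappings, and layout, a similar display is reproduced, so that clinical features, molecular indicators, model input variables, and machine-learning predictions can be shown within one framework.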

✨ Code Implementation ✨

  
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors  # required for LinearSegmentedColormap below

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False

# Suppress all warnings
warnings.filterwarnings("ignore")
  
path = r"2025-12-11公众号Python机器学习AI.csv"  
df = pd.read_csv(path)  
  
# Map each numeric code to a readable category label
df['sex_label'] = df['sex'].map({1.0: 'Male', 0.0: 'Female'})  
df['cp_label'] = df['cp'].map({1.0: 'Typical Angina', 2.0: 'Atypical Angina', 3.0: 'Non-anginal Pain', 4.0: 'Asymptomatic'})  
df['slope_label'] = df['slope'].map({1.0: 'Upsloping', 2.0: 'Flat', 3.0: 'Downsloping'})  
df['thal_label'] = df['thal'].map({0: 'Normal', 1: 'Fixed Defect', 2: 'Reversable Defect'})  
df['ca_label'] = df['ca'].map({0: '0 vessel', 1: '1 vessel', 2: '2 vessels', 3: '3 vessels'})  
df['Disease_Status'] = df['target'].map({0: 'No Disease', 1: 'Disease'})  
  
# Colors and configuration
colors_sex = {'Male': '#1976D2', 'Female': '#E91E63'}  
colors_cp = {'Typical Angina': '#F44336', 'Atypical Angina': '#FF9800', 'Non-anginal Pain': '#FFEB3B', 'Asymptomatic': '#4CAF50'}  
colors_thal = {'Normal': '#81C784', 'Fixed Defect': '#EF5350', 'Reversable Defect': '#AB47BC'}  
colors_slope = {'Upsloping': '#AED581', 'Flat': '#FFD54F', 'Downsloping': '#E57373'}  
cmap_bps = mcolors.LinearSegmentedColormap.from_list("bp", ["#FFF3E0", "#BF360C"])  
cmap_chol = mcolors.LinearSegmentedColormap.from_list("chol", ["#E8F5E9", "#1B5E20"])  
colors_ca = {
    '0 vessel':  '#D7CCC8',  # light brown (mildest)
    '1 vessel':  '#A1887F',  # medium brown
    '2 vessels': '#6D4C41',  # dark brown
    '3 vessels': '#3E2723'   # very dark brown (most severe)
}
colors_disease = {'No Disease': '#E0E0E0', 'Disease': '#D32F2F'}   
df


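This step reads the dataset, converts the raw numeric codes into labels, and sets up the color mappings needed for plotting, which completes the preprocessing and color configuration for the multi-variable heatmap built later.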
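One thing worth checking at this stage is whether every code in the data is covered by its mapping, because `Series.map` silently returns `NaN` for any value missing from the dictionary. The following is a minimal sketch of such a check. The helper name `map_with_check` and the toy data are assumptions made for illustration and are not part of the original code.

```python
import pandas as pd


def map_with_check(series, mapping):
    """Map numeric codes to labels and report codes missing from the mapping."""
    labels = series.map(mapping)
    # Codes present in the data that produced no label (true NaNs are ignored)
    missing = series[labels.isna() & series.notna()]
    unmapped = sorted(missing.unique().tolist())
    return labels, unmapped


# Toy data (hypothetical): chest-pain codes including a 0 the mapping does not cover
toy = pd.DataFrame({'cp': [1.0, 2.0, 0.0, 4.0, 3.0]})
cp_mapping = {1.0: 'Typical Angina', 2.0: 'Atypical Angina',
              3.0: 'Non-anginal Pain', 4.0: 'Asymptomatic'}

toy['cp_label'], unmapped_cp = map_with_check(toy['cp'], cp_mapping)
print("Unmapped cp codes:", unmapped_cp)
```

If the list of unmapped codes is not empty, the corresponding samples would have no label for that track, so it is usually worth fixing the mapping before plotting.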

  
selected_config = [  
    # 1. Age (Bar)  
    {'column': 'age', 'type': 'bar', 'label': 'Age', 'height_ratio': 1.5,  
     'color_map': '#546E7A', 'ylim': [20, 80]},  
  
    # 2. Sex  
    {'column': 'sex_label', 'type': 'categorical', 'label': 'Sex',  
     'height_ratio': 0.8, 'color_map': colors_sex},  
  
    # 3. CP  
    {'column': 'cp_label', 'type': 'categorical', 'label': 'Chest Pain',  
     'height_ratio': 0.8, 'color_map': colors_cp},  
  
    # 4. Resting BP (Continuous)  
    {'column': 'trestbps', 'type': 'continuous', 'label': 'Resting BP',  
     'height_ratio': 0.8, 'color_map': cmap_bps, 'range': [90, 180]},  
  
    # 5. Cholesterol (Continuous)  
    {'column': 'chol', 'type': 'continuous', 'label': 'Cholesterol',  
     'height_ratio': 0.8, 'color_map': cmap_chol, 'range': [120, 400]},  
  
    # 6. Slope  
    {'column': 'slope_label', 'type': 'categorical', 'label': 'Slope',  
     'height_ratio': 0.8, 'color_map': colors_slope},  
  
    # 7. CA   
    {'column': 'ca_label', 'type': 'categorical', 'label': 'CA',  
     'height_ratio': 0.8, 'color_map': colors_ca},  
  
    # 8. Thal  
    {'column': 'thal_label', 'type': 'categorical', 'label': 'Thal',  
     'height_ratio': 0.8, 'color_map': colors_thal},  
  
    # 9. Target  
    {'column': 'Disease_Status', 'type': 'categorical', 'label': 'Target',  
     'height_ratio': 1.0, 'color_map': colors_disease},  
]

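This builds a visualization configuration list covering age, sex, chest pain type, blood pressure, cholesterol, slope, number of vessels, Thal type, and disease status. For each variable it specifies the plot type to use in the multi-track heatmap (an OncoPrint-style figure): bar chart, continuous color band, or categorical color block, together with the color mapping, display label, and track height. The list is the complete set of parameters from which the structured, interpretable sample-feature figure is generated.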
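The article does not show the body of `plot_custom_oncoprint`, so the following is only a rough sketch of how a configuration list of this shape could be turned into stacked tracks with matplotlib. The function `plot_tracks_sketch` is hypothetical and much simpler than the real one: it draws no legends, sample labels, or output file, and the toy data are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import pandas as pd


def plot_tracks_sketch(df, config, figsize=(10, 4)):
    """Hypothetical simplified renderer: one row of axes per config entry."""
    ratios = [c['height_ratio'] for c in config]
    fig, axes = plt.subplots(
        len(config), 1, figsize=figsize, sharex=True, squeeze=False,
        gridspec_kw={'height_ratios': ratios, 'hspace': 0.1})
    axes = axes[:, 0]
    n = len(df)
    x = np.arange(n)
    extent = (-0.5, n - 0.5, 0, 1)  # one cell per sample, centred on integers

    for ax, c in zip(axes, config):
        values = df[c['column']]
        if c['type'] == 'bar':
            ax.bar(x, values, width=0.9, color=c['color_map'])
            ax.set_ylim(*c['ylim'])
        elif c['type'] == 'continuous':
            vmin, vmax = c['range']
            row = values.to_numpy(dtype=float)[np.newaxis, :]
            ax.imshow(row, aspect='auto', cmap=c['color_map'],
                      vmin=vmin, vmax=vmax, extent=extent)
            ax.set_yticks([])
        elif c['type'] == 'categorical':
            # Look up one RGBA colour per sample; unmapped labels are drawn white
            rgba = np.array([mcolors.to_rgba(c['color_map'].get(v, 'white'))
                             for v in values])
            ax.imshow(rgba[np.newaxis, :, :], aspect='auto', extent=extent)
            ax.set_yticks([])
        else:
            raise ValueError(f"Unknown track type: {c['type']}")
        ax.set_ylabel(c['label'], rotation=0, ha='right', va='center')

    axes[-1].set_xlim(-0.5, n - 0.5)
    return fig, axes


# Toy data and configuration (hypothetical values)
toy = pd.DataFrame({
    'age': [45, 60, 52, 38],
    'sex_label': ['Male', 'Female', 'Male', 'Female'],
    'chol': [210.0, 320.0, 180.0, 250.0],
})
toy_config = [
    {'column': 'age', 'type': 'bar', 'label': 'Age', 'height_ratio': 1.5,
     'color_map': '#546E7A', 'ylim': [20, 80]},
    {'column': 'sex_label', 'type': 'categorical', 'label': 'Sex',
     'height_ratio': 0.8, 'color_map': {'Male': '#1976D2', 'Female': '#E91E63'}},
    {'column': 'chol', 'type': 'continuous', 'label': 'Cholesterol',
     'height_ratio': 0.8, 'color_map': 'Greens', 'range': [120, 400]},
]
fig, axes = plot_tracks_sketch(toy, toy_config)
```

Keeping each track as a dictionary means a new variable can be added or removed by editing the list, without touching the drawing code.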

  
fig, axes = plot_custom_oncoprint(  
    df,  
    selected_config,  
  
    # Figure layout: overall canvas size
    figsize=(22, 14),

    # Legend layout: number of columns and column width
    legend_n_cols=5,         # arrange the legends in 5 columns
    legend_col_width=0.14,   # fraction of the width taken by one legend column

    # X-axis labeling density
    xlabel_step=10,          # show a sample name every 10 samples

    # Legend scaling
    legend_scale=1.5,        # enlarge legend text and color swatches
    output_path='Heat map-1.pdf'
)  
  
plt.show()


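This calls the custom plot_custom_oncoprint function with the feature configuration selected_config built above, together with the chosen canvas size, legend layout, X-axis label density, and legend scaling, to produce a multi-track heatmap of the clinical variables.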

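The figure shows the overall distribution and patterns of all samples across the key clinical features: age, sex, chest pain type, blood pressure, cholesterol, ECG slope, number of vessels, Thal status, and disease outcome. It is used to inspect differences among variables and their relationship to heart disease status. No model is involved at this point; model predictions are added next.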

  
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Separate features and target
X = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'slope', 'ca', 'thal']]
y = df['target']  # 1-D Series, the shape scikit-learn expects for y

# Split into training and test sets (70/30, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

# Build a random forest model with default hyperparameters
rf_clf = RandomForestClassifier(random_state=42)

# Fit the model on the training set
rf_clf.fit(X_train, y_train)

# Record which set each sample belongs to
df['set_type'] = 'train'                   # default: every sample is train
df.loc[X_test.index, 'set_type'] = 'test'  # test-set indices are marked test

# Predict on the training and test sets separately
y_pred_train = rf_clf.predict(X_train)
y_pred_test = rf_clf.predict(X_test)

# Write the predictions back by index
df['predicted'] = np.nan
df.loc[X_train.index, 'predicted'] = y_pred_train
df.loc[X_test.index, 'predicted'] = y_pred_test

# Boolean flag: does the prediction match the true label?
df['correct'] = df['predicted'] == df['target']
df


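A random forest model is trained on the heart disease data and used for prediction. Three pieces of information are then written back to the original DataFrame df for every sample: whether it belongs to the training or test set, the model's prediction, and whether that prediction is correct.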
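Once these columns exist, the per-set accuracy can be summarized numerically before anything is drawn. The sketch below uses a small invented table with the same column names (`set_type`, `target`, `predicted`, `correct`). The values are hypothetical and are not results from the article's dataset.

```python
import pandas as pd

# Toy table (hypothetical values) with the same columns as df above
toy = pd.DataFrame({
    'set_type':  ['train', 'train', 'train', 'test', 'test', 'test', 'test'],
    'target':    [1, 0, 1, 1, 0, 0, 1],
    'predicted': [1, 0, 1, 0, 0, 1, 1],
})
toy['correct'] = toy['predicted'] == toy['target']

# Sample count, number of correct predictions, and accuracy for each set
summary = toy.groupby('set_type')['correct'].agg(
    n_samples='size', n_correct='sum', accuracy='mean')
print(summary)
```

A large gap between the training and test rows of this summary is the numerical counterpart of what the Set Type and Correct tracks show visually.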

  
colors_set_type = {
    'train': '#90A4AE',   # blue-grey
    'test':  '#FF7043'    # deep orange
}
colors_correct = {
    True:  '#66BB6A',   # correct
    False: '#EF5350'    # incorrect
}
selected_config = [  
    # 1. Age (Bar)  
    {'column': 'age', 'type': 'bar', 'label': 'Age', 'height_ratio': 0.5,  
     'color_map': '#546E7A', 'ylim': [20, 80]},  
  
    # 2. Sex  
    {'column': 'sex_label', 'type': 'categorical', 'label': 'Sex',  
     'height_ratio': 0.5, 'color_map': colors_sex},  
  
    # 3. CP  
    {'column': 'cp_label', 'type': 'categorical', 'label': 'Chest Pain',  
     'height_ratio': 0.5, 'color_map': colors_cp},  
  
    # 4. Resting BP  
    {'column': 'trestbps', 'type': 'continuous', 'label': 'Resting BP',  
     'height_ratio': 0.5, 'color_map': cmap_bps, 'range': [90, 180]},  
  
    # 5. Cholesterol  
    {'column': 'chol', 'type': 'continuous', 'label': 'Cholesterol',  
     'height_ratio': 0.5, 'color_map': cmap_chol, 'range': [120, 400]},  
  
    # 6. Slope  
    {'column': 'slope_label', 'type': 'categorical', 'label': 'Slope',  
     'height_ratio': 0.5, 'color_map': colors_slope},  
  
    # 7. CA  
    {'column': 'ca_label', 'type': 'categorical', 'label': 'CA',  
     'height_ratio': 0.5, 'color_map': colors_ca},  
  
    # 8. Thal  
    {'column': 'thal_label', 'type': 'categorical', 'label': 'Thal',  
     'height_ratio': 0.5, 'color_map': colors_thal},  
  
    # 9. Target  
    {'column': 'Disease_Status', 'type': 'categorical', 'label': 'Target',  
     'height_ratio': 0.5, 'color_map': colors_disease},  
  
    # 10. Set Type (new)
    {'column': 'set_type', 'type': 'categorical', 'label': 'Set Type',  
     'height_ratio': 0.5, 'color_map': colors_set_type},  
  
    # 11. Correct (new)
    {'column': 'correct', 'type': 'categorical', 'label': 'Correct',  
     'height_ratio': 0.5, 'color_map': colors_correct},  
]

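This defines the configuration for the multi-variable heatmap and, on top of the original clinical features and disease status, adds two tracks: dataset source (train/test) and prediction correctness (correct/incorrect). The figure then shows not only the sample features but also how the model's prediction performance is distributed across samples.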
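The second type of figure described earlier also includes the model's predicted probability. One possible way to add it is as a continuous track, sketched below on synthetic data. The column name `pred_prob`, the colormap, and the track entry `prob_track` are assumptions for illustration. The entry follows the same shape as the continuous tracks above, but whether `plot_custom_oncoprint` accepts it unchanged is not confirmed by the article.

```python
import pandas as pd
import matplotlib.colors as mcolors
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the article's dataset)
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=42)
feature_cols = [f'f{i}' for i in range(6)]
toy = pd.DataFrame(X_arr, columns=feature_cols)
toy['target'] = y_arr

X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols], toy['target'],
    test_size=0.3, random_state=42, stratify=toy['target'])

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Probability of the positive class (label 1) for every sample
positive_col = list(clf.classes_).index(1)
toy['pred_prob'] = clf.predict_proba(toy[feature_cols])[:, positive_col]

# Track entry in the same shape as the continuous tracks used above
cmap_prob = mcolors.LinearSegmentedColormap.from_list(
    'prob', ['#E3F2FD', '#0D47A1'])
prob_track = {'column': 'pred_prob', 'type': 'continuous',
              'label': 'Pred. Prob.', 'height_ratio': 0.5,
              'color_map': cmap_prob, 'range': [0, 1]}
```

Appending `prob_track` to the configuration list would place the probability band next to the Correct track, which makes borderline predictions (values near 0.5) easier to relate to errors.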

  
fig, axes = plot_custom_oncoprint(  
    df,  
    selected_config,  
  
    # Figure layout: overall canvas size
    figsize=(22, 14),

    # Legend layout: number of columns and column width
    legend_n_cols=5,         # arrange the legends in 5 columns
    legend_col_width=0.13,   # fraction of the width taken by one legend column

    # X-axis labeling density
    xlabel_step=10,          # show a sample name every 10 samples

    # Legend scaling
    legend_scale=1.5,        # enlarge legend text and color swatches
    output_path='Heat map-2.pdf'
)  
  
plt.show()


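With the two new tracks, Set Type and Correct, differences in the model's performance between the training and test sets can be read directly from the overall sample layout:

Set Type (train/test) shows at a glance which samples come from the training set and which from the test set, which makes it possible to judge whether the model performs consistently on both and whether it is overfitting;
Correct (True/False) shows whether each sample was predicted correctly, so misclassified samples can be located quickly and read against the feature tracks above to see whether the errors coincide with particular clinical feature patterns. This helps in understanding the model's weaknesses and potential biases. In this figure the misclassified samples appear mostly in the test set, which is typical for a random forest with default settings because it fits its training data very closely.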
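To go from visual inspection to numbers, the error rate can be tabulated per category of a feature within the test set. The sketch below assumes columns named as in the article (`set_type`, `cp_label`, `correct`). The helper `error_rate_by` and the toy values are hypothetical.

```python
import pandas as pd


def error_rate_by(df, column, subset='test'):
    """Error count and error rate per category of `column` within one subset."""
    part = df[df['set_type'] == subset]
    grouped = part['correct'].astype(bool).groupby(part[column])
    out = pd.DataFrame({'n': grouped.size(),
                        'errors': grouped.size() - grouped.sum()})
    out['error_rate'] = out['errors'] / out['n']
    # Categories with the highest error rate come first
    return out.sort_values('error_rate', ascending=False)


# Toy table (hypothetical values, not results from the article)
toy = pd.DataFrame({
    'set_type': ['test'] * 6 + ['train'] * 2,
    'cp_label': ['Asymptomatic', 'Asymptomatic', 'Asymptomatic',
                 'Typical Angina', 'Typical Angina', 'Typical Angina',
                 'Asymptomatic', 'Typical Angina'],
    'correct':  [False, False, True, True, True, False, True, True],
})

rates = error_rate_by(toy, 'cp_label')
print(rates)
```

With small groups these rates are noisy, so the sample count `n` is worth reading alongside each rate.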
