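New Algorithm: Classification with xRFM, a Recursive Feature Machine Fusing Tree Models and Kernel Methods for Feature Learning on Tabular Data, with SHAP Interpretation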

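This section introduces classification with xRFM, a recursive feature machine that fuses tree models and kernel methods for feature learning on tabular data, together with SHAP interpretation. The data are simulated and have no real-world meaning. The code and figures reflect the author's own understanding of machine learning and are for reference only. The complete data and code will be uploaded to the discussion group later, where members can download them; see the end of the article for how to obtain them.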

✨ Paper information ✨

[Image: paper information]

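xRFM is a feature learning model for tabular data. It combines kernel methods with a tree structure: the training data are recursively split into subsets along projection directions, and a local recursive feature machine (RFM) is trained at each leaf node. This lets the model adapt to local structure in the data and scale efficiently to large training sets, while the average gradient outer product provides built-in interpretability. According to the paper, xRFM significantly outperforms traditional gradient boosted decision trees (GBDT) and other state-of-the-art models on regression and classification tasks.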

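A previous article covered building the regression model (New Algorithm: Regression with xRFM, a Recursive Feature Machine Fusing Tree Models and Kernel Methods for Feature Learning on Tabular Data, with SHAP Interpretation). This article walks through building the classification model.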

✨ Code implementation ✨

  
import warnings  
  
import matplotlib.pyplot as plt  
import numpy as np  
import pandas as pd  
from sklearn.model_selection import train_test_split  
  
plt.rcParams['font.family'] = 'Times New Roman'  
plt.rcParams['axes.unicode_minus'] = False  
# Suppress all warnings to keep the output clean  
warnings.filterwarnings("ignore")  
  
path = r"2025-12-16公众号Python机器学习AI.csv"  
df = pd.read_csv(path)  
  
X = df.drop(['target'], axis=1)  
y = df['target']  
  
# Hold out 20% of the data as the test set, stratified by the target  
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)  
# Split the remaining 80% into training and validation sets  
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.125, random_state=42, stratify=y_temp)  # 0.125 x 0.8 = 0.1

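The original dataset is split into training, validation, and test sets. The test set takes 20% of the data, and the validation set takes 12.5% of the remaining 80% (10% of the whole), which leaves 70% for training. Both splits are stratified so that the class distribution of target stays consistent across the subsets. The simulated data contain both categorical features and continuous (numerical) features.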
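
The split arithmetic can be checked on synthetic data. The sketch below is only an illustration: the 1,000-row array and the 30% positive rate are assumptions, not the article's dataset, and the `_demo` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 30% positives
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 300 + [0] * 700)

# Same two-stage stratified split as in the article
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.125, random_state=42, stratify=y_tmp)

# Subset sizes and the share of positives in each subset
split_sizes = {"train": len(y_tr), "val": len(y_va), "test": len(y_te)}
positive_rates = {"train": y_tr.mean(), "val": y_va.mean(), "test": y_te.mean()}
print(split_sizes)
print(positive_rates)
```

With stratification, the positive rate in each subset should stay close to the overall rate, which is the property the article relies on.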

  
from sklearn.preprocessing import StandardScaler, OneHotEncoder  
# Split the columns by feature type  
num_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']  # continuous features  
cat_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']  # categorical features  
  
# Standardize the continuous features (fit on the training set only)  
scaler = StandardScaler()  
X_num_train = scaler.fit_transform(X_train[num_cols])  
X_num_val = scaler.transform(X_val[num_cols])  
X_num_test = scaler.transform(X_test[num_cols])  
  
# One-hot encode the categorical features (fit on the training set only)  
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  
X_cat_train = ohe.fit_transform(X_train[cat_cols])  
X_cat_val = ohe.transform(X_val[cat_cols])  
X_cat_test = ohe.transform(X_test[cat_cols])  
  
# Concatenate numerical and categorical features, with the categorical columns last  
X_train_processed = np.hstack([X_num_train, X_cat_train]).astype(np.float32)  
X_val_processed = np.hstack([X_num_val, X_cat_val]).astype(np.float32)  
X_test_processed = np.hstack([X_num_test, X_cat_test]).astype(np.float32)  
  
y_train_array = y_train.to_numpy().astype(np.int64)  
y_val_array = y_val.to_numpy().astype(np.int64)  
y_test_array = y_test.to_numpy().astype(np.int64)

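The continuous features are standardized and the categorical features are one-hot encoded, with both transformers fitted on the training set only. The processed numerical and categorical columns are then concatenated into the final training, validation, and test feature matrices.

The categorical features are already stored as integer codes; here they are one-hot encoded, turning each discrete category into a binary indicator variable. This makes the category information easy for the model to use, but one-hot encoding also increases the feature dimension substantially and can cause a dimensionality explosion, especially when a feature has many categories.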
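
To make the dimension growth concrete, here is a minimal NumPy sketch of one-hot encoding a single integer-coded column. The column values are hypothetical and the code does not use scikit-learn's `OneHotEncoder`; it only shows that one column with k distinct codes becomes k indicator columns.

```python
import numpy as np

# Hypothetical integer-coded column with four distinct categories
cp_codes = np.array([1, 4, 3, 2, 4, 1])

# Sorted distinct values define the order of the indicator columns
categories = np.unique(cp_codes)

# Compare every sample with every category to get a 0/1 indicator matrix
one_hot = (cp_codes[:, None] == categories).astype(np.float32)

print(categories)
print(one_hot.shape)  # (n_samples, n_categories)
```

Each row contains exactly one 1, so a single raw column here turns into four columns; the same growth applies to every categorical feature.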

  
import torch  
  
# Number of numerical features  
n_num = X_num_train.shape[1]  
  
categorical_indices = []  
categorical_vectors = []  
  
start = n_num  # categorical columns start right after the numerical ones  
  
for cats in ohe.categories_:  
    cat_len = len(cats)  
    # Column range of this categorical feature in the full feature matrix  
    idxs = torch.arange(start, start + cat_len, dtype=torch.long)  
    categorical_indices.append(idxs)  
  
    # Identity matrix holding the one-hot vectors of this feature  
    categorical_vectors.append(torch.eye(cat_len, dtype=torch.float32))  
  
    start += cat_len  # move to the start of the next categorical feature  
  
# Index tensor for the numerical features  
numerical_indices = torch.arange(0, n_num, dtype=torch.long)  
  
categorical_info = dict(  
    numerical_indices=numerical_indices,  
    categorical_indices=categorical_indices,  
    categorical_vectors=categorical_vectors,  
)  
  
print("Numerical indices:", numerical_indices)  
print("Categorical indices:", categorical_indices)  
print("Categorical vectors sizes:", [v.shape for v in categorical_vectors])

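Based on the one-hot encoded feature matrix, this builds the column indices of the numerical features and of each categorical feature, plus an identity matrix for each categorical feature. The resulting categorical_info dictionary makes it easy to locate and handle numerical and categorical features in PyTorch later on.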

  
Numerical indices: tensor([0, 1, 2, 3, 4])  
Categorical indices: [tensor([5, 6]), tensor([ 7,  8,  9, 10]), tensor([11, 12]), tensor([13, 14, 15]), tensor([16, 17]), tensor([18, 19, 20]), tensor([21, 22, 23, 24]), tensor([25, 26, 27])]  
Categorical vectors sizes: [torch.Size([2, 2]), torch.Size([4, 4]), torch.Size([2, 2]), torch.Size([3, 3]), torch.Size([2, 2]), torch.Size([3, 3]), torch.Size([4, 4]), torch.Size([3, 3])]

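Numerical indices: tensor([0, 1, 2, 3, 4]) means the numerical features occupy the first 5 columns of the full feature matrix (5 continuous features in total). Categorical indices is a list of 8 tensors, one per categorical feature. Each tensor gives that feature's column range in the full matrix, and its length equals the feature's one-hot dimension (for example, the first categorical feature occupies columns 5 and 6, 2 dimensions in total).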
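
The index layout can be reproduced with plain Python. The sketch below takes the category counts from the printed "Categorical vectors sizes" output and recomputes the column ranges; the variable names are illustrative.

```python
# Category counts read off the printed "Categorical vectors sizes"
category_sizes = [2, 4, 2, 3, 2, 3, 4, 3]
n_num = 5  # five continuous features occupy columns 0..4

index_ranges = []
start = n_num
for size in category_sizes:
    # Each categorical feature takes the next `size` consecutive columns
    index_ranges.append(list(range(start, start + size)))
    start += size

total_columns = start  # width of the processed feature matrix
print(index_ranges)
print(total_columns)
```

The 13 raw features (5 continuous, 8 categorical) therefore expand to 28 columns after one-hot encoding, which matches the last printed index of 27.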

  
from xrfm import xRFM  
  
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
  
rfm_params = {  
    'model': {  
        'kernel': 'l2',  
        'bandwidth': 10.0,  
        'exponent': 1.0,  
        'diag': False,  
        'bandwidth_mode': 'constant',  
    },  
    'fit': {  
        'reg': 1e-3,  
        'iters': 3,  
        'verbose': True,  
        'early_stop_rfm': True,  
    }  
}  
  
model = xRFM(  
    rfm_params=rfm_params,  
    device=device,  
    tuning_metric='accuracy',  # metric optimized during tuning (binary classification)  
    categorical_info=categorical_info,  
    task='classification',     # explicitly set the task to classification  
)  
  
model.fit(X_train_processed, y_train_array, X_val_processed, y_val_array)

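Given the parameter configuration and the categorical feature information, this initializes and trains an xRFM binary classifier, using the training set for fitting and the validation set for tuning.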

[Figure: confusion matrix and ROC curve of the xRFM model on the test set]

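The visualization uses a confusion matrix to show the xRFM model's correct and incorrect classifications on the test set, and an ROC curve whose AUC of 0.924 indicates strong discrimination between positive and negative samples.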
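
The evaluation code is not shown in the article, so the following is only a sketch of how a confusion matrix and an AUC are typically computed with scikit-learn. The labels and scores are made up for illustration and are not the article's results; in the article's setting, `y_true` would correspond to `y_test_array` and `y_score` to the fitted model's class-1 scores on the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up labels and class-1 scores, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.70, 0.90, 0.60])

y_pred = (y_score >= 0.5).astype(int)   # threshold the scores at 0.5
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_score)    # uses the raw scores, not the thresholded labels

print(cm)
print(auc)
```

The confusion matrix depends on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.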

[Figure: SHAP summary plot for class 1]

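The SHAP summary plot shows how strongly, and in which direction, each feature affects the model's prediction of class 1. Color encodes the feature value: red (high) and blue (low) points falling on opposite sides of zero indicate that high and low values push the prediction in different directions. For a classification model, SHAP values can be computed for all classes or only for one class of interest; the analysis here uses the SHAP values for class 1.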
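
The sign convention can be illustrated with a case where SHAP values have a closed form. For a linear model with features treated as independent, the SHAP value of feature i is the weight times the deviation from the background mean. The sketch below uses a hypothetical linear model in log-odds space as a stand-in; it is not xRFM and not the explainer used in the article.

```python
import numpy as np

# Hypothetical linear model in log-odds space: f(x) = w . x + b
w = np.array([0.8, -1.5, 0.3])
b = -0.2

# Background data that defines the baseline expectation
X_bg = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 0.0],
                 [2.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0]])
x = np.array([2.0, 1.0, 0.0])   # the sample to explain

mean_x = X_bg.mean(axis=0)
base_value = w @ mean_x + b      # expected log-odds of class 1
shap_class1 = w * (x - mean_x)   # per-feature contribution toward class 1
shap_class0 = -shap_class1       # class-0 log-odds is the negation in the binary case

print(shap_class1)
# Additivity: baseline plus contributions reproduces the model output
print(base_value + shap_class1.sum(), w @ x + b)
```

A positive value pushes the output toward class 1 and a negative value pushes it toward class 0, so in the binary case analyzing class 1 alone carries the same information as analyzing both classes.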

  
import seaborn as sns  
plt.figure(figsize=(8,6))  
sns.boxplot(  
    x=X_test_processed_df['ca_0'],   
    y=shap_values_df['ca_0'],  
    palette=['#4c72b0', '#55a868'],  
    width=0.3  
)  
  
plt.xlabel('ca_0', fontsize=18, fontweight='bold')  
plt.ylabel('SHAP value for ca_0', fontsize=18, fontweight='bold')  
plt.title('SHAP Values Distribution by ca_0 Category', fontsize=18, fontweight='bold')  
  
plt.xticks(ticks=[0, 1], labels=['Not ca_0', 'ca_0'], fontsize=18, fontweight='bold')  
  
plt.grid(axis='y', linestyle='--', alpha=0.7)  
  
sns.despine(offset=10, trim=True)  
  
plt.tight_layout()  
plt.savefig("shap_4.PDF", format='pdf', bbox_inches='tight', dpi=1200)  
plt.show()  
plt.figure(figsize=(8,6))  
sns.boxplot(  
    x=X_test_processed_df['cp_4'],   
    y=shap_values_df['cp_4'],  
    palette=['#4c72b0', '#55a868'],  
    width=0.3  
)  
  
plt.xlabel('cp_4', fontsize=18, fontweight='bold')  
plt.ylabel('SHAP value for cp_4', fontsize=18, fontweight='bold')  
plt.title('SHAP Values Distribution by cp_4 Category', fontsize=18, fontweight='bold')  
  
plt.xticks(ticks=[0, 1], labels=['Not cp_4', 'cp_4'], fontsize=18, fontweight='bold')  
  
plt.grid(axis='y', linestyle='--', alpha=0.7)  
  
sns.despine(offset=10, trim=True)  
  
plt.tight_layout()  
plt.savefig("shap_5.PDF", format='pdf', bbox_inches='tight', dpi=1200)  
plt.show()

[Figure: box plots of SHAP values for ca_0 and cp_4]

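Box plots show the distribution of SHAP values for the two highest-contributing categorical features, ca_0 and cp_4, at each indicator value (0 and 1), illustrating how the different categories affect the model output.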

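When ca_0 is 1, the SHAP values are mostly negative, meaning this category tends to lower the predicted probability of class 1 (pushing the prediction toward class 0). When ca_0 is 0, the SHAP values are more spread out and lean positive, so the effect is less certain and may raise the prediction.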
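
The same reading can be done numerically by grouping SHAP values by the indicator value. This is a minimal pandas sketch on made-up numbers, not the article's results; with the article's data, the two columns would come from `X_test_processed_df['ca_0']` and `shap_values_df['ca_0']`.

```python
import pandas as pd

# Made-up values for illustration only; not the article's results
demo = pd.DataFrame({
    "ca_0": [1, 1, 1, 0, 0, 0],
    "shap_ca_0": [-0.06, -0.04, -0.05, 0.02, 0.09, 0.04],
})

# Summarize the SHAP distribution for each value of the indicator
summary = demo.groupby("ca_0")["shap_ca_0"].agg(["median", "min", "max"])
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

The median gives the typical direction of the effect for each group and the range gives its spread, which are the two properties read off the box plots.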

  
plot_shap_dependence(feature_name='oldpeak', X_data=X_test_processed_df, shap_values_df=shap_values_df, poly_degree=2,   
                     save_pdf_name="SHAP_2.pdf", legend_location='upper left')  
plot_shap_dependence(feature_name='trestbps', X_data=X_test_processed_df, shap_values_df=shap_values_df, poly_degree=2,   
                     save_pdf_name="SHAP_3.pdf", legend_location='upper left')
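
The definition of `plot_shap_dependence` is not included in the text shown here. Its `poly_degree=2` argument suggests that a quadratic trend line is fitted over the scatter of feature value against SHAP value. Under that assumption, the fitting step might look like the sketch below; `fit_dependence_curve` is a hypothetical helper, not the author's function, and the data are synthetic.

```python
import numpy as np

def fit_dependence_curve(feature_values, shap_values, degree=2):
    """Least-squares polynomial fit of SHAP value against feature value."""
    coeffs = np.polyfit(feature_values, shap_values, deg=degree)
    return np.poly1d(coeffs)

# Synthetic check: points lying exactly on a known quadratic
x_demo = np.linspace(-2.0, 2.0, 21)
s_demo = 0.5 * x_demo**2 - 0.3 * x_demo + 0.1

curve = fit_dependence_curve(x_demo, s_demo, degree=2)
print(curve.coefficients)  # highest power first
```

The returned `np.poly1d` object can be evaluated on a grid of feature values, for example `curve(np.linspace(x_demo.min(), x_demo.max(), 200))`, to draw the trend line over the scatter.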