Journal Reproduction: Hyperparameter Optimization of ML Models and Visualization of Prediction Performance


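✨ Follow Python机器学习AI ✨

This post covers hyperparameter optimization for ML models and figures that visualize model prediction performance. The data are simulated and have no real-world meaning. The code and figures reflect the author's own understanding of machine learning and are for reference only. The complete data and code will be uploaded to the discussion group later, where members can download them; see the end of the post on the public account for how to obtain them.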

✨ Paper information ✨


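The paper uses three tree-based models that are widely applied in environmental and materials research, the decision regression tree (DRT), random forest (RF), and gradient boosting decision tree (GBDT), to analyze how the main features affect the maximum Hg⁰ removal efficiency and to predict it.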

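The key hyperparameters affecting tree-model performance are max_depth, the maximum depth of a single tree, and n_estimators, the number of trees in an ensemble (RF and GBDT). The paper sets the search ranges as follows: for DRT (an ordinary regression decision tree, which the paper abbreviates as DRT), max_depth ∈ [1, 20]; for RF and GBDT, max_depth ∈ [1, 20] and n_estimators ∈ [1, 100]. Other parameters, such as min_samples_split and min_samples_leaf, keep their default values.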
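
The search ranges above can be written down as plain Python data, which also makes the size of each grid explicit. This is a minimal sketch: the names SEARCH_SPACE and grid_size are illustrative assumptions and do not come from the paper or the original code.

```python
# Hypothetical encoding of the search ranges described above.
SEARCH_SPACE = {
    "DRT": {"max_depth": list(range(1, 21))},
    "RF": {
        "max_depth": list(range(1, 21)),
        "n_estimators": list(range(1, 101)),
    },
    "GBDT": {
        "max_depth": list(range(1, 21)),
        "n_estimators": list(range(1, 101)),
    },
}


def grid_size(space):
    """Return the number of hyperparameter combinations in one model's grid."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size


for name, space in SEARCH_SPACE.items():
    print(f"{name}: {grid_size(space)} combinations")
```

Each ensemble grid holds 20 × 100 = 2000 combinations, so an exhaustive search with 5-fold cross-validation fits 10,000 models per ensemble. That cost is worth keeping in mind before running the nested loops later in the post.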

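To make full use of the limited samples and obtain stable, reliable results, all three models are trained and tuned with 5-fold cross-validation. This strategy lets every record in the dataset take part in both training and validation, reduces the chance effects of a single random train-test split, and lowers the risk of overfitting, so the evaluation is more robust. During the hyperparameter search, RMSE is the main criterion: the parameter set with the lowest RMSE is taken as the model's optimal hyperparameters. Once these are fixed, the ML model is retrained with them to obtain the final model.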
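
The select-then-retrain procedure described above can also be expressed with scikit-learn's GridSearchCV. The sketch below is one possible way to do it, not the paper's or the author's implementation. It uses synthetic data from make_regression as a stand-in (an assumption), and its scoring option averages the per-fold RMSE values, which can differ slightly from taking the square root of the averaged fold MSE.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the article's spreadsheet is not used here.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": list(range(1, 21))},
    scoring="neg_root_mean_squared_error",  # scores are negated so higher is better
    cv=5,
    refit=True,  # retrain on the whole training set with the best parameters
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
best_cv_rmse = -search.best_score_  # undo the negation to get an RMSE
final_model = search.best_estimator_  # the refitted model

print(f"best max_depth = {best_depth}, CV RMSE = {best_cv_rmse:.3f}")
```

With refit=True, the estimator in best_estimator_ has already been retrained on the full training set, so it plays the role of the final model in the description above.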

✨ Code implementation ✨

  
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
# Suppress all warnings
warnings.filterwarnings("ignore")

path = r"2025-12-8公众号Python机器学习AI.xlsx"
df = pd.read_excel(path)

# Split into features and target
X = df.drop(['SR'], axis=1)  # drop the target column 'SR' to get the feature matrix X
y = df['SR']  # target variable y
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=42)

# Record the RMSE for each max_depth
depths = list(range(1, 21))  # max_depth from 1 to 20
rmse_scores = []

# Run 5-fold cross-validation for each max_depth
for depth in depths:
    model.set_params(max_depth=depth)  # set the current max_depth
    # 5-fold CV; scores holds the negated MSE of each fold
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())  # RMSE as the square root of the mean fold MSE
    rmse_scores.append(rmse)

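On the training set, the decision tree regressor is evaluated with each max_depth from 1 to 20, and 5-fold cross-validation gives the RMSE for each depth. This shows how model complexity affects prediction error and provides the basis for choosing the optimal tree depth.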
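
The loop above takes the square root of the mean fold MSE. Another common convention is to average the per-fold RMSE values. The two are close but not identical, and the first is never smaller than the second. The short example below illustrates this with hypothetical per-fold MSE values chosen only for easy arithmetic; they are not results from the article.

```python
import numpy as np

# Hypothetical per-fold MSE values (assumption), chosen as perfect squares.
fold_mse = np.array([100.0, 144.0, 196.0, 225.0, 256.0])

# Convention used in the loop above: square root of the mean fold MSE.
rmse_of_mean_mse = np.sqrt(fold_mse.mean())

# Alternative convention: mean of the per-fold RMSE values (10, 12, 14, 15, 16).
mean_of_fold_rmse = np.sqrt(fold_mse).mean()

print(f"sqrt(mean MSE)     = {rmse_of_mean_mse:.3f}")
print(f"mean(per-fold RMSE) = {mean_of_fold_rmse:.3f}")
```

Either convention works for ranking hyperparameters, as long as the same one is used for every candidate being compared.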

  
# Collect the CV results and keep the first 10 depths for plotting
results = pd.DataFrame({'max_depth': depths, 'RMSE': rmse_scores})
max_depth = results.head(10)['max_depth']
rmse = results.head(10)['RMSE']

# Smallest RMSE among the plotted depths, used to highlight its label
min_rmse = rmse.min()

fig, ax = plt.subplots(figsize=(6, 5), dpi=100)

# Bars for each depth, with a dashed line tracing the same values
ax.bar(max_depth, rmse, color='#009afe', edgecolor='black', width=0.7, zorder=2)
ax.plot(max_depth, rmse, color='black', linestyle='--', linewidth=1, marker='o', markersize=3, zorder=3)

# Label every bar with its RMSE; the minimum is drawn in red
for i, val in enumerate(rmse):
    x_pos = max_depth.iloc[i]
    is_min = (val == min_rmse)
    text_color = 'red' if is_min else 'black'
    font_weight = 'bold'
    label_text = f"{val:.2f}"
    ax.text(x_pos, val + 0.3, label_text, ha='center', va='bottom',
            fontsize=12, color=text_color, fontweight=font_weight)

# Axis range chosen for this dataset; adjust it for other RMSE scales
ax.set_ylim(10, 20)
ax.set_xticks(max_depth)

ax.set_xlabel('max_depth', fontsize=14, fontweight='bold')
ax.set_ylabel('RMSE', fontsize=14, fontweight='bold')

# Bold tick labels and thicker axis lines
ax.tick_params(axis='both', which='major', labelsize=12, width=1.2)
for label in ax.get_xticklabels() + ax.get_yticklabels():
    label.set_fontweight('bold')

for spine in ax.spines.values():
    spine.set_linewidth(1.2)
    spine.set_color('black')

plt.tight_layout()
plt.savefig("DT-1.pdf", format='pdf', bbox_inches='tight', dpi=1200)
plt.show()

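[Figure: bar chart with an overlaid dashed line showing the 5-fold CV RMSE of the decision tree for max_depth 1–10; the minimum RMSE is labeled in red]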

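The RMSE results for the first 10 max_depth values of the decision tree are shown as a bar chart combined with a line plot (matching the figure style in the paper), with the minimum RMSE highlighted, to show how the hyperparameter affects model performance.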

  
# Take the best max_depth from the cross-validation results above
best_max_depth = int(results.loc[results['RMSE'].idxmin(), 'max_depth'])
# Build the DRT model with the best max_depth
DT = DecisionTreeRegressor(max_depth=best_max_depth, random_state=42)
# Train the model on the full training set
DT.fit(X_train, y_train)

The final decision tree regressor (DRT) is built and trained with the optimal max_depth found by cross-validation.

  
from sklearn.metrics import r2_score
# Predict on both sets
y_train_pred = DT.predict(X_train)
y_pred = DT.predict(X_test)

# Compute performance metrics for the training and test sets
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)
r2_train = r2_score(y_train, y_train_pred)

mse_test = mean_squared_error(y_test, y_pred)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, y_pred)

# Print the performance metrics
print(f'Training set Mean Squared Error (MSE): {mse_train}')
print(f'Training set Root Mean Squared Error (RMSE): {rmse_train}')
print(f'Training set R-squared (R2): {r2_train}')

print('-------------------------')

print(f'Test set Mean Squared Error (MSE): {mse_test}')
print(f'Test set Root Mean Squared Error (RMSE): {rmse_test}')
print(f'Test set R-squared (R2): {r2_test}')

The optimal decision tree is used to compute and print MSE, RMSE, and R² on the training and test sets, in order to assess how well the model fits and how well it generalizes.

  
Training set Mean Squared Error (MSE): 61.71234047446087
Training set Root Mean Squared Error (RMSE): 7.855720239065344
Training set R-squared (R2): 0.921249032402706
-------------------------
Test set Mean Squared Error (MSE): 129.2040259387872
Test set Root Mean Squared Error (RMSE): 11.366794884169732
Test set R-squared (R2): 0.855267658828811


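The results show that the decision tree trained with the optimal hyperparameter fits the training set well (R² ≈ 0.92, RMSE ≈ 7.86) and also generalizes reasonably well to the test set (R² ≈ 0.86, RMSE ≈ 11.37). Note that the minimum RMSE in the earlier bar chart comes from 5-fold cross-validation: it is used during training to select the best max_depth and reflects the average error on the validation folds held out within the training set. The RMSE here is the prediction error on a fully independent test set. Some difference between the two is therefore normal, and the moderate gap between training and test error suggests that the model is reasonably stable and generalizes well.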
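
To make the distinction between these error estimates concrete, the sketch below computes the training RMSE, the 5-fold CV RMSE, and the test RMSE side by side for one decision tree. It uses synthetic data from make_regression and a fixed max_depth of 5, both assumptions made for illustration, so the numbers it prints are unrelated to the article's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the article's dataset.
X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeRegressor(max_depth=5, random_state=0)  # illustrative depth

# CV RMSE: every fold is scored on data that was held out of that fold's fit.
cv_mse = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
)
cv_rmse = np.sqrt(cv_mse.mean())

# Training and test RMSE come from a single fit on the whole training set.
model.fit(X_train, y_train)
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"train RMSE = {train_rmse:.2f}")
print(f"CV RMSE    = {cv_rmse:.2f}")
print(f"test RMSE  = {test_rmse:.2f}")
```

The training RMSE is measured on data the model has already seen, so it is usually the lowest of the three. The CV and test values are both out-of-sample estimates and tend to be closer to each other.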

  
from sklearn.ensemble import RandomForestRegressor

# Parameter ranges
max_depth_list = list(range(1, 21))        # 1–20
n_estimators_list = list(range(1, 101))    # 1–100

# Collect the grid-search results
records = []

for depth in max_depth_list:
    for n_est in n_estimators_list:

        model = RandomForestRegressor(
            max_depth=depth,
            n_estimators=n_est,
            random_state=42
        )

        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
        rmse = np.sqrt(-scores.mean())

        records.append([depth, n_est, rmse])

# Put the grid-search records into a DataFrame
results = pd.DataFrame(records, columns=['max_depth', 'n_estimators', 'RMSE'])

# Find the row in results with the lowest RMSE
best_row = results.loc[results['RMSE'].idxmin()]
best_depth = int(best_row['max_depth'])
best_n_estimators = int(best_row['n_estimators'])
best_rmse = best_row['RMSE']

print("Best parameters:")
print(f"  max_depth    = {best_depth}")
print(f"  n_estimators = {best_n_estimators}")
print(f"  CV RMSE      = {best_rmse:.4f}")

# Train the final random forest with the best parameters
RF = RandomForestRegressor(
    max_depth=best_depth,
    n_estimators=best_n_estimators,
    random_state=42
)
RF.fit(X_train, y_train)

[Figure: contour plot of the 5-fold CV RMSE over the max_depth × n_estimators grid for the random forest]

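For the random forest, 5-fold cross-validation is run over the grid of max_depth (1 to 20) and n_estimators (1 to 100) to find the combination with the lowest RMSE, and the whole search space is visualized as a contour plot. The model has its lowest cross-validation error at the best parameters and achieves a high R² and a low RMSE on the independent test set, which indicates good predictive performance.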
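
The plotting code for the contour figure is not shown in the post. The sketch below is one possible way to draw such a plot from a table with max_depth, n_estimators, and RMSE columns. To stay self-contained it fills the table with a smooth synthetic RMSE surface instead of real cross-validation results, and the output file name rf_grid_contour.png is also an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the grid-search records (assumption): a smooth surface
# that decreases as max_depth and n_estimators grow.
records = [
    [depth, n_est, 10 + 5 / depth + 20 / n_est]
    for depth in range(1, 21)
    for n_est in range(1, 101)
]
results = pd.DataFrame(records, columns=["max_depth", "n_estimators", "RMSE"])

# Reshape the long table into a 2-D grid: rows are max_depth, columns are n_estimators.
grid = results.pivot(index="max_depth", columns="n_estimators", values="RMSE")

fig, ax = plt.subplots(figsize=(6, 5))
contour = ax.contourf(
    grid.columns.values,  # x axis: n_estimators
    grid.index.values,    # y axis: max_depth
    grid.values,          # z values: RMSE, shape (20, 100)
    levels=20,
    cmap="viridis",
)
fig.colorbar(contour, ax=ax, label="CV RMSE")

# Mark the combination with the lowest RMSE.
best = results.loc[results["RMSE"].idxmin()]
ax.scatter(best["n_estimators"], best["max_depth"],
           color="red", marker="*", s=120, zorder=3)

ax.set_xlabel("n_estimators")
ax.set_ylabel("max_depth")
fig.savefig("rf_grid_contour.png", dpi=150, bbox_inches="tight")
plt.close(fig)
```

To plot real results, replace the synthetic records list with the one filled by the grid-search loop; the pivot and contourf steps stay the same.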

  
from sklearn.ensemble import GradientBoostingRegressor

# Parameter ranges
max_depth_list = list(range(1, 21))
n_estimators_list = list(range(1, 101))

# Collect the grid-search results
records = []

for depth in max_depth_list:
    for n_est in n_estimators_list:

        model = GradientBoostingRegressor(
            max_depth=depth,
            n_estimators=n_est,
            random_state=42
        )

        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
        rmse = np.sqrt(-scores.mean())

        records.append([depth, n_est, rmse])

# Put the grid-search records into a DataFrame
results = pd.DataFrame(records, columns=['max_depth', 'n_estimators', 'RMSE'])

# Find the row in results with the lowest RMSE
best_row = results.loc[results['RMSE'].idxmin()]
best_depth = int(best_row['max_depth'])
best_n_estimators = int(best_row['n_estimators'])
best_rmse = best_row['RMSE']

# Train the final GBDT model with the best parameters
GBDT = GradientBoostingRegressor(
    max_depth=best_depth,
    n_estimators=best_n_estimators,
    random_state=42
)

GBDT.fit(X_train, y_train)


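For GBDT, 5-fold cross-validation over the combinations of max_depth and n_estimators identifies the hyperparameters with the lowest RMSE, and the final gradient boosting regressor is trained with them.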

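Comparing the hyperparameter optimization and the final test-set performance of DRT, RF, and GBDT gives an overall assessment of the models. All three tree models reach a low RMSE in cross-validation, and the independent test set confirms that they generalize well. RF and GBDT are clearly more accurate than the single decision tree (DRT) and are more robust on both R² and RMSE. Across the three test-set results, GBDT performs best, with a higher Test R² ≈ 0.9065 and a lower Test RMSE ≈ 9.14, which suggests that it captures the nonlinear relationship between the features and the target more effectively. It is the best-performing model overall on the simulated dataset, and the overall message is consistent with Figure 3 of the paper.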
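
A comparison like this can be gathered into a single table. The sketch below fits the three model types on synthetic data and collects their test-set R² and RMSE. The data and the hyperparameter values are placeholders (assumptions), so the ranking it prints says nothing about the article's results; in practice the tuned values from each grid search would be substituted.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the article's dataset.
X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Placeholder hyperparameters (assumption): replace with the tuned values.
models = {
    "DRT": DecisionTreeRegressor(max_depth=5, random_state=42),
    "RF": RandomForestRegressor(max_depth=5, n_estimators=50, random_state=42),
    "GBDT": GradientBoostingRegressor(max_depth=3, n_estimators=50, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        "model": name,
        "test_R2": r2_score(y_test, y_pred),
        "test_RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    })

# Sort so the model with the lowest test RMSE comes first.
comparison = pd.DataFrame(rows).sort_values("test_RMSE").reset_index(drop=True)
print(comparison)
```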

