Journal Reproduction: Building a Prediction Model with the AutoML Framework AutoGluon



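This post covers building a prediction model with the AutoML framework AutoGluon. The data are simulated and carry no real-world meaning; the code and figures reflect the author's own understanding of machine learning and are for reference only. The complete data and code will be uploaded to the discussion group later, where members can download them.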

✨ Paper information ✨


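The figure from the paper compares three AutoML frameworks, FLAML, H2O AutoML, and AutoGluon, across different dataset split ratios. It reports the R² values, the observed versus predicted TN concentrations, and the MAE and RMSE of each model, all with a training time of 60 s. FLAML with a 2:8 split ratio performs best. The reproduction below is a simple implementation of the prediction model under the AutoGluon framework only. For more on the paper, see the earlier post "Journal Reproduction: Building Prediction Models with Automated Machine Learning, with Residual and Partial Dependence Analysis".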

✨ Basic code ✨

  
import warnings  
  
import matplotlib.pyplot as plt  
import numpy as np  
import pandas as pd  
from autogluon.tabular import TabularPredictor  
from sklearn.model_selection import train_test_split  
  
plt.rcParams['font.family'] = 'Times New Roman'  
plt.rcParams['axes.unicode_minus'] = False  
# Suppress all warnings  
warnings.filterwarnings("ignore")  
  
# Load the simulated dataset; 'SR' is the target column  
path = r"2025-9-19公众号Python机器学习AI.xlsx"  
df = pd.read_excel(path)  
X = df.drop(['SR'], axis=1)  
y = df['SR']  
# Split into training (70%) and test (30%) sets  
X_train, X_test, y_train, y_test = train_test_split(  
    X,  
    y,  
    test_size=0.3,  
    random_state=250807  
)  
  
# Merge the training features and target into one DataFrame for AutoGluon  
train_data = X_train.copy()  
train_data['SR'] = y_train  
  
# Create the TabularPredictor (label = target column) and fit it  
predictor = TabularPredictor(label='SR', path='autogluon_models').fit(  
    train_data,  
    time_limit=60  # maximum training time in seconds  
)

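The code above reads the data, splits it into training and test sets, and uses AutoGluon to train a model that predicts the target variable SR, with training capped at 60 seconds. A larger time limit can be set; early stopping is enabled by default. AutoML saves time, simplifies the workflow, improves efficiency, and suits non-experts. Its drawbacks are poorer interpretability, high computational cost, limited flexibility, and a risk of overfitting.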
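To make the interplay between a time limit and early stopping more concrete, here is a minimal sketch in plain Python. It is not how AutoGluon or CatBoost implement either mechanism; `train_with_budget`, `patience`, and the loss values are hypothetical and exist only for illustration.

```python
import itertools
import time


def train_with_budget(validation_losses, time_limit, patience, clock=time.monotonic):
    """Walk through per-round validation losses and stop early when needed.

    Hypothetical helper for illustration only.
    Returns a tuple (best_round, best_loss, reason).
    """
    start = clock()
    best_round, best_loss = -1, float("inf")
    for round_idx, loss in enumerate(validation_losses):
        # Stop as soon as the time budget is used up.
        if clock() - start >= time_limit:
            return best_round, best_loss, "time limit reached"
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds.
            return best_round, best_loss, "no recent improvement"
    return best_round, best_loss, "all rounds completed"


# Toy validation losses (made up): the loss stops improving after round 2.
toy_losses = [10.0, 9.0, 8.5, 8.6, 8.7, 8.8, 8.4]
stagnation_result = train_with_budget(toy_losses, time_limit=60, patience=3)

# A fake clock that advances one "second" per call makes the timeout deterministic.
fake_ticks = itertools.count()
timeout_result = train_with_budget(
    [10.0, 9.0, 8.0, 7.0], time_limit=2.5, patience=3, clock=lambda: next(fake_ticks)
)

print(stagnation_result)
print(timeout_result)
```

In both cases the function returns the best round seen so far, which mirrors the idea that a model stopped by the clock still keeps its best state up to that point.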

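For most applications, especially those that need transparent, tunable, and interpretable models, sklearn remains the first choice. It gives you precise control over every step, lets you understand the model's internals, and allows deep customization and optimization.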

  
......  
Fitting model: CatBoost ... Training model for up to 52.63s of the 52.63s of remaining time.  
 Fitting with cpus=8, gpus=0  
 Ran out of time, early stopping on iteration 8228.  
 -8.0179 = Validation score   (-root_mean_squared_error)  
 52.77s = Training   runtime  
 0.0s = Validation runtime  
Fitting model: WeightedEnsemble_L2 ... Training model for up to 59.94s of the -0.23s of remaining time.  
 Ensemble Weights: {'CatBoost': 0.92, 'RandomForestMSE': 0.08}  
 -8.0103 = Validation score   (-root_mean_squared_error)  
 0.01s = Training   runtime  
 0.0s = Validation runtime  
AutoGluon training complete, total runtime = 60.27s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 1554.7 rows/s (78 batch size)  
......

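This is part of the log AutoGluon prints during training. CatBoost was given a budget of 52.63 s and trained for 52.77 s; it ran out of time and stopped early at iteration 8228, with a validation RMSE of 8.0179 (logged as -8.0179 because AutoGluon negates error metrics so that higher is always better). WeightedEnsemble_L2 is a weighted ensemble of base models, here CatBoost with weight 0.92 and RandomForestMSE with weight 0.08. Its validation RMSE is 8.0103, slightly better than CatBoost alone. Only part of the model output is shown here.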
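The sign convention in the log is easy to misread, so the short sketch below converts the two logged validation scores back to RMSE and quantifies the ensemble's gain. The scores are copied from the log; `rmse_from_score` is a hypothetical helper name.

```python
# Validation scores copied from the training log above.
score_catboost = -8.0179
score_ensemble = -8.0103


def rmse_from_score(score_val):
    """Undo the negation: the log shows -RMSE so that higher is better."""
    return -score_val


rmse_catboost = rmse_from_score(score_catboost)
rmse_ensemble = rmse_from_score(score_ensemble)

# How much the weighted ensemble improves on CatBoost alone.
absolute_gain = rmse_catboost - rmse_ensemble
relative_gain = absolute_gain / rmse_catboost

print(f"CatBoost RMSE: {rmse_catboost:.4f}")
print(f"Ensemble RMSE: {rmse_ensemble:.4f}")
print(f"Improvement:   {absolute_gain:.4f} ({relative_gain:.2%})")
```

The improvement is about 0.0076 RMSE units, roughly 0.1% of CatBoost's error, so the ensemble is better on validation data but only marginally.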

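AutoGluon trains several base models and combines two of them, CatBoost and RandomForestMSE, into a weighted ensemble. This ensemble, WeightedEnsemble_L2, is selected as the best model when training completes.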
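As a rough illustration of what a weighted ensemble does at prediction time, the sketch below blends two sets of predictions with the weights from the log. The target values and per-model predictions are made-up toy numbers, and the helper names are hypothetical; how AutoGluon learns the weights is not covered here.

```python
import math


def weighted_ensemble(predictions, weights):
    """Blend per-model predictions row by row using a weighted sum."""
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in weights)
        for i in range(n_rows)
    ]


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Ensemble weights copied from the training log.
weights = {"CatBoost": 0.92, "RandomForestMSE": 0.08}

# Toy data (made up for illustration, not from the article's dataset).
y_true = [50.0, 60.0, 70.0]
predictions = {
    "CatBoost": [52.0, 58.0, 71.0],
    "RandomForestMSE": [45.0, 66.0, 64.0],
}

blended = weighted_ensemble(predictions, weights)

for name, preds in predictions.items():
    print(f"{name:>16} RMSE: {rmse(y_true, preds):.4f}")
print(f"{'Blended':>16} RMSE: {rmse(y_true, blended):.4f}")
```

In this toy case the two models err in opposite directions on every row, so even a small weight on the weaker model lowers the blended error. That is the effect a weighted ensemble relies on, though it is not guaranteed on arbitrary data.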

  
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  
  
# Predict on both the training set and the test set  
y_train_pred = predictor.predict(X_train)  
y_pred = predictor.predict(X_test)  
# Compute the training-set metrics  
mae = mean_absolute_error(y_train, y_train_pred)  
mse = mean_squared_error(y_train, y_train_pred)  
rmse = np.sqrt(mse)  
r2 = r2_score(y_train, y_train_pred)  
  
# Print the training-set metrics  
print(f'Training set Mean Absolute Error (MAE): {mae}')  
print(f'Training set Mean Squared Error (MSE): {mse}')  
print(f'Training set Root Mean Squared Error (RMSE): {rmse}')  
print(f'Training set R-squared (R2): {r2}')  
  
print('-------------------------')  
  
# Compute the test-set metrics  
mae = mean_absolute_error(y_test, y_pred)  
mse = mean_squared_error(y_test, y_pred)  
rmse = np.sqrt(mse)  
r2 = r2_score(y_test, y_pred)  
  
# Print the test-set metrics  
print(f'Test set Mean Absolute Error (MAE): {mae}')  
print(f'Test set Mean Squared Error (MSE): {mse}')  
print(f'Test set Root Mean Squared Error (RMSE): {rmse}')  
print(f'Test set R-squared (R2): {r2}')

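The best model trained by AutoGluon predicts on both the training set and the test set, and the resulting metrics are used to evaluate its performance on each.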

  
Training set Mean Absolute Error (MAE): 1.3505179565091372  
Training set Mean Squared Error (MSE): 12.977376674896297  
Training set Root Mean Squared Error (RMSE): 3.602412618634392  
Training set R-squared (R2): 0.9841520819477437  
-------------------------  
Test set Mean Absolute Error (MAE): 6.8785289958708296  
Test set Mean Squared Error (MSE): 96.13593650709717  
Test set Root Mean Squared Error (RMSE): 9.8048934979987  
Test set R-squared (R2): 0.881238429418675
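For readers who want to see what the four printed metrics compute, here is a dependency-free sketch of their standard definitions. The function name and the toy values are hypothetical; in practice the sklearn functions used above are the better choice.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Toy values (made up for illustration).
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 11.0]
toy_metrics = regression_metrics(toy_true, toy_pred)

for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE, MSE, and RMSE are in the units of the target (or its square), so lower is better; R² is unitless, and values closer to 1 indicate that the predictions explain more of the variance in the target.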

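These metrics show that the model fits the training set very well (R² ≈ 0.984, RMSE ≈ 3.60) but has noticeably larger errors on the test set (R² ≈ 0.881, RMSE ≈ 9.80, with higher MAE and MSE as well). This gap suggests some overfitting: the model fits the training data too closely and generalizes less well to new data.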
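The size of the train/test gap can be worked out directly from the printed numbers. The sketch below uses only values reported in this post (the metrics above and the ensemble's validation RMSE of 8.010347 from the leaderboard); the variable names are illustrative.

```python
import math

# Metrics copied from the printed output above.
train = {"mae": 1.3505179565091372, "mse": 12.977376674896297,
         "rmse": 3.602412618634392, "r2": 0.9841520819477437}
test = {"mae": 6.8785289958708296, "mse": 96.13593650709717,
        "rmse": 9.8048934979987, "r2": 0.881238429418675}
# Validation RMSE of WeightedEnsemble_L2 (score_val is its negative).
validation_rmse = 8.010347

# Sanity check: RMSE is the square root of MSE.
assert math.isclose(math.sqrt(train["mse"]), train["rmse"], rel_tol=1e-9)
assert math.isclose(math.sqrt(test["mse"]), test["rmse"], rel_tol=1e-9)

rmse_ratio = test["rmse"] / train["rmse"]
mae_ratio = test["mae"] / train["mae"]
r2_drop = train["r2"] - test["r2"]
test_vs_validation = test["rmse"] / validation_rmse

print(f"Test/train RMSE ratio:      {rmse_ratio:.2f}")
print(f"Test/train MAE ratio:       {mae_ratio:.2f}")
print(f"Drop in R2:                 {r2_drop:.3f}")
print(f"Test/validation RMSE ratio: {test_vs_validation:.2f}")
```

Test RMSE is about 2.7 times the training RMSE and test MAE about 5.1 times the training MAE, while R² drops by roughly 0.10. The test RMSE is much closer to the validation RMSE (about 1.2 times), which suggests the validation score is a more realistic estimate of generalization than the training-set error.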


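A joint plot combining a scatter plot with marginal histograms shows the relationship between the actual and predicted values for the training and test sets.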

  
# Print the leaderboard of all models trained by AutoGluon  
leaderboard = predictor.leaderboard(silent=False)  
  
# Print the details of the best model  
best_model = leaderboard.iloc[0]  # top row = best-performing model  
print(f"Best Model: {best_model}")

The leaderboard() method returns the performance ranking of all trained models; the code then prints the details of the best-performing one.

  
                 model  score_val              eval_metric  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order  
0  WeightedEnsemble_L2  -8.010347  root_mean_squared_error       0.050172  53.450475                0.001002           0.008574            2       True          5  
1             CatBoost  -8.017939  root_mean_squared_error       0.002456  52.771816                0.002456          52.771816            1       True          4  
2           LightGBMXT  -8.798547  root_mean_squared_error       0.007994   5.946676                0.007994           5.946676            1       True          1  
3      RandomForestMSE  -8.996581  root_mean_squared_error       0.046714   0.670086                0.046714           0.670086            1       True          3  
4             LightGBM  -9.153150  root_mean_squared_error       0.004008   0.362878                0.004008           0.362878            1       True          2  
Best Model: model                         WeightedEnsemble_L2  
score_val                               -8.010347  
eval_metric               root_mean_squared_error  
pred_time_val                            0.050172  
fit_time                                53.450475  
pred_time_val_marginal                   0.001002  
fit_time_marginal                        0.008574  
stack_level                                     2  
can_infer                                    True  
fit_order                                       5
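The leaderboard above also shows how the total and marginal time columns relate for a stacked model. As far as the AutoGluon documentation describes it, `fit_time` and `pred_time_val` of an ensemble include its base models, while the `_marginal` columns count only the ensemble step itself. The sketch below checks this against the printed numbers; the dictionary layout and the helper name are illustrative, and the values are copied from the output.

```python
# Selected columns copied from the leaderboard output above.
leaderboard_rows = {
    "WeightedEnsemble_L2": {"score_val": -8.010347, "pred_time_val": 0.050172, "fit_time": 53.450475},
    "CatBoost": {"score_val": -8.017939, "pred_time_val": 0.002456, "fit_time": 52.771816},
    "LightGBMXT": {"score_val": -8.798547, "pred_time_val": 0.007994, "fit_time": 5.946676},
    "RandomForestMSE": {"score_val": -8.996581, "pred_time_val": 0.046714, "fit_time": 0.670086},
    "LightGBM": {"score_val": -9.153150, "pred_time_val": 0.004008, "fit_time": 0.362878},
}
# Marginal cost of the ensemble step alone (also from the leaderboard).
ensemble_marginal = {"pred_time_val": 0.001002, "fit_time": 0.008574}
# Base models with non-zero weight, from the "Ensemble Weights" log line.
ensemble_members = ["CatBoost", "RandomForestMSE"]


def rebuild_total(column):
    """Marginal ensemble cost plus the cost of the base models it uses."""
    base_cost = sum(leaderboard_rows[name][column] for name in ensemble_members)
    return ensemble_marginal[column] + base_cost


rebuilt_fit_time = rebuild_total("fit_time")
rebuilt_pred_time = rebuild_total("pred_time_val")

# score_val is the negative RMSE, so the highest score is the best model.
best_name = max(leaderboard_rows, key=lambda name: leaderboard_rows[name]["score_val"])

print(f"Rebuilt fit_time:      {rebuilt_fit_time:.6f} (reported 53.450475)")
print(f"Rebuilt pred_time_val: {rebuilt_pred_time:.6f} (reported 0.050172)")
print(f"Best model by score_val: {best_name}")
```

The rebuilt totals match the reported ones up to rounding in the last printed digit. LightGBMXT and LightGBM do not contribute to the ensemble's timings because they received no weight.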