
Learning Result Prediction Based on Improved SMOTE Algorithm and Ensemble Model

Abstract: To address the poor applicability of a single machine learning algorithm to data classification and prediction tasks across different domains, and to mitigate the impact of severely imbalanced datasets on prediction performance, a learning result prediction method based on the Synthetic Minority Oversampling Technique (SMOTE) and an ensemble model is proposed. The traditional SMOTE algorithm generates new synthetic samples by interpolating between minority-class samples, which can introduce noise and produce synthetic samples that are highly similar to one another. To address these issues, an improved SMOTE algorithm is proposed that removes noisy and easily confused samples through distance calculation, yielding pure synthetic samples with high discriminability. An ensemble method is then used to adjust the weights of samples and classifiers, combining them into a stronger classifier with better classification performance. Experimental results on the public online learning dataset Kalboard 360 show that, when the Extremely Randomized Trees (ERT) classifier is combined with the improved SMOTE algorithm and the ensemble model, a prediction accuracy of 97.9% is achieved, an improvement of 5.5% over a single ERT classifier. This demonstrates that the proposed improved SMOTE algorithm can generate high-quality balanced data, and that the ensemble learning model significantly outperforms a single machine learning algorithm.
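
The abstract outlines three steps: SMOTE interpolation over the minority class, a distance-based cleaning of the synthetic samples, and a boosting-style ensemble that re-weights samples and classifiers. The Python sketch below illustrates one way these steps could fit together. The filtering rule (keep a synthetic point only if it lies closer to the real minority class than to the majority class, and drop near-duplicates), the toy data, and the AdaBoost-over-ERT combination are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of an "improved SMOTE + ensemble" pipeline.
# The cleaning criterion and the ensemble configuration are assumptions for
# illustration; the paper's exact distance rule and weighting scheme may differ.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier


def smote_oversample(X_min, n_new, k=5, seed=None):
    """Classic SMOTE: interpolate between a minority sample and one of its
    k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbours
    base = rng.integers(0, len(X_min), size=n_new)      # randomly chosen seed samples
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                         # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])


def filter_synthetic(X_syn, X_min, X_maj, dup_tol=1e-3):
    """Assumed cleaning step: keep a synthetic sample only if it is closer to the
    real minority class than to the majority class (removes noisy / easily confused
    points) and is not a near-duplicate of an already kept synthetic sample."""
    d_min = np.linalg.norm(X_syn[:, None, :] - X_min[None, :, :], axis=-1).min(axis=1)
    d_maj = np.linalg.norm(X_syn[:, None, :] - X_maj[None, :, :], axis=-1).min(axis=1)
    candidates = np.flatnonzero(d_min < d_maj)
    kept = []
    for i in candidates:
        if all(np.linalg.norm(X_syn[i] - X_syn[j]) > dup_tol for j in kept):
            kept.append(i)
    return X_syn[kept]


rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(300, 4))              # majority class (toy data)
X_min = rng.normal(2.0, 1.0, size=(30, 4))               # minority class (toy data)
X_syn = smote_oversample(X_min, n_new=270, k=5, seed=1)
X_syn = filter_synthetic(X_syn, X_min, X_maj)

X = np.vstack([X_maj, X_min, X_syn])
y = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min) + len(X_syn))])

# Boosting adjusts sample and classifier weights; an ERT base learner stands in here
# for the ensemble model described in the abstract (again an assumption).
clf = AdaBoostClassifier(ExtraTreesClassifier(n_estimators=50), n_estimators=10)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))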

     
