SMENN-hybrid: an efficient technique combining the synthetic minority oversampling technique with ensemble learning for diabetes prediction

Abstract

Diabetes mellitus (DM) is a chronic metabolic disorder and a major global health problem, and many cases remain undiagnosed. Early detection and effective management are essential to prevent complications. This paper presents SMENN-Hybrid, an efficient hybrid technique that combines the Synthetic Minority Oversampling Technique with Edited Nearest Neighbors (SMOTE-ENN) and ensemble learning. Gradient Boosting was identified as the most effective ensemble method through multi-metric evaluation. The proposed approach was evaluated on five diverse datasets: PIMA India, Diabetes Prediction Dataset (DPD), Diabetes Dataset 2019, Raw Merged Dataset (RMD), and Cleaned Merged Dataset (CMD). A multi-metric assessment covering F1-Score, ROC-AUC, and Accuracy showed strong generalizability, with Gradient Boosting achieving a composite score of 99.93/100 and a coefficient of variation (CV) below 2% on every metric (mean F1 = 0.9860, ROC-AUC = 0.9990, Accuracy = 0.9860). Five-fold stratified cross-validation confirmed this stability (overall coefficient of variation < 1.65% for all metrics), and systematic ablation studies validated the synergy between SMOTE and ENN, showing average improvements of +16.78% in F1-Score and +29.47% in Recall over unbalanced baselines. Compared with traditional methods (Logistic Regression and Decision Tree), the proposed framework improved average F1-Score by +2.99% over the best baseline, with individual dataset gains ranging from +3.25% to +3.99%. Despite a 246× longer training time, inference remains practical at 2.47 ms, making the approach suitable for real-time clinical deployment.
The combination of high effectiveness (mean F1 = 0.9841), consistency (coefficient of variation < 2%), and validation across multiple datasets and evaluation dimensions positions this framework as a clinically deployable solution for diabetes detection without dataset-specific tuning, with potential advantages for similar healthcare classification tasks.
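
The resampling step named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. It assumes a binary problem, Euclidean distance, k = 3, and an ENN rule that removes any sample (from either class) whose label disagrees with the majority vote of its k nearest neighbours. All function names and the toy data are hypothetical. In practice a library implementation such as `SMOTEENN` from imbalanced-learn would normally be used, followed by a gradient boosting classifier.

```python
import math
import random
from collections import Counter


def nearest_indices(X, i, k, candidates):
    """Indices of the k points among `candidates` closest to X[i], excluding i."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until the minority class matches the
    largest class. Each synthetic point is a random interpolation between a
    minority sample and one of its k nearest minority neighbours.
    Needs at least two minority samples."""
    rng = rng or random.Random(0)
    counts = Counter(y)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = max(counts.values()) - counts[minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        i = rng.choice(minority_idx)
        j = rng.choice(nearest_indices(X, i, k, minority_idx))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label agrees with the majority vote of
    their k nearest neighbours (assumed ENN variant, applied to all classes)."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = Counter(y[j] for j in nearest_indices(X, i, k, everyone))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """SMOTE oversampling followed by ENN cleaning."""
    X_s, y_s = smote(X, y, minority, k, random.Random(seed))
    return edited_nearest_neighbours(X_s, y_s, k)


# Toy data, invented for illustration only: five majority points near the
# origin, three minority points near (10, 10), and one mislabelled majority
# point sitting inside the minority cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [10, 10.5],
     [10, 10], [10, 11], [11, 10]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_res, y_res = smote_enn(X, y, minority=1)
print(Counter(y), "->", Counter(y_res))
```

On this toy data SMOTE raises the minority class from 3 to 6 samples, and ENN then discards the single majority point lying inside the minority cluster, leaving 5 majority and 6 minority samples. The resampled set would be used only for training. Resampling the evaluation folds would leak synthetic points into the test data.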
