Abstract
Background: Class imbalance and limited interpretability remain major barriers to the clinical adoption of machine learning in diabetes prediction. These challenges often result in poor sensitivity to high-risk cases and reduced trust in AI-based decision support. This study addresses these limitations by integrating SMOTE-based resampling with SHAP-driven explainability, aiming to enhance both predictive performance and clinical transparency for real-world deployment.
Objective: To develop and validate an interpretable machine learning framework that addresses class imbalance through advanced resampling while providing clinically meaningful explanations for enhanced decision support. The study is framed as a methodologically rigorous proof-of-concept rather than a clinically generalizable evaluation: it analyzes a computationally feasible subset of 1500 records from the publicly available, de-identified Diabetes Prediction Dataset hosted on Kaggle, which is synthetic/derivative rather than a clinically curated cohort. Future work will extend to the full 100,000-patient dataset to evaluate scalability and external validity.
Methods: We implemented a seven-stage pipeline integrating the Synthetic Minority Oversampling Technique (SMOTE) with SHapley Additive exPlanations (SHAP) to address class imbalance and enhance model interpretability. Five machine learning algorithms (Random Forest, Gradient Boosting, Support Vector Machine (SVM), Logistic Regression, and XGBoost) were comparatively evaluated on a stratified random sample of 1500 patient records drawn from the full dataset (n = 100,000).
To ensure methodological rigor and prevent data leakage, all preprocessing steps, including SMOTE application, were performed within the training folds of a 5-fold stratified cross-validation framework, preserving the original class distribution in each fold. Model performance was assessed using accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, F1-score, and precision. Statistical significance was determined using McNemar's test, with p-values adjusted via the Bonferroni correction to control for multiple comparisons.
Results: The Random Forest-SMOTE model achieved the best performance, with 96.91% accuracy (95% CI: 95.4-98.2%), an AUC of 0.998, sensitivity of 99.5%, and specificity of 97.3%, significantly outperforming recent benchmarks (p < 0.001). SHAP analysis identified glucose (SHAP value: 2.34) and BMI (SHAP value: 1.87) as the primary predictors, demonstrating strong clinical concordance. Feature interaction analysis revealed synergistic effects between glucose and BMI, providing actionable insights for personalized intervention strategies.
Conclusions: Our framework bridges algorithmic performance and clinical applicability, with Random Forest reaching 96.9% accuracy and 0.998 AUC under cross-validation on a publicly available Kaggle dataset. These results are dataset-specific and should not be interpreted as clinical performance; at this stage, the study should be regarded as a methodological proof-of-concept. External, prospective validation in real-world cohorts is required before any consideration of clinical deployment, particularly for personalized risk assessment in healthcare systems.