Abstract
Patients with type 2 diabetes mellitus (T2DM) have a significantly higher risk of cardiovascular disease (CVD) than the general population. Accurately predicting this risk is crucial for developing personalized treatment plans and public health interventions. This study aims to develop and validate a model for predicting CVD risk in T2DM patients using the Boruta feature selection algorithm and machine learning methods. We analyzed data from the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2018. Six machine learning (ML) models were developed and validated: Multilayer Perceptron (MLP), Light Gradient Boosting Machine (LightGBM), Decision Tree (DT), Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), and k-Nearest Neighbors (KNN). Boruta was used for feature selection. Model performance was evaluated using receiver operating characteristic (ROC) curves, accuracy, and other related metrics. Shapley Additive Explanation (SHAP) analysis was conducted for visual interpretation, and the Shinyapps.io platform was used to deploy the best-performing models as web-based applications. A total of 4,015 T2DM patients were included, of whom 999 (24.9%) had CVD. Model evaluation revealed substantial overfitting with the KNN algorithm, which showed perfect discrimination in the training set but performed poorly in the test set (area under the ROC curve [AUC] = 0.64). In contrast, XGBoost performed more consistently across the training and test sets (AUC = 0.75 and 0.72, respectively), indicating better generalization and making it more suitable for clinical application. The 10 most important factors identified by SHAP analysis of the XGBoost model were used to construct a CVD risk prediction platform for T2DM patients. The prediction model based on Boruta feature selection and machine learning shows promising results in assessing CVD risk among T2DM patients.
This study provides a viable tool for clinical use, facilitating early intervention and precision treatment.
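
The overfitting comparison reported above rests on the gap between training and test AUC. The following is a minimal, dependency-free sketch of that check, not the study's actual pipeline. The function names and the 0.10 gap threshold are hypothetical, and "perfect discrimination" for KNN on the training set is assumed to mean an AUC of 1.00; the remaining AUC values are those quoted in the abstract.

```python
from typing import Dict, Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def generalization_gap(train_auc: float, test_auc: float) -> float:
    """Training AUC minus test AUC; a large positive gap suggests overfitting."""
    return train_auc - test_auc


# (train AUC, test AUC). The KNN training value of 1.00 is an assumption
# standing in for "perfect discrimination"; the others come from the abstract.
reported: Dict[str, Tuple[float, float]] = {
    "KNN": (1.00, 0.64),
    "XGBoost": (0.75, 0.72),
}

GAP_THRESHOLD = 0.10  # hypothetical cut-off, chosen only for illustration

# Flag each model whose train/test gap exceeds the illustrative threshold.
flags = {
    name: generalization_gap(train, test) > GAP_THRESHOLD
    for name, (train, test) in reported.items()
}

for name, (train, test) in reported.items():
    gap = generalization_gap(train, test)
    print(f"{name}: train={train:.2f} test={test:.2f} gap={gap:.2f} overfit={flags[name]}")
```

Under these assumptions KNN is flagged (gap of about 0.36) while XGBoost is not (gap of about 0.03), which matches the abstract's reading that XGBoost generalizes better. In practice `roc_auc` would be applied to each model's predicted probabilities on the training and test splits.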