Abstract
INTRODUCTION: Cardiovascular disease (CVD) is the leading cause of death among individuals with diabetes, accounting for nearly 50% of diabetes-related mortality. In Ethiopia, the burden of diabetes is increasing, and recent studies report a CVD prevalence of 37.26% among diabetic patients, yet predictive tools for identifying those at highest risk of developing CVD are lacking. This study employed machine learning to predict CVD among Ethiopian diabetic patients using Ethiopian Public Health Institute (EPHI) datasets, with a focus on identifying the most influential risk factors to inform public health decision-making.
OBJECTIVE: The main objective of this study is to predict CVD among diabetic patients in Ethiopia using machine learning techniques.
METHOD: The dataset comprised 9030 instances with 22 features sourced from the EPHI. The predictors included socio-demographic, behavioral, and clinical measurement data. Six models were employed: logistic regression (LR), decision tree, support vector machine, random forest, gradient boosting machine (GBM), and artificial neural network. The models were trained on 80% of the data and tested on the remaining 20%. The analysis was conducted in Python 3.10.
RESULTS: GBM demonstrated the highest overall performance, achieving an accuracy of 93%, followed closely by LR with 90% accuracy. In terms of precision, GBM and LR performed comparably, while LR achieved the highest recall (88%). GBM attained an F1 score of 82%, indicating a good balance between precision and recall. Additionally, receiver operating characteristic (ROC) analysis showed that GBM had the largest area under the curve (AUC), 0.96, reflecting superior discriminative ability.
CONCLUSION: The gradient boosting machine (GBM) model demonstrated the highest performance compared to the other models, achieving an accuracy of 93%. The most influential predictors in the GBM model were total cholesterol, hypertension, and fasting blood glucose levels. Pending external validation, the GBM model shows potential for future integration into clinical decision-support systems for the early prediction of CVD in individuals with diabetes.
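
For readers less familiar with the evaluation metrics reported above (accuracy, precision, recall, F1 score, and AUC), the following is a minimal sketch of how each can be computed for a binary CVD classifier. It is not the study's code: the function names, the 0.5 decision threshold, and the eight toy labels and scores are illustrative assumptions and are unrelated to the EPHI data.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 = CVD and 0 = no CVD."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from predicted class labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds. This is one reason a model can have high accuracy and AUC alongside a noticeably lower F1 score.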