Abstract
BACKGROUND: Cardiovascular disease (CVD) remains a major threat to health worldwide and a significant burden on healthcare systems. Cardiovascular risk prediction using machine learning (ML) models in patients with asthma remains largely unexplored. METHODS: In this cohort study of 641,042 participants, we used routinely collected electronic healthcare record data to evaluate several ML algorithms, including logistic regression, penalized logistic regression, decision trees, random forest and gradient boosting, with the aim of developing a model with high specificity. RESULTS: Penalized logistic regression was the best and simplest classification model in terms of discriminatory power (AUC = 0.85). Gradient boosting was the best model in terms of calibration, that is, how closely the predicted and observed probabilities of CVD agree. In all models, the number of previous cardiovascular events was the most influential predictor, followed by age and prescriptions of cardiovascular medications. The top predictor alone produced a reasonable level of predictive power (AUC = 0.66). CONCLUSION: We have created a novel model for predicting CVD within a year of asthma diagnosis in patients aged at least 50 years. Using penalized logistic regression, we achieved a high level of accuracy. Implementing this model would make it possible to screen out patients at low risk of CVD with high specificity and acceptable sensitivity. Penalized logistic regression and gradient boosting models have similar accuracy in screening out individuals at low risk of CVD. For this objective, penalized logistic regression may be more suitable for implementation than gradient boosting because it is simpler to use and more transparent. At a probability threshold of 8% (the outcome prevalence), both models reduced unnecessary treatments by approximately 52%.
These ML models performed better than traditional statistical risk prediction models. The unique contribution of this study is the construction of models that predict CVD within 12 months of asthma diagnosis using regression and machine learning, and the comparison of their accuracy using suitable statistical measures such as AUC and calibration to identify the best model. Further prospective studies in different populations, with external validation, are required to assess and validate these ML risk prediction models.
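The screening use described above, where a patient is treated as low risk when the predicted probability falls below a threshold, together with AUC as the measure of discrimination, can be illustrated with a minimal sketch. This is not the study's code: the function names `screening_metrics` and `auc`, the `screened_out` quantity, and the toy data are all hypothetical, and only the 8% threshold mirrors the abstract.

```python
from typing import Dict, Sequence


def screening_metrics(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float
) -> Dict[str, float]:
    """Sensitivity, specificity and share of patients below the threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        flagged = p >= threshold  # at or above the threshold counts as high risk
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 0:
            tn += 1
        else:
            fn += 1
    n = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        # Hypothetical summary: fraction of all patients classified as low risk.
        "screened_out": (tn + fn) / n if n else float("nan"),
    }


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the chance a random case outranks a random non-case (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy data for illustration only; not taken from the study.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.01, 0.02, 0.03, 0.05, 0.06, 0.10, 0.12, 0.04, 0.30, 0.07]

threshold = 0.08  # mirrors the 8% threshold quoted in the abstract
metrics = screening_metrics(y_true, y_prob, threshold)
area = auc(y_true, y_prob)
print(metrics, area)
```

In practice `y_prob` would come from a fitted model such as penalized logistic regression or gradient boosting. The pairwise AUC shown here is quadratic in the number of patients, so a rank-based implementation would be preferable for a cohort of this size.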