Abstract
BACKGROUND: Cardiovascular disease (CVD) is a leading cause of death in cancer survivors, yet predicting CVD risk in this population remains challenging. Risk prediction models that use machine learning algorithms could provide an objective method for accurate risk prediction to facilitate the prevention and management of CVD in cancer survivors. We evaluated previously tested machine learning algorithms and logistic regression (regularized by machine learning methods), and also tested and validated newer, more complex machine learning algorithms, for CVD prediction in cancer survivors.

METHODS: This multicenter study used a database of 3835 multiracial cancer survivors with 89 clinical, laboratory, and echocardiographic features collected over 20 years. Models were trained using repeated random and time-split samples and tested on a separate cohort of 329 patients. Model performance was assessed using the area under the receiver operating characteristic curve.

RESULTS: Regularized logistic regression achieved an area under the receiver operating characteristic curve of 0.845 for heart failure, 0.783 for atrial fibrillation, 0.792 for coronary artery disease, and 0.806 for composite CVD. For heart failure, this is comparable to the 0.837 and 0.848 achieved by the more advanced machine learning models Bayesian additive regression tree and random forest, respectively. De novo composite CVD (post-cancer diagnosis) was also predicted with an area under the receiver operating characteristic curve of 0.826 using regularized logistic regression, compared with 0.735 and 0.802 using decision tree and random forest, respectively.

CONCLUSIONS: Regularized logistic regression and advanced machine learning models demonstrated similar predictive performance, with institutional transferability. These tools may support risk stratification and prevention strategies in cardio-oncology using longitudinal data.

REGISTRATION: URL: https://www.clinicaltrials.gov; Unique identifier: NCT05377320.
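
For readers unfamiliar with the performance metric named in METHODS, the following is a minimal sketch of how an area under the receiver operating characteristic curve can be computed from predicted risk scores. It uses the rank-based (Mann-Whitney) formulation, in which the value equals the probability that a randomly chosen patient with the outcome receives a higher score than a randomly chosen patient without it. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the study's code or data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: iterable of 0/1 outcomes (1 = event, e.g. CVD diagnosed).
    scores: iterable of predicted risks; higher means higher predicted risk.
    Tied scores receive the average of the ranks they span.
    """
    labels = list(labels)
    scores = list(scores)
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes must be present")

    # Sort patients by score so ranks can be assigned in ascending order.
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Ranks i+1 .. j (1-based) share their average.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_block = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_block
        i = j

    # Subtract the smallest possible rank sum for the positives, then
    # normalize by the number of positive-negative pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy example (hypothetical values): four patients, two with the outcome.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(toy_labels, toy_scores))  # prints 0.75
```

A value of 0.5 corresponds to a model that ranks patients no better than chance, and 1.0 to perfect separation, which is the scale on which the results reported above should be read.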