Abstract
OBJECTIVE: Cardiovascular diseases (CVD) remain the leading cause of mortality globally, necessitating early risk identification to improve prevention and management strategies. Traditional risk prediction models, such as the Framingham Cardiovascular Risk Score and the Systematic Coronary Risk Evaluation, often fall short because they rely on classical statistical methods and a limited set of variables.

MATERIALS AND METHODS: This study used machine learning (ML) techniques to develop and validate a comprehensive CVD risk prediction model using data from the UK Biobank cohort. Our approach incorporated genetic scores, clinical parameters and lifestyle factors to create separate models for overall CVD, cerebrovascular diseases, thrombotic diseases and other cardiovascular conditions. We used a diverse set of ML algorithms, including Ridge Regression, Logistic Regression, Support Vector Machines, Random Forest, XGBoost, Multilayer Perceptron and Stacking, with rigorous cross-validation procedures to ensure robustness.

RESULTS: From an initial cohort of 502 407 individuals, 240 644 participants were included in the final analysis. The integrated model, combining clinical, lifestyle and genetic data, achieved the highest predictive performance with an area under the curve (AUC) of 0.85. Models based solely on clinical data, lifestyle data or polygenic risk scores showed lower predictive power. Stratified analysis revealed variable performance across CVD subtypes, with the highest subtype AUC of 0.83 for other cardiovascular diseases. Key predictors included age, systolic blood pressure, cystatin C levels and waist circumference.

CONCLUSION: The integration of clinical, lifestyle and genetic data substantially enhances the predictive accuracy of ML models for CVD risk. Our model demonstrates robust performance, with an improvement of 0.12 in AUC over the Framingham Cardiovascular Risk Score, highlighting its potential utility in clinical settings for early risk identification and personalised intervention strategies. Future research should focus on refining the model by incorporating additional predictors and validating its applicability across diverse populations.
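
For readers less familiar with the metric, an AUC can be read as the probability that a randomly chosen participant who developed CVD receives a higher predicted risk than a randomly chosen participant who did not. The sketch below illustrates that definition and how a difference in AUC between two models is obtained. It is an illustration only: the helper `auc` is a hypothetical name, the labels and scores are made-up toy values rather than UK Biobank data, and the pairwise calculation is not necessarily how the study computed its figures.

```python
from itertools import product


def auc(labels, scores):
    """AUC via pairwise comparison (Mann-Whitney form).

    labels: 1 for participants with the outcome, 0 otherwise.
    scores: predicted risk, higher meaning higher risk.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    # A case/non-case pair counts 1 if the case is ranked higher, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 4 cases, 5 non-cases, two hypothetical models.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
baseline_scores = [0.60, 0.40, 0.55, 0.30, 0.50, 0.35, 0.20, 0.45, 0.10]
integrated_scores = [0.80, 0.55, 0.70, 0.40, 0.50, 0.35, 0.20, 0.45, 0.10]

auc_baseline = auc(labels, baseline_scores)      # 15 of 20 pairs ranked correctly
auc_integrated = auc(labels, integrated_scores)  # 18 of 20 pairs ranked correctly

print(f"baseline AUC:   {auc_baseline:.2f}")
print(f"integrated AUC: {auc_integrated:.2f}")
print(f"difference:     {auc_integrated - auc_baseline:.2f}")
```

On these toy values the two AUCs are 0.75 and 0.90, so the difference is 0.15. The pairwise form is quadratic in cohort size, so for a cohort of this scale a rank-based or library implementation would be the practical choice.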