Characterisation of cardiovascular disease (CVD) incidence and machine learning risk prediction in middle-aged and elderly populations: data from the China health and retirement longitudinal study (CHARLS)

Abstract

BACKGROUND: With China's ageing population and changing lifestyles, middle-aged and elderly people have become a high-risk group for cardiovascular disease (CVD). This study aimed to analyse the incidence characteristics of CVD in this population and to develop a risk prediction model using data from the China Health and Retirement Longitudinal Study (CHARLS).

METHODS: We used CHARLS follow-up data to analyse CVD incidence in the Chinese middle-aged and elderly population over 9 years. Five machine learning (ML) algorithms were employed for risk prediction. Data preprocessing included missing-value imputation via random forest. Before model training, features were selected using the Least Absolute Shrinkage and Selection Operator with cross-validation (Lasso CV). The synthetic minority over-sampling technique (SMOTE) was applied to address class imbalance. Model performance was evaluated using the area under the ROC curve (AUC), precision, recall, and F1 score, and SHAP plots were used for interpretability.

RESULTS: After the exclusion criteria were applied, 12,580, 12,061, 11,545, and 11,619 participants were included in the four follow-up rounds. The cumulative incidence (CI) of CVD at 2, 4, 7, and 9 years was 2.846%, 8.971%, 17.869%, and 20.518%, respectively. CVD incidence differed significantly by gender, age, ethnicity, and region, with higher rates in females and in the northeast region. Ultimately, 8,080 participants and 24 features were analysed for CVD risk prediction, and five ML models were built on these features. Although the LGB model achieved an AUC of 0.818, indicating strong overall performance, its F1 score and recall were relatively low, at 0.509 and 43.1%, respectively. Shapley additive explanations (SHAP) analyses identified key predictive features, such as night sleep duration, TG levels, and waist circumference, and highlighted nonlinear relationships between these features and CVD risk.

CONCLUSIONS: Gender, age, ethnicity, and region are significant factors influencing CVD incidence. Although the LGB model demonstrated good overall performance, its low F1 score and recall reveal limitations in identifying individuals at high risk of CVD.
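
The gap between the reported AUC and the F1 score and recall can be made concrete with a short calculation. The sketch below is not part of the study's analysis. It assumes the standard definition F1 = 2PR / (P + R) and that the reported F1 score (0.509) and recall (43.1%) come from the same evaluation set and decision threshold, and it back-calculates the precision those two figures would imply. The function name `implied_precision` is hypothetical.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given F1 and recall R."""
    denom = 2 * recall - f1
    if denom <= 0:
        # For any valid precision in (0, 1], F1 is strictly below 2 * recall.
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# Values reported in the abstract for the LGB model.
f1_reported = 0.509
recall_reported = 0.431

# Assumption: both metrics refer to the same test set and threshold.
precision = implied_precision(f1_reported, recall_reported)

# Share of true CVD cases the model would fail to flag at this recall.
missed_share = 1 - recall_reported

print(f"implied precision: {precision:.3f}")
print(f"share of true cases missed: {missed_share:.3f}")
```

Under these assumptions, a recall of 43.1% means that more than half of the true CVD cases go unflagged, which is consistent with the abstract's conclusion that the model is limited in identifying high-risk individuals despite its AUC of 0.818.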
