Machine learning and Mendelian randomization identify key lifestyle factors in coronary heart disease: A NHANES-Based study

Abstract

OBJECTIVE: This study aims to bridge the gap between predictive modeling and causal inference. Using lifestyle data from the National Health and Nutrition Examination Survey (NHANES) database, it compares the predictive performance of multiple machine learning models for coronary heart disease (CHD). It then applies Mendelian randomization to identify and validate lifestyle variables that have both predictive power for, and a causal effect on, CHD.

METHODS: We extracted demographic and lifestyle variables from the NHANES database (2013-2018; n = 29,400). Recursive feature elimination (RFE) was used to rank variable importance and determine the optimal feature subset. Eight machine learning models were then developed for CHD prediction: Support Vector Machine (SVM), Neural Network (NN), Naive Bayes (NB), Extreme Gradient Boosting (XGBoost), Multilayer Perceptron (MLP), Generalized Linear Model (GLM), Adaptive Boosting (AdaBoost), and Decision Tree (DT). Model performance was evaluated using accuracy, precision, sensitivity (recall), specificity, F1-score, and the Receiver Operating Characteristic (ROC) curve, and variable contributions were visualized using Shapley Additive Explanations (SHAP). Mendelian randomization (MR) was then applied to distinguish associative from causal relationships, validating the top predictors with genetic instruments derived from Genome-Wide Association Studies (GWAS).

RESULTS: RFE identified age, sex, fasting blood glucose, body mass index (BMI), total cholesterol (TC) intake, sleep duration, diastolic blood pressure, and smoking as the most significant predictors of CHD. SVM outperformed DT, AdaBoost, XGBoost, NN, MLP, NB, and GLM, achieving an accuracy of 83.4% and an AUC of 0.909, demonstrating clinically actionable predictive power. MR confirmed causal associations for five variables: BMI (OR: 1.01, P < 0.001), TC (OR: 1.01, P < 0.001), insomnia (OR: 1.03, P < 0.001), diastolic blood pressure (OR: 1.20, P < 0.001), and smoking (OR: 1.03, P < 0.001). Fasting glucose showed no evidence of a causal effect (P > 0.05).

CONCLUSION: The SVM model, based on NHANES data, enables faster and more efficient prediction of CHD. The study identified age, sex, BMI, TC intake, sleep duration, diastolic blood pressure, and smoking as the lifestyle variables with the greatest impact on CHD. This dual approach advances precision prevention by combining predictive accuracy with genetic evidence.
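To make the reported MR odds ratios easier to interpret, the following is a minimal sketch of how per-variant GWAS summary statistics can be combined into a single causal estimate. The abstract does not state which MR estimator was used; the fixed-effect inverse-variance weighted (IVW) estimator is assumed here only because it is a common default. The `Instrument` class, the `ivw_estimate` function, and all numbers are hypothetical illustrations, not the study's data or code.

```python
import math
from dataclasses import dataclass


@dataclass
class Instrument:
    """Summary statistics for one genetic variant (hypothetical structure)."""
    beta_exposure: float  # variant effect on the exposure (e.g. BMI)
    beta_outcome: float   # variant effect on the outcome (log odds of CHD)
    se_outcome: float     # standard error of beta_outcome


def ivw_estimate(instruments):
    """Fixed-effect inverse-variance weighted (IVW) MR estimate.

    Assumption: IVW is used for illustration; the abstract does not name
    the estimator. Returns (log_or, standard_error, odds_ratio, p_value).
    """
    if not instruments:
        raise ValueError("at least one instrument is required")
    # Weighted regression of outcome effects on exposure effects through
    # the origin, with weights 1 / se_outcome^2.
    num = sum(i.beta_exposure * i.beta_outcome / i.se_outcome ** 2
              for i in instruments)
    den = sum(i.beta_exposure ** 2 / i.se_outcome ** 2 for i in instruments)
    log_or = num / den
    se = math.sqrt(1.0 / den)
    # Two-sided P value from a standard normal test statistic.
    z = log_or / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return log_or, se, math.exp(log_or), p_value


# Made-up numbers for illustration only (not from the study).
toy = [
    Instrument(0.10, 0.020, 0.010),
    Instrument(0.20, 0.040, 0.010),
    Instrument(0.15, 0.030, 0.010),
]
log_or, se, odds_ratio, p_value = ivw_estimate(toy)
print(f"OR = {odds_ratio:.3f}, P = {p_value:.3g}")
```

An odds ratio produced this way is expressed per unit increase in the genetically predicted exposure, so its size depends on the exposure's units. That may explain why values as close to 1 as 1.01 can still reach P < 0.001 when many instruments are combined.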
