Abstract
Stroke is a serious condition with high incidence and mortality. Timely and accurate risk assessment is critical for improving prevention and clinical intervention. This study investigates the application of machine learning models to stroke risk assessment. Data were drawn from the National Health and Nutrition Examination Survey for the years 1999 to 2002. To select relevant variables and construct predictive models, 3 approaches were applied: LASSO regression combined with stepwise selection, random forest, and the Boruta algorithm followed by LASSO regression. Model performance was evaluated using receiver operating characteristic and precision-recall curves, as well as calibration and decision curve analyses. The study analyzed data from 9922 participants, of whom 358 had a history of stroke. The combination of LASSO and stepwise regression identified key predictors and produced a model with strong discriminative performance (area under the curve [AUC] = 0.843). In comparison, the random forest approach selected fewer predictors and showed lower discriminative performance (AUC = 0.612). A model developed using the Boruta algorithm followed by LASSO regression performed similarly to the LASSO and stepwise model (AUC = 0.828). These findings illustrate how the choice of variable selection method can influence the predictive performance of the resulting model. Machine learning models built from the National Health and Nutrition Examination Survey database can serve as a reliable means of predicting stroke risk. Combining appropriate variable selection techniques and modeling methods makes it possible to build predictive models with high accuracy and clinical value.
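As a reading aid for the AUC values reported above, the following minimal sketch shows how an area under the receiver operating characteristic curve can be computed from predicted risk scores. It is not the study's code: the function name `roc_auc` and the toy labels and scores are hypothetical, chosen only for illustration. The sketch relies on the standard equivalence between the AUC and the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC via pairwise comparison (Mann-Whitney formulation).

    labels: iterable of 0/1 outcomes (1 = event, e.g. history of stroke)
    scores: iterable of predicted risk scores, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above non-case
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, NOT taken from the study.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.3]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")  # prints "AUC = 0.767"
```

Under this reading, an AUC of 0.843 means that the model ranks a participant with a history of stroke above a participant without one in roughly 84% of such pairs, whereas an AUC near 0.5 corresponds to chance-level ranking.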