Abstract
ST-segment elevation myocardial infarction (STEMI) is a life-threatening cardiovascular event whose incidence is influenced by meteorological conditions and air pollution. Traditional statistical methods often fail to capture the complex, nonlinear relationships between environmental factors and STEMI risk. This study analyzed hospitalization data from Harbin, China (2014-2023), alongside meteorological and air pollution data. Seven machine learning models, including LightGBM and XGBoost, were used to predict STEMI risk. Key predictors were identified via recursive feature elimination (RFE), and model performance was assessed with nested cross-validation using metrics such as AUC, accuracy, and recall. Shapley Additive Explanations (SHAP) were applied to interpret the impact of the key predictors. LightGBM achieved the best performance, with an AUC of 0.84 on the validation set, demonstrating high predictive accuracy and generalizability. Incorporating lagged effects together with RFE further improved the LightGBM model's interpretability and predictive power. SHAP analysis identified particulate matter (PM), SO₂, NO₂, and meteorological factors (e.g., air pressure, humidity, wind speed) as critical contributors, with significant lag effects observed. These findings underscore the cumulative impact of prolonged environmental exposure on cardiovascular health. This study developed a robust, interpretable predictive model that elucidates the complex relationship between environmental factors and STEMI incidence. The results provide valuable insights for early prevention, public health policymaking, and resource allocation.
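To make the notion of lagged effects concrete, the following is a minimal Python sketch of how same-day and lagged exposure features could be built from daily series before model fitting. It is an illustration only, not the study's pipeline: the function name `add_lagged_features`, the variable names `so2` and `humidity`, the toy values, and the two-day lag window are all assumptions, since the abstract does not state the lag structure that was used.

```python
from typing import Dict, List


def add_lagged_features(
    series: Dict[str, List[float]],
    max_lag: int,
) -> List[Dict[str, float]]:
    """Build one feature row per day from same-day and lagged values.

    `series` maps a variable name to its daily values in chronological
    order. Days without a full `max_lag`-day history are dropped.
    """
    lengths = {len(values) for values in series.values()}
    if len(lengths) != 1:
        raise ValueError("all series must have the same length")
    n_days = lengths.pop()

    rows: List[Dict[str, float]] = []
    for day in range(max_lag, n_days):
        row: Dict[str, float] = {}
        for name, values in series.items():
            # lag 0 is the same-day value; lag k is the value k days earlier
            for lag in range(max_lag + 1):
                row[f"{name}_lag{lag}"] = values[day - lag]
        rows.append(row)
    return rows


# Hypothetical toy data: four consecutive days of two exposure variables.
daily = {
    "so2": [10.0, 12.0, 9.0, 15.0],
    "humidity": [60.0, 55.0, 70.0, 65.0],
}
rows = add_lagged_features(daily, max_lag=2)
```

Each resulting row could then be paired with that day's outcome label and passed to feature selection and a gradient-boosting classifier. Dropping the first `max_lag` days avoids imputing exposure history that was never observed.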