Abstract
BACKGROUND: Helicobacter pylori (H. pylori) infection remains prevalent in regions such as Shanxi, China, and contributes to gastrointestinal morbidity. Accurate identification of high-risk individuals is essential for effective screening and early intervention.

METHODS: We conducted a retrospective longitudinal cohort study of 35,206 adults who underwent repeated annual health checkups with H. pylori testing at a single center from 2016 to 2024. Group-Based Trajectory Modeling (GBTM) was used to identify risk subgroups. Multivariable logistic regression was used to identify predictors of membership in the high-risk trajectory, and alcohol consumption was assessed as an effect modifier. Five machine learning models, including Light Gradient Boosting Machine (LightGBM), Extreme Gradient Boosting, and logistic regression, were trained using a 7:3 training/validation split. Temporal validation (training on 2016-2020 data, validation on 2021-2024 data) was used to assess generalizability. SHapley Additive exPlanations (SHAP) were used to interpret model predictions. The prediction tool was deployed as an R Shiny web application.

RESULTS: GBTM identified a high-risk group (14.63%) and a low-risk group (85.37%). Protective factors included female sex (OR = 0.042, 95% CI: 0.039-0.046) and unmarried status (OR = 0.092, 95% CI: 0.085-0.099); risk factors included obesity (OR = 1.138, 95% CI: 1.070-1.210), blue-collar occupation (OR = 1.557, 95% CI: 1.454-1.666), and alcohol consumption (OR = 1.277, 95% CI: 1.165-1.401). In subgroup analysis, alcohol consumption interacted with all significant factors (all p < 0.001), with the strongest interaction observed for being married (OR = 8.622, 95% CI: 7.872-9.437). LightGBM achieved AUCs of 0.851 (training), 0.843 (validation), 0.863 (temporal training), and 0.831 (temporal validation). SHAP ranked marital status and sex as the top predictors. The tool is available at: https://prediction-model-for-hp.shinyapps.io/hp_shinyapp-/.

CONCLUSION: We developed an online, interpretable, internally and temporally validated risk prediction tool to support precision screening for H. pylori infection.
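The two validation schemes named in the Methods (a random 7:3 split, and a temporal split that trains on 2016-2020 and validates on 2021-2024) and the AUC metric reported in the Results can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record layout, function names, and random seed are hypothetical, no model is fitted, and the scores are assumed to come from any already-fitted classifier such as LightGBM.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout (an assumption, not from the study):
# {"year": checkup year, "score": predicted risk, "label": 1 = high-risk, 0 = low-risk}
Record = Dict[str, float]


def random_split(records: Sequence[Record], train_frac: float = 0.7,
                 seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Shuffle and split records into training/validation sets (7:3 by default)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def temporal_split(records: Sequence[Record],
                   last_train_year: int = 2020) -> Tuple[List[Record], List[Record]]:
    """Train on years up to last_train_year, validate on all later years."""
    train = [r for r in records if r["year"] <= last_train_year]
    valid = [r for r in records if r["year"] > last_train_year]
    return train, valid


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as 0.5. This pairwise form is O(n_pos * n_neg), which is fine
    for illustration but slow for a cohort of tens of thousands.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(records: Sequence[Record]) -> float:
    """Compute AUC over a set of already-scored records."""
    return auc([int(r["label"]) for r in records], [r["score"] for r in records])
```

Calling `evaluate` separately on the random validation set and the temporal validation set mirrors the comparison in the Results: a temporal AUC close to the random-split AUC suggests that performance holds up on later checkup years.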