Abstract
OBJECTIVE: Recurrent pregnancy loss (RPL), defined as two or more consecutive spontaneous miscarriages before 20 weeks of gestation, affects 2-5% of reproductive-age women globally, and current clinical predictors lack sufficient accuracy. This study aimed to construct a machine learning (ML) model for RPL prediction that integrates serum IL-33, C-reactive protein (CRP), and lymphocyte subset counts, and to validate its performance in a retrospective cohort.

METHODS: A total of 340 reproductive-age women from XiDian Group Hospital and Xi'an Traditional Chinese Medicine Hospital (January 2020 to December 2024) were included. Baseline clinical characteristics, serum IL-33 and CRP levels, and lymphocyte subset counts were collected as predictors, with RPL as the primary outcome. The dataset was split into a training set (70%) and a validation set (30%). Logistic regression, random forest, and XGBoost models were trained, with hyperparameters optimized by grid search. Model performance was evaluated by area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).

RESULTS: Of the 340 participants, 85 (25.0%) had RPL and 255 (75.0%) did not. Compared with the non-RPL group, the RPL group had significantly lower IL-33 levels and CD4+/CD8+ ratios, and significantly higher CRP levels and NK cell proportions (all p < 0.001). XGBoost outperformed the other two models, with an AUC of 0.89 (95% CI: 0.82-0.96) in the training set and 0.85 (95% CI: 0.76-0.94) in the validation set. In the validation set, its accuracy, sensitivity, specificity, PPV, and NPV were 88.1%, 82.4%, 88.7%, 28.6%, and 98.7%, respectively.

CONCLUSION: The ML model integrating IL-33, CRP, and lymphocyte subset counts shows good discriminatory ability for RPL and may serve as a preliminary reference for identifying high-risk women in clinical practice.
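The evaluation metrics named above all derive from a model's confusion matrix and predicted scores. The following is a minimal sketch of those standard definitions, not the study's own code. The names `ConfusionCounts`, `classification_metrics`, and `auc_from_scores` are hypothetical, and the counts and scores in the example are illustrative values, not data from this cohort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # RPL cases predicted as RPL
    fp: int  # non-RPL cases predicted as RPL
    tn: int  # non-RPL cases predicted as non-RPL
    fn: int  # RPL cases predicted as non-RPL


def classification_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions; assumes every denominator is non-zero."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


def auc_from_scores(labels: list, scores: list) -> float:
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative (made-up) counts and scores, not values from the study.
example_counts = ConfusionCounts(tp=20, fp=10, tn=65, fn=5)
example_metrics = classification_metrics(example_counts)
example_auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Sensitivity and specificity are properties of the classifier at a given threshold, whereas PPV and NPV also depend on how common the outcome is in the evaluated sample, which is why all five are usually reported together with the AUC.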