Abstract
BACKGROUND: Patients with acute exacerbations of chronic obstructive pulmonary disease (AECOPD) face a high risk of readmission after discharge, and accurate identification of high-risk individuals is crucial for optimising clinical management. However, clinical prediction models are frequently limited by small sample sizes, missing data, and class imbalance, which compromise their generalisability and clinical utility.
METHODS: This retrospective study included patients hospitalised for AECOPD for the first time at a tertiary hospital between December 2018 and July 2023. The primary outcome was unplanned all-cause readmission within one year of discharge. Missing data were handled using Multiple Imputation by Chained Equations (MICE). To enhance model robustness, a conditional tabular generative adversarial network (CTGAN) was applied to the 80% of the derivation cohort used for training, for data augmentation (generating 150% of the original sample size). Logistic regression, decision tree, random forest, XGBoost, and LightGBM models were built on the augmented data. Hyperparameters were optimised using grid search with 5-fold cross-validation, and performance was evaluated on the reserved 20% test set. The predictions of the optimal model were interpreted using the SHapley Additive exPlanations (SHAP) framework.
RESULTS: A total of 1,960 patients were included, of whom 783 (39.9%) were readmitted. Data augmentation mitigated overfitting and improved model generalisation on the test set. The XGBoost model performed best, achieving an AUC of 0.696 on the test set alongside favourable calibration and clinical net benefit.
SHAP analysis identified eosinophil count (EOS; negatively associated with predicted risk), ICU admission (positively associated), red cell distribution width (RDW-SD; positively associated), Prognostic Nutritional Index (PNI; negatively associated), and platelet-to-lymphocyte ratio (PLR; positively associated) as the features contributing most to model predictions.
CONCLUSION: This study developed and validated a readmission risk prediction model for AECOPD patients based on routine clinical variables. Integrating CTGAN-based data augmentation improved model performance. The optimal XGBoost model showed moderate discriminative capability (AUC 0.696), and SHAP analysis indicated that its predictive logic is interpretable and consistent with clinical pathophysiological mechanisms. The model has potential for clinical translation, helping to identify patients at high risk of readmission and enabling early intervention.
CLINICAL TRIAL: Not applicable.
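The evaluation design described in METHODS (hold out 20%, augment only the training portion, score on the untouched test set) can be sketched as follows. This is a minimal, standard-library illustration and not the study's code: the function names `split_indices`, `augment`, and `roc_auc` are hypothetical, resampling with replacement is used purely as a placeholder for CTGAN, and "150% of the original sample size" is read here as synthetic rows equal to 1.5 times the training portion, which is an assumption.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and reserve a held-out test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def augment(train_rows, ratio=1.5, seed=0):
    """Placeholder for CTGAN (assumption): resample training rows with
    replacement. A real generator would synthesise new records, but the
    key point is the same: only training rows are ever used as input."""
    rng = random.Random(seed)
    n_synthetic = round(len(train_rows) * ratio)
    synthetic = [rng.choice(train_rows) for _ in range(n_synthetic)]
    return train_rows + synthetic


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case is
    scored above a randomly chosen negative case (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Cohort size taken from RESULTS; everything downstream is illustrative.
train_idx, test_idx = split_indices(1960)
augmented_train = augment(train_idx)
```

Splitting before augmenting keeps synthetic records from carrying information about test patients into training, so the test-set AUC remains an estimate of performance on unseen real patients.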