Abstract
BACKGROUND: The frequent exacerbator phenotype (FEP) of chronic obstructive pulmonary disease (COPD) substantially impairs quality of life and increases healthcare burden and mortality. This study aimed to develop an interpretable machine learning model for the early prediction of FEP to improve patient prognosis.
METHODS: Retrospective data were collected from the electronic health records (EHRs) of two hospitals for three independent cohorts of patients hospitalized for the first time due to acute exacerbation of COPD (AECOPD). Patients were categorized into frequent and nonfrequent exacerbation groups according to whether they experienced two or more exacerbations requiring hospitalization during a 12-month follow-up period. Feature variables were selected via univariate regression combined with the Boruta algorithm. Nine machine learning models were developed and validated via 5-fold cross-validation. The optimal prediction model was selected by integrating performance on the test set and on two independent external datasets with clinical requirements. Global and local interpretability of the model was achieved via SHapley Additive exPlanations (SHAP). Restricted cubic splines (RCSs) were used to analyze the dose‒response relationships between continuous variables and FEP. Finally, the model was deployed on the Shiny platform.
RESULTS: The study included a development cohort of 1,310 patients and two external validation cohorts of 418 and 200 patients. The datasets comprised 64 variables, including demographic information, blood indices, and comorbidities. After feature screening, 14 key variables were retained to construct the machine learning models. In model performance comparisons, the stacking ensemble model showed the best predictive efficacy, generalization ability, and control of the missed diagnosis rate. SHAP analysis ranked the contributions of the 14 key variables to the prediction of FEP. RCS analysis further revealed dose‒response relationships between nine key continuous variables and FEP. Finally, a web-based interactive prediction tool was developed (https://aipd.shinyapps.io/FEPCOPD/).
CONCLUSION: This study developed a robust stacking ensemble prediction model for FEP in patients with COPD using multidimensional, clinically accessible data. Through the interactive prediction tool deployed on the Shiny platform, primary care providers can conveniently use the model to facilitate early identification of patients with this high-risk phenotype.
CLINICAL TRIAL NUMBER: MR-43-23-040012.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-025-03281-4.
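
The grouping rule stated in METHODS (two or more exacerbations requiring hospitalization during the 12-month follow-up) can be illustrated with a minimal sketch. It is not the authors' code: the function name, the patient records, the choice to start follow-up at the index admission date, and the 365-day approximation of 12 months are all assumptions made for illustration.

```python
from datetime import date, timedelta

FOLLOW_UP_DAYS = 365  # assumption: 12 months approximated as 365 days
MIN_EVENTS = 2        # threshold stated in the abstract (two or more)


def is_frequent_exacerbator(index_date, readmission_dates):
    """Return True if at least MIN_EVENTS hospitalized exacerbations
    fall after the index admission and within the follow-up window.

    Assumption: follow-up starts at the index admission date and the
    window end is inclusive.
    """
    window_end = index_date + timedelta(days=FOLLOW_UP_DAYS)
    events = [d for d in readmission_dates if index_date < d <= window_end]
    return len(events) >= MIN_EVENTS


# Hypothetical patients: (index admission, later exacerbation admissions)
patients = {
    "A": (date(2022, 1, 10), [date(2022, 4, 2), date(2022, 11, 20)]),
    "B": (date(2022, 3, 5), [date(2022, 9, 14)]),
    "C": (date(2022, 6, 1), [date(2022, 8, 1), date(2023, 7, 15)]),
}

labels = {
    pid: is_frequent_exacerbator(index, later)
    for pid, (index, later) in patients.items()
}
print(labels)
```

In this sketch only patient A is labeled as a frequent exacerbator; patient C has two later admissions, but the second falls outside the assumed follow-up window.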