Abstract
BACKGROUND: The delay in early diagnosis of rheumatoid arthritis-associated interstitial lung disease (RA-ILD) underscores the importance of identifying at-risk patients before symptoms appear. Here, we developed and validated RA-ILD diagnostic prediction models using machine learning (ML) algorithms applied to routine clinical and laboratory data.
METHODS: The models were developed using retrospective data from a single-center cohort of 1,156 RA patients (RA-ILD = 400, RA-non-ILD = 756) and subsequently validated in the China Rheumatoid Arthritis Registry of Patients with Chinese Medicine (CERTAIN), which included 178 RA-ILD patients and 178 RA-non-ILD patients. Candidate variables for the predictive models were selected through multifactor regression analysis. Model performance was evaluated via receiver operating characteristic (ROC) curves and precision-recall (PR) curves. Significant features were identified using SHapley Additive exPlanations (SHAP).
RESULTS: The final model incorporated 13 predictors. Among the nine ML algorithms evaluated, the random forest (RF) and LightGBM (LGBM) algorithms performed robustly. In the derivation cohort, RF had an area under the ROC curve (AUC) of 0.773 and a mAP of 0.776, and LGBM had an AUC of 0.768 and a mAP of 0.787. In the external validation cohort, the RF model achieved an AUC of 0.713 and a mAP of 0.703, and the LGBM model had an AUC of 0.704 and a mAP of 0.750. The SHAP analysis indicated that the anti-cyclic citrullinated peptide (anti-CCP) titer was the largest contributor to the classification, followed by lactate dehydrogenase (LDH).
CONCLUSIONS: Our ML models, derived from routine clinical data, identify RA patients at high risk for ILD but require prospective validation in diverse cohorts (including radiologic subtyping) before clinical deployment. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12931-025-03416-1.
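ILLUSTRATIVE NOTE: The two summary metrics reported above can be made concrete with a minimal sketch. This is not the authors' pipeline. It assumes that "mAP" refers to average precision over the PR curve, and it uses invented toy labels and scores (1 = RA-ILD, 0 = RA-non-ILD). The function names `roc_auc` and `average_precision` are hypothetical helpers written for this example; the average-precision helper assumes no tied scores.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outranks a random
    negative (Mann-Whitney formulation); tied pairs count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision values measured at the rank
    of each positive case, with cases sorted by descending score.
    Assumption: scores contain no ties."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


if __name__ == "__main__":
    # Toy data invented for illustration only; not taken from the study.
    labels = [1, 0, 1, 1, 0, 0, 1, 0]
    risk_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
    print(f"AUC = {roc_auc(labels, risk_scores):.3f}")
    print(f"AP  = {average_precision(labels, risk_scores):.3f}")
```

Reporting both metrics is a reasonable choice when class balance differs between cohorts (400 of 1,156 in the derivation cohort versus 178 of 356 in the validation cohort), since average precision depends on prevalence while AUC does not.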