Abstract
INTRODUCTION: Interstitial lung disease (ILD) comprises various disorders marked by pulmonary inflammation and fibrosis. Early diagnosis and risk prediction are vital for improving patient outcomes. METHODS: We retrospectively analyzed 603 patients who had visited the Hubin Campus between January 2022 and April 2025, employing a 1:2 case-control design with age- and gender-matched groups. We collected clinical information, complete blood count data, lipid metabolism indicators, and various derived indices. CONCLUSION: Six key markers were identified through three machine learning algorithms (LassoCV, SVMREFCV, and Boruta): neutrophil percentage, lymphocyte percentage, monocyte percentage, hemoglobin, and two novel ratios - neutrophil-to-HDL-C and lymphocyte-to-HDL-C. The random forest model outperformed seven other machine learning approaches, with AUC values of 0.868 (validation set), 0.885 (test set), and 0.849 (external cohort), demonstrating consistent predictive accuracy. DISCUSSION: Based on these findings, we developed an online prediction tool to assist primary care clinicians in assessing the risk of ILD in suspected cases. Our results indicate that the random forest model exhibits high accuracy and clinical utility for early ILD prediction, providing a novel tool and methodology for early diagnosis and intervention. Future studies will focus on further optimizing the model and validating it in larger multicenter cohorts.