Abstract
BACKGROUND: Primary bronchogenic lung cancer (PBLC) poses a serious threat to human health, and its high mortality rate is largely attributed to the difficulty of reliable early detection. Early identification of PBLC is therefore essential for subsequent treatment. Machine learning (ML) models that use readily accessible data, such as routine blood tests and tumor markers, are a promising approach to improving early screening rates. This study aims to construct an ML prediction model based on the combined analysis of routine blood tests and tumor markers and, through systematic technology integration and development, to establish an intelligent early screening platform for PBLC.
METHODS: This study used samples from a PBLC group and a healthy control (HC) group collected from 2018 to 2023 (n=1,054). Data from The Affiliated Dazu's Hospital of Chongqing Medical University were used for model construction and internal validation (n=767), and data from the Chongqing Dazu District People's Hospital Medical Community were used for external validation (n=287). Feature selection with the least absolute shrinkage and selection operator (LASSO) algorithm retained 14 features, comprising routine blood test indicators and tumor markers. Subsequently, 10 ML models were trained on these features and compared using eight evaluation metrics, including accuracy, sensitivity, specificity, and area under the curve (AUC), to develop an early PBLC prediction tool.
RESULTS: Among the ML models evaluated for early prediction of PBLC, the eXtreme Gradient Boosting (XGBoost) model achieved an AUC above 0.980 in both internal and external validation. Basophils, lymphocytes, and carcinoembryonic antigen (CEA) ranked highest in feature importance, suggesting that indicators from routine blood tests and tumor markers jointly influence predictive performance and underscoring the practicality of integrating these two types of indicators in model development.
CONCLUSIONS: The ML models developed have substantial application value in the early screening of PBLC and can support the prompt detection and treatment of patients with PBLC.
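The evaluation metrics named in the METHODS can be illustrated with a minimal sketch. The Python below is not the study's code. It assumes binary labels (1 = PBLC, 0 = HC), model-predicted probabilities, and a hypothetical decision threshold of 0.5, and it computes accuracy, sensitivity, specificity, and AUC (using the pairwise rank formulation of the area under the ROC curve) on made-up toy scores. All function names are illustrative.

```python
from typing import Dict, Sequence


def threshold_metrics(y_true: Sequence[int],
                      y_score: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity at an assumed threshold."""
    # Binarize the predicted probabilities (threshold is an assumption).
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("Both classes must be present in y_true.")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This O(n_pos * n_neg) form is meant for clarity, not for large datasets.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present in y_true.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data for illustration only; these are not values from the study.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
    print(threshold_metrics(labels, scores))
    print(roc_auc(labels, scores))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason a screening study would report them together.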