Abstract
BACKGROUND: The development of gastric cancer is a complex pathological process that leads to multiple abnormalities in clinical laboratory indicators. Machine learning techniques can handle very large numbers of variables and may enable more accurate disease prediction and diagnosis. METHODS: Clinical data were collected from gastric cancer patients who underwent surgery at a single center between 2016 and 2023. Five machine learning algorithms (extreme gradient boosting, XGBoost; random forest, RF; support vector machine-recursive feature elimination, SVM-RFE; light gradient boosting machine, LGBM; and recursive partitioning, rpart) were used to develop diagnostic models. Of the data, 60% were randomly selected to train the models, and the remaining 40% were used for testing. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC), F1 score, sensitivity, and specificity. RESULTS: The XGBoost algorithm showed the best performance in gastric cancer diagnosis, with a significantly higher area under the curve (AUC) than the other models (AUC = 0.9909 when combining blood indicators and pathological parameters). Glutathione reductase (GR), carbohydrate antigen 724 (CA724), erythrocytes (RBC), carbohydrate antigen 242 (CA242), and albumin (ALB) contributed the most to the diagnosis. Tumor size was an independent risk factor for early gastric cancer. CONCLUSION: Machine learning models combining blood indicators and pathological parameters could predict gastric cancer risk more accurately. The XGBoost model had the best diagnostic performance. The study provides confirmatory data supporting the preclinical implementation of the model.
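The evaluation protocol described in METHODS (a random 60/40 split, then AUROC, F1 score, sensitivity, and specificity) can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names `split_60_40`, `confusion_metrics`, and `auroc`, the fixed random seed, and the encoding of labels as 1 (gastric cancer) and 0 (control) are all assumptions made for illustration.

```python
import random


def split_60_40(rows, seed=0):
    """Shuffle a copy of rows and return (train, test) in a 60/40 ratio.

    The seed is a hypothetical choice for reproducibility.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and F1 score for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


def auroc(y_true, scores):
    """AUROC as the probability that a random positive case outscores
    a random negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

In use, a model would be fitted on the training portion only, and its predicted probabilities on the held-out 40% would be passed to `auroc`, with thresholded predictions passed to `confusion_metrics`. The pairwise AUROC computation is quadratic in the number of cases, which is acceptable for a sketch but would usually be replaced by a rank-based or library implementation on larger cohorts.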