Abstract
BACKGROUND: Early detection of gastric cancer is crucial for improving prognosis, yet current diagnostic biomarkers remain insufficient for identifying early gastric cancer (EGC, stage I-II). While previous studies have proposed molecular markers, few have systematically validated them across multiple cohorts, and their diagnostic accuracy and immune relevance remain unclear. This study aimed to identify and validate potential early diagnostic biomarkers for EGC using an integrated bioinformatic and machine learning framework.
METHODS: Transcriptome data from four Gene Expression Omnibus (GEO) datasets comprising 434 tumor and 100 normal samples were integrated. Only stage I-II gastric cancer samples, defined by pathological criteria according to the American Joint Committee on Cancer Tumor-Node-Metastasis (AJCC TNM) staging system, were included; advanced-stage cases were excluded to ensure a homogeneous early-stage cohort. Normal gastric tissues were obtained from non-tumor regions of gastrectomy specimens and served as controls. Differentially expressed genes (DEGs) were identified using the limma algorithm. Three machine-learning methods [i.e., least absolute shrinkage and selection operator (LASSO) regression, support vector machine recursive feature elimination (SVM-RFE), and random forest (RF)] were applied to screen feature genes. A diagnostic support vector machine (SVM) model was constructed based on the overlapping feature genes. External validation was conducted using The Cancer Genome Atlas - Stomach Adenocarcinoma (TCGA-STAD) and Human Protein Atlas (HPA) datasets. Functional enrichment and CIBERSORT immune infiltration analyses were performed to explore potential mechanisms.
RESULTS: A total of 101 DEGs were identified, and four feature genes (MCM7, ADAM17, DPT, and KIT) were selected by all three machine-learning algorithms. The SVM diagnostic model showed excellent performance [area under the curve (AUC) = 0.998, sensitivity = 96.5%, specificity = 95.2%]. Among the four feature genes, MCM7 and ADAM17 were significantly overexpressed in tumor tissues and associated with a poor prognosis (P < 0.05, AUC > 0.85). The SHapley Additive exPlanations (SHAP) analysis revealed that these two genes contributed most to the model's predictions. The functional analysis showed that MCM7 was enriched in DNA replication and cell cycle pathways, while ADAM17 was involved in inflammatory and tumor-related signaling. The immune infiltration analysis indicated that both genes were significantly associated with various immune cell subpopulations, suggesting a potential role in modulating the tumor immune microenvironment.
CONCLUSIONS: This study identified MCM7 and ADAM17 as potential biomarkers for EGC through integrated multi-cohort bioinformatic analysis. Further experimental and clinical studies are required to validate their diagnostic specificity and applicability in real-world settings.
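As an illustration of two steps summarized in the abstract, the sketch below shows how genes selected by all three feature-selection methods can be obtained as a set intersection, and how sensitivity, specificity, and AUC are computed from labels and classifier scores. It is a minimal sketch under stated assumptions, not the study's pipeline: the function names, the placeholder genes (GENE_X, GENE_Y, GENE_Z), the labels, the scores, and the 0.5 decision threshold are all hypothetical and do not come from the study's data.

```python
from __future__ import annotations

from typing import Sequence


def consensus_features(*selected: set[str]) -> set[str]:
    """Return the genes chosen by every selector (set intersection)."""
    return set.intersection(*selected)


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Label 1 denotes tumor, label 0 denotes normal tissue.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen tumor sample scores
    higher than a randomly chosen normal sample (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical selector outputs; GENE_X/Y/Z are made-up placeholders.
lasso = {"MCM7", "ADAM17", "DPT", "KIT", "GENE_X"}
svm_rfe = {"MCM7", "ADAM17", "DPT", "KIT", "GENE_Y"}
random_forest = {"MCM7", "ADAM17", "DPT", "KIT", "GENE_X", "GENE_Z"}
shared = consensus_features(lasso, svm_rfe, random_forest)

# Hypothetical toy labels and classifier scores (not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
toy_auc = auc(y_true, scores)
print(sorted(shared), sens, spec, toy_auc)
```

Sensitivity and specificity depend on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why a model can report a near-perfect AUC alongside somewhat lower sensitivity and specificity.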