Abstract
Colorectal cancer (CRC) is a major cause of cancer-related deaths, but early detection at the adenoma stage markedly improves outcomes. Existing tools such as colonoscopy and fecal immunochemical testing (FIT) are invasive or insensitive to early lesions. To develop a non-invasive screening strategy, we analyzed five publicly available 16S rRNA sequencing datasets from North America and East Asia. Using Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC) and chi-square testing, we identified 109 discriminatory microbial taxa and trained random forest (RF) classification models to distinguish healthy controls, adenomas, and CRC. The models performed well in internal validation (AUC = 0.90, 95% CI: 0.869-0.931) and external validation (AUC = 0.82), indicating cross-population generalizability. We further developed a microbial risk score (MRS), inspired by polygenic risk score (PRS) methodology, which was significantly elevated in CRC across cohorts. Enrichment of CRC-associated pathogens such as Fusobacterium nucleatum and Porphyromonas gingivalis supports the biological relevance of the findings. These results demonstrate the potential of gut microbiome signatures combined with machine learning as a scalable, non-invasive approach for early detection of CRC and adenomas.
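To make the PRS analogy concrete, the following is a minimal sketch of how a microbial risk score could be computed as a weighted sum over taxa, in the same way a polygenic risk score sums weighted risk alleles. The exact MRS formula is not given in this abstract, so everything here is an assumption for illustration: the function name `microbial_risk_score`, the log transform with a pseudocount, the per-taxon weights, and the example counts are all hypothetical and are not the study's values.

```python
import math


def microbial_risk_score(abundances, weights, pseudocount=1e-6):
    """PRS-style score: weighted sum of log relative abundances.

    abundances: dict mapping taxon name -> read count for one sample.
    weights:    dict mapping taxon name -> effect weight (hypothetical here;
                positive = CRC-enriched, negative = control-enriched).
    """
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("sample has no reads")
    score = 0.0
    for taxon, weight in weights.items():
        # Convert counts to relative abundance; absent taxa count as zero.
        rel = abundances.get(taxon, 0.0) / total
        # The pseudocount keeps the log defined when a taxon is absent.
        score += weight * math.log(rel + pseudocount)
    return score


# Hypothetical weights, chosen only to illustrate the direction of effect.
weights = {
    "Fusobacterium nucleatum": 1.2,
    "Porphyromonas gingivalis": 0.8,
    "Faecalibacterium prausnitzii": -0.5,
}

# Made-up read counts for two samples.
control_sample = {
    "Fusobacterium nucleatum": 1,
    "Porphyromonas gingivalis": 0,
    "Faecalibacterium prausnitzii": 120,
    "Other": 879,
}
crc_sample = {
    "Fusobacterium nucleatum": 40,
    "Porphyromonas gingivalis": 15,
    "Faecalibacterium prausnitzii": 30,
    "Other": 915,
}

control_score = microbial_risk_score(control_sample, weights)
crc_score = microbial_risk_score(crc_sample, weights)
print(f"control MRS: {control_score:.2f}, CRC MRS: {crc_score:.2f}")
```

With these illustrative inputs, the sample enriched in the CRC-associated taxa receives the higher score. In practice the weights would come from the differential abundance analysis rather than being set by hand.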