Abstract
BACKGROUND: Accurate risk stratification and early detection of colorectal cancer (CRC) are critical for improving patient outcomes and optimizing the use of colonoscopy; however, the diagnostic performance of existing biomarkers remains suboptimal. This study aimed to develop and evaluate machine learning (ML)-based models to facilitate individualized risk assessment and clinical decision-making for colorectal lesions. METHODS: A total of 1,714 participants who underwent colonoscopy at the Department of Gastrointestinal Surgery, Ruijin Hospital, Shanghai Jiaotong University School of Medicine were included. Participants were categorized into controls with normal colonoscopy findings (n = 677) and a high-risk colorectal disease group (n = 1,037), with the latter further subdivided into adenoma (n = 376) and CRC (n = 661) subsets. Demographic characteristics and relevant laboratory data were collected. Variables significantly associated with high-risk colorectal disease or CRC were identified using univariable and multivariable logistic regression analyses and incorporated into two independent nomogram-based ML models. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), calibration curves, and decision curve analysis (DCA). SHapley Additive exPlanations (SHAP) analysis was performed to determine each feature's contribution. RESULTS: Gender, age, hemoglobin (Hb), C-reactive protein (CRP), carcinoembryonic antigen (CEA), and Septin9 methylation were independent predictors of high-risk colorectal disease, with the latter five also independently associated with CRC (p < 0.001). Two ML models were developed: one predicting the probability of high-risk colorectal disease and the other distinguishing CRC from adenoma. Both models demonstrated strong discriminative ability with high AUCs and favorable net clinical benefit on DCA. Calibration curves showed close concordance between predicted risks and observed outcomes. SHAP analysis highlighted Septin9 methylation as the most influential variable in the prediction model. Threshold values of 37.3 and 67.1 points were identified as the optimal cutoffs for identifying high-risk colorectal disease and for distinguishing CRC from adenoma, respectively. CONCLUSIONS: We developed and validated two ML-based models integrating Septin9 methylation with routine serum biomarkers for early detection and differentiation of CRC. These models show potential as non-invasive clinical decision-support tools to facilitate individualized risk assessment and support clinical management in patients undergoing evaluation for colorectal neoplasia.
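The net clinical benefit reported by DCA follows a standard definition: at a threshold probability p_t, net benefit = TP/n − (FP/n) × p_t/(1 − p_t), compared against a "treat all" reference strategy. The sketch below illustrates that calculation only and is not the authors' implementation; the function names and the toy labels and probabilities are hypothetical and are not drawn from the study cohort.

```python
# Illustrative sketch of the net-benefit calculation behind decision curve
# analysis (DCA). All names and data here are hypothetical assumptions.

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on everyone whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    # True positives: flagged by the model and actually diseased.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    # False positives: flagged by the model but not diseased.
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy that flags every participant."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Synthetic toy data (NOT study data): 1 = lesion present, 0 = normal.
TOY_LABELS = [1, 1, 1, 0, 0, 0, 1, 0]
TOY_PROBS = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

if __name__ == "__main__":
    for pt in (0.2, 0.5, 0.8):
        model_nb = net_benefit(TOY_LABELS, TOY_PROBS, pt)
        all_nb = net_benefit_treat_all(TOY_LABELS, pt)
        print(f"p_t={pt:.1f}  model={model_nb:.3f}  treat-all={all_nb:.3f}")
```

A model shows favorable net benefit over a range of thresholds when its curve lies above both the treat-all curve and the treat-none line (net benefit of zero) across that range.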