Development and validation of machine learning models for predicting synchronous lung metastasis in United States colorectal cancer patients: a SEER database analysis

Abstract

BACKGROUND: Colorectal cancer lung metastases (CRCLM) significantly influence treatment planning and prognosis in colorectal cancer (CRC). This study aimed to develop and validate machine learning-based models that predict synchronous CRCLM at diagnosis, in order to support individualized risk stratification for chest computed tomography (CT) utilization during baseline evaluation.

METHODS: Patients with primary CRC diagnosed between 2010 and 2015 were identified from the Surveillance, Epidemiology, and End Results (SEER) database using International Classification of Diseases for Oncology, 3rd edition (ICD-O-3) codes. Synchronous CRCLM was defined by the variable "CS Mets at DX-Lung". Predictors included age, sex, race, primary tumor site, grade, histologic type, tumor stage (T stage), node stage (N stage), tumor size, carcinoembryonic antigen (CEA) level, tumor deposits, and perineural invasion. The cohort was randomly divided into training (70%) and validation (30%) sets. eXtreme gradient boosting (XGB), random forest (RF), decision tree (DT), and logistic regression (LR) models were developed and evaluated mainly using receiver operating characteristic (ROC) curves, calibration curves, and decision curve analysis (DCA). Model interpretability was assessed using SHapley Additive exPlanation (SHAP).

RESULTS: Among 51,553 patients, 1,329 (2.6%) had synchronous CRCLM. In the validation cohort, the area under the ROC curve after hyperparameter optimization was 0.81 for XGB, 0.81 for RF, 0.79 for DT, and 0.73 for LR. Calibration curves indicated high consistency between predictions and observations. DCA revealed substantial clinical utility for all models. SHAP analysis highlighted CEA and N stage as the strongest predictors in the RF model, while CEA and T stage were most influential in the XGB model.

CONCLUSIONS: Machine learning models, particularly XGB and RF, demonstrated robust performance in predicting synchronous CRCLM. CEA was consistently identified as the most important risk factor, supporting personalized chest CT utilization during initial CRC staging.
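
The abstract reports decision curve analysis without saying how clinical utility is quantified. As a rough illustration only, and not the authors' code, the sketch below computes the standard net-benefit statistic that underlies a decision curve: true positives per patient minus false positives per patient, with false positives weighted by the odds of the chosen risk threshold. The function names and the ten toy patients are hypothetical and were made up for this example; they are not SEER data.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit of scanning every patient whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are penalized by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Reference strategy: scan everyone regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data (assumption): 1 = synchronous lung metastasis, 0 = none.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.10, 0.20, 0.05, 0.60, 0.30, 0.02, 0.01, 0.15, 0.08]

threshold = 0.25
model_nb = net_benefit(y_true, y_prob, threshold)
all_nb = net_benefit_treat_all(y_true, threshold)
print(f"model: {model_nb:.4f}, treat-all: {all_nb:.4f}, treat-none: 0.0000")
```

A decision curve repeats this calculation over a range of thresholds and plots the model against the treat-all and treat-none strategies; a model is considered useful at thresholds where its net benefit exceeds both.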
