Survival prediction from imbalanced colorectal cancer dataset using hybrid sampling methods and tree-based classifiers

利用混合抽样方法和基于树的分类器对不平衡的结直肠癌数据集进行生存预测

阅读:1

Abstract

Colorectal cancer is a high mortality cancer, with a mortality rate of 64.5% for all stages combined. Clinical data analysis plays a crucial role in predicting the survival of colorectal cancer patients, enabling clinicians to make informed treatment decisions. However, utilizing clinical data can be challenging, especially when dealing with imbalanced outcomes, an aspect often overlooked in this context. This paper focuses on developing algorithms to predict 1-, 3-, and 5-year survival of colorectal cancer patients using clinical datasets, with particular emphasis on the highly imbalanced 1-year survival prediction task. We utilized a colorectal cancer dataset from the Surveillance, Epidemiology, and End Results (SEER) database, which exhibits high imbalance in the 1-year (1:10) survival analysis and an imbalance in the 3-year (2:10) analysis, achieving balance in the 5-year analysis. The pre-processing step consists of removing records with missing values and merging categories with less than 2% share for each categorical feature to limit the number of classes of each component. Edited Nearest Neighbor, Repeated Edited Nearest Neighbor (RENN), Synthetic Minority Over-sampling Technique (SMOTE), and pipelines of SMOTE and RENN approaches were used for balancing the data with tree-based classifiers, including Decision Tree, Random Forest, Extra Tree, eXtreme Gradient Boosting, and Light Gradient Boosting Machine (LGBM). The performance evaluation utilizes a 5-fold cross-validation approach. In the case of 1-year, our proposed method with LGBM significantly outperforms other sampling methods with the sensitivity of 72.30%. For the task of 3-year survival, the combination of RENN and LGBM achieves a sensitivity of 80.81%, indicating that our proposed method works best for highly imbalanced datasets. Additionally, when predicting 5-year survival, the sensitivity reaches 63.03% using LGBM. Our proposed method significantly improves mortality prediction for the minority class of colorectal cancer patients. RENN followed by SMOTE yields better sensitivity in the classifiers, with LGBM as the predictor performing best for 1- and 3-year survival. In the 5-year task, LGBM outperforms other models in terms of F1-score.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。