Machine Learning Algorithms to Predict Breast Cancer Recurrence Using Structured and Unstructured Sources from Electronic Health Records

Abstract

Recurrence is a critical aspect of breast cancer (BC) that is inexorably tied to mortality. Reuse of healthcare data through machine learning (ML) algorithms offers great opportunities to improve the stratification of patients at risk of cancer recurrence. We hypothesized that combining features from structured and unstructured sources would provide better prediction results for 5-year cancer recurrence than either source alone. We collected and preprocessed clinical data from a cohort of BC patients, resulting in 823 valid subjects for analysis. We derived three sets of features: structured information, features from free text, and a combination of both. We evaluated the performance of five ML algorithms to predict 5-year cancer recurrence and selected the best-performing one to test our hypothesis. The XGB (eXtreme Gradient Boosting) model yielded the best performance among the five evaluated algorithms, with precision = 0.900, recall = 0.907, F1-score = 0.897, and area under the receiver operating characteristic curve (AUROC) = 0.807. Contrary to our hypothesis, the best prediction results were achieved with the structured dataset, followed by the unstructured dataset, while the combined dataset achieved the poorest performance. ML algorithms for BC recurrence prediction are valuable tools to improve patient risk stratification, help with post-cancer monitoring, and plan more effective follow-up. Structured data provides the best results when fed to ML algorithms. However, an approach based on natural language processing offers comparable results while potentially requiring less mapping effort.
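The four metrics reported above can be illustrated with a minimal sketch. The reported F1-score (0.897) lies below both precision (0.900) and recall (0.907), which cannot happen for a single class, so the figures are presumably averaged across classes; the support-weighted averaging used here is an assumption, as the abstract does not state the averaging scheme. The function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and are not taken from the study.

```python
from collections import Counter


def per_class_prf(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating `label` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_prf(y_true, y_pred):
    """Support-weighted average of the per-class metrics (assumed scheme)."""
    support = Counter(y_true)
    n = len(y_true)
    totals = [0.0, 0.0, 0.0]
    for label, count in support.items():
        for i, value in enumerate(per_class_prf(y_true, y_pred, label)):
            totals[i] += value * count / n
    return tuple(totals)


def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: 1 = recurrence within 5 years.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
scores = [0.1, 0.2, 0.15, 0.3, 0.6, 0.05, 0.8, 0.4, 0.25, 0.35]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold

precision, recall, f1 = weighted_prf(y_true, y_pred)
area = auroc(y_true, scores)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auroc={area:.4f}")
```

Because AUROC is computed from the ranking of scores rather than from thresholded predictions, it can differ markedly from the other three metrics on imbalanced data, which may be one reason the reported AUROC (0.807) is lower than the reported precision and recall.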