Benchmark Pathology Report Text Corpus with Cancer Type Classification

Abstract

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, are more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest an urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in AI-based natural language processing (NLP) techniques allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to publicly available report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. We perform a proof-of-principle cancer-type classification across 32 tissues, achieving an average area under the receiver operating characteristic curve (AU-ROC) of 0.992. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
