Comparison of Artificial Intelligence Techniques to Evaluate Performance of a Classifier for Automatic Grading of Prostate Cancer From Digitized Histopathologic Images

比较人工智能技术以评估分类器在基于数字化组织病理图像的前列腺癌自动分级中的性能

阅读:2

Abstract

IMPORTANCE: Proper evaluation of the performance of artificial intelligence techniques in the analysis of digitized medical images is paramount for the adoption of such techniques by the medical community and regulatory agencies. OBJECTIVES: To compare several cross-validation (CV) approaches to evaluate the performance of a classifier for automatic grading of prostate cancer in digitized histopathologic images and compare the performance of the classifier when trained using data from 1 expert and multiple experts. DESIGN, SETTING, AND PARTICIPANTS: This quality improvement study used tissue microarray data (333 cores) from 231 patients who underwent radical prostatectomy at the Vancouver General Hospital between June 27, 1997, and June 7, 2011. Digitized images of tissue cores were annotated by 6 pathologists for 4 classes (benign and Gleason grades 3, 4, and 5) between December 12, 2016, and October 5, 2017. Patches of 192 µm2 were extracted from these images. There was no overlap between patches. A deep learning classifier based on convolutional neural networks was trained to predict a class label from among the 4 classes (benign and Gleason grades 3, 4, and 5) for each image patch. The classification performance was evaluated in leave-patches-out CV, leave-cores-out CV, and leave-patients-out 20-fold CV. The analysis was performed between November 15, 2018, and January 1, 2019. MAIN OUTCOMES AND MEASURES: The classifier performance was evaluated by its accuracy, sensitivity, and specificity in detection of cancer (benign vs cancer) and in low-grade vs high-grade differentiation (Gleason grade 3 vs grades 4-5). The statistical significance analysis was performed using the McNemar test. The agreement level between pathologists and the classifier was quantified using a quadratic-weighted κ statistic. RESULTS: On 333 tissue microarray cores from 231 participants with prostate cancer (mean [SD] age, 63.2 [6.3] years), 20-fold leave-patches-out CV resulted in mean (SD) accuracy of 97.8% (1.2%), sensitivity of 98.5% (1.0%), and specificity of 97.5% (1.2%) for classifying benign patches vs cancerous patches. By contrast, 20-fold leave-patients-out CV resulted in mean (SD) accuracy of 85.8% (4.3%), sensitivity of 86.3% (4.1%), and specificity of 85.5% (7.2%). Similarly, 20-fold leave-cores-out CV resulted in mean (SD) accuracy of 86.7% (3.7%), sensitivity of 87.2% (4.0%), and specificity of 87.7% (5.5%). Results of McNemar tests showed that the leave-patches-out CV accuracy, sensitivity, and specificity were significantly higher than those for both leave-patients-out CV and leave-cores-out CV. Similar results were observed for classifying low-grade cancer vs high-grade cancer. When trained on a single expert, the overall agreement in grading between pathologists and the classifier ranged from 0.38 to 0.58; when trained using the majority vote among all experts, it was 0.60. CONCLUSIONS AND RELEVANCE: Results of this study suggest that in prostate cancer classification from histopathologic images, patch-wise CV and single-expert training and evaluation may lead to a biased estimation of classifier's performance. To allow reproducibility and facilitate comparison between automatic classification methods, studies in the field should evaluate their performance using patient-based CV and multiexpert data. Some of these conclusions may be generalizable to other histopathologic applications and to other applications of machine learning in medicine.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。