Abstract
Auto-segmentation quality and accuracy influence the clinical usefulness of auto-segmentations. However, the segmentation metrics in wide current use (e.g., the Dice Coefficient (DC) and the Hausdorff Distance (HD)) cannot effectively express the manual mending effort required when auto-segmentation results are used in clinical practice. In this article, we explore ways of evaluating auto-segmentations with clinical efficiency in mind. The time experts require to correct auto-segmentations is recorded as the ground-truth mending effort. Extending our previous work, we study five explicitly defined metrics in detail for their ability to predict mending effort. More importantly, we explore the use of deep learning networks to provide an implicit metric that predicts mending effort from auto-segmentation masks and the original images alone. A three-institution evaluation is conducted on seven different anatomic organs in the setting of auto-contouring for radiation therapy planning. Among the five explicit metrics, one form of the proposed Mendability Index (MIhd) best indicates the mending effort for sparse objects, with 6.2-14.4% error, while one form of HD (sHD) performs best when assessing large non-sparse objects. Notably, whereas the explicit metrics all require ground-truth segmentations to estimate mending effort, the implicit models obtained via deep learning predict mending effort effectively (with 2.9-12.9% error) directly from the given image and the auto-segmentation, without any ground-truth segmentation. We conclude that once effort-predicting deep models are created, it is feasible to assess the clinical usability of new segmentation models, going beyond the bench-level technical evaluation commonly done via explicit metrics.
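For reference, the two conventional explicit metrics named above (DC and symmetric HD) can be computed over binary masks as in the following minimal Python sketch; the function names and the use of SciPy's directed_hausdorff are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice Coefficient (DC) between two binary masks; 1.0 = perfect overlap."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    total = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / total if total > 0 else 1.0

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff Distance (HD) between the foreground voxel sets."""
    pred_pts = np.argwhere(pred.astype(bool))
    gt_pts = np.argwhere(gt.astype(bool))
    return max(directed_hausdorff(pred_pts, gt_pts)[0],
               directed_hausdorff(gt_pts, pred_pts)[0])
```

Both functions require the ground-truth mask gt, which is precisely the dependency that the implicit deep-learning metric avoids.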
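As a rough illustration of the implicit approach, the sketch below shows a generic two-channel convolutional regressor that maps an image slice plus its auto-segmentation mask to a scalar effort estimate; the architecture, layer sizes, and PyTorch framing are assumptions for illustration and are not the networks studied in the paper.

```python
import torch
import torch.nn as nn

class EffortPredictor(nn.Module):
    """Hypothetical sketch: regress mending effort (e.g., minutes) from a
    2-channel input (image slice + auto-segmentation mask). Not the paper's
    actual architecture, which is not specified in the abstract."""
    def __init__(self) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling to (N, 32, 1, 1)
        )
        self.head = nn.Linear(32, 1)  # scalar effort estimate

    def forward(self, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([image, mask], dim=1)  # stack into (N, 2, H, W)
        return self.head(self.features(x).flatten(1))
```

For example, `EffortPredictor()(torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64))` returns a (1, 1) tensor holding the predicted effort; note that no ground-truth segmentation enters the forward pass.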