Abstract
This paper presents and evaluates methods for fusing semantic image segmentation predictions, and highlights a novel hybrid approach that combines spatial frequency and edge features. Tool-labeled endoscopic images from sinus surgery served as the dataset, and two methods of surgical tool segmentation via morphological polar transform provided distinct predictions. The morphological transform acted as an input pre-processing step prior to segmentation with the U-Net architecture. Two separate predictions were available for each image, depending on the transformation center: one centered at the surgical tool-tip (TT) and one at the surgical tool vanishing point (VP). The goal of this work was to systematically generate a superior segmentation by fusing information from these two predictions. Improved segmentation performance in this domain is envisioned to enable vision-based force estimation in robot-assisted minimally invasive surgery (RMIS), where the lack of reliable force and tactile feedback remains a challenge. While deep-learning-based methods for segmentation fusion exist, they require extensive datasets and can reduce explainability. Thus, three approaches that rely solely on low-level features to fuse grayscale segmentation predictions were proposed in this work: (1) gradient estimation, (2) Laplacian pyramid, and (3) a modified spatial frequency method. The latter two demonstrated improved segmentation relative to the original predictions. This work also explores explainability, aiming to identify candidate prediction pairs for fusion via unsupervised clustering as well as a ResNet-18 model. Cursory investigations into the properties of the fused predictions provide insight into the potential use of the proposed methods in domains other than surgical tool segmentation.
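To make the idea of low-level fusion concrete, the following is a minimal sketch of the classic block-wise spatial frequency rule applied to two grayscale prediction maps. It is not the modified spatial frequency method proposed in this work, whose details are not given in the abstract. The function names, the `block_size` parameter, and the tie-breaking by averaging are all assumptions made for illustration only.

```python
import numpy as np


def spatial_frequency(block: np.ndarray) -> float:
    """Classic spatial frequency of a 2-D block: sqrt(RF^2 + CF^2)."""
    block = block.astype(np.float64)
    # Squared row frequency: mean squared horizontal first difference.
    rf_sq = np.mean(np.diff(block, axis=1) ** 2) if block.shape[1] > 1 else 0.0
    # Squared column frequency: mean squared vertical first difference.
    cf_sq = np.mean(np.diff(block, axis=0) ** 2) if block.shape[0] > 1 else 0.0
    return float(np.sqrt(rf_sq + cf_sq))


def fuse_by_spatial_frequency(pred_tt: np.ndarray,
                              pred_vp: np.ndarray,
                              block_size: int = 8) -> np.ndarray:
    """Fuse two grayscale predictions block by block.

    Hypothetical illustration: for each block, keep the prediction with
    the higher spatial frequency; average the two when they tie.
    """
    if pred_tt.shape != pred_vp.shape or pred_tt.ndim != 2:
        raise ValueError("predictions must be 2-D arrays of equal shape")

    fused = np.empty_like(pred_tt, dtype=np.float64)
    height, width = pred_tt.shape
    for r in range(0, height, block_size):
        for c in range(0, width, block_size):
            window = (slice(r, r + block_size), slice(c, c + block_size))
            a = pred_tt[window].astype(np.float64)
            b = pred_vp[window].astype(np.float64)
            sf_a, sf_b = spatial_frequency(a), spatial_frequency(b)
            if sf_a > sf_b:
                fused[window] = a
            elif sf_b > sf_a:
                fused[window] = b
            else:
                fused[window] = (a + b) / 2.0  # assumed tie-break
    return fused
```

Here `pred_tt` and `pred_vp` stand in for the TT-centered and VP-centered predictions for the same image. The fused map would still need thresholding to yield a binary mask, and the block size trades spatial detail against robustness to noise.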