Abstract
Panoptic segmentation is crucial for surgical scene understanding but remains a significant challenge, largely due to the high cost of annotation, which leads to class imbalance in existing datasets and poor performance on categories with limited samples. To address this, we propose a generalized few-shot MM-former, a three-stage framework: (1) We build surgical image-text pairs from the CholecT50 dataset and use them to fine-tune a Stable Diffusion model, from which we extract multi-scale, image-text fused representations. (2) We train a Mask2Former-based panoptic segmentation decoder on base classes with sufficient samples, and use it to transform each image's representations into a set of mask proposals with category predictions. (3) We propose an N-to-M mask matching method. Given a small set of samples from N novel classes, we extract their features as guidance to match the M mask proposals, enabling identification of all novel-class objects in a single pass. Specifically, each matched proposal is updated with its most likely novel class, while the remaining proposals retain their original predictions. Finally, all proposals are merged to produce the final result. On CholecPanSeg, our newly built surgical panoptic dataset, the method achieves strong results under limited data, surpassing previous approaches.
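
To make the N-to-M matching step concrete, the following is a minimal sketch of one plausible realization: novel-class prototype features are compared against proposal embeddings and matched proposals are relabeled. The function name, tensor shapes, the cosine-similarity measure, and the threshold `tau` are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of N-to-M mask matching (assumed shapes and similarity measure).
import torch
import torch.nn.functional as F

def n_to_m_mask_matching(proposal_feats, base_logits, novel_prototypes, tau=0.5):
    """Match M mask proposals against N novel-class prototypes in one pass.

    proposal_feats:   (M, D) embeddings of the M mask proposals.
    base_logits:      (M, C_base) base-class predictions from the decoder.
    novel_prototypes: (N, D) features averaged from the few novel-class samples.
    Returns per-proposal labels: a novel-class index offset by C_base for
    matched proposals, or the original base-class argmax otherwise.
    """
    c_base = base_logits.shape[1]
    # Cosine similarity between every proposal and every novel prototype: (M, N).
    sim = F.cosine_similarity(
        proposal_feats.unsqueeze(1), novel_prototypes.unsqueeze(0), dim=-1
    )
    best_sim, best_novel = sim.max(dim=1)   # most likely novel class per proposal
    base_pred = base_logits.argmax(dim=1)   # original base-class prediction
    matched = best_sim > tau                # proposals claimed by a novel class
    labels = torch.where(matched, best_novel + c_base, base_pred)
    return labels, matched
```

Unmatched proposals keep their base-class predictions, so merging the two groups yields labels over both base and novel classes, consistent with the single-pass behavior described above.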