Abstract
Current applications of deep learning in renal pathology have largely focused on the morphology of anatomical structures, yet little research has examined model properties such as versatility in regions with severe kidney damage. In this study, we explored the difference in robustness to modality/domain shift between CNN-based and Transformer-based models. First, we adopted two splitting strategies (WSI-level and patch-level) to simulate sampling from multi-modal data distributions (i.e., renal whole-slide images (WSIs) collected from multiple centers). We then trained multiple CNN- and Transformer-based models on each splitting scheme. We compared cross-splitting performance and analyzed the factors affecting the results. For further validation, all models were tested on an independent external dataset in a sensitivity analysis on the degree of fibrosis and inflammation. At both the patch and WSI levels, M2F-Swin-B substantially outperformed UNet-ResNet18 in both average Intersection over Union (A-IoU) and per-class IoU. Notably, M2F-Swin-B outperformed UNet-ResNet18 in areas with a higher degree of fibrosis and inflammation and achieved a higher IoU for arteries. In summary, we developed a robust multi-class segmentation pipeline for kidney histology. Moreover, we showed that the attention mechanism in Mask2Former yields visibly crisper and more uniform segmentation, particularly when training data are limited.
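To make the two splitting strategies and the evaluation metrics concrete, the following is a minimal illustrative sketch in plain Python; it is not the pipeline used in this study. It assumes that each patch is represented as a hypothetical `(wsi_id, patch_index)` tuple, that the test fraction is 0.2, and that A-IoU is the unweighted mean of per-class IoU over classes present in the prediction or the ground truth. All function names are made up for illustration.

```python
import random


def patch_level_split(patches, test_fraction=0.2, seed=0):
    """Shuffle all patches together; patches from one WSI may land in both sets."""
    rng = random.Random(seed)
    shuffled = list(patches)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def wsi_level_split(patches, test_fraction=0.2, seed=0):
    """Hold out whole slides; no WSI contributes patches to both sets."""
    rng = random.Random(seed)
    wsi_ids = sorted({wsi_id for wsi_id, _ in patches})
    rng.shuffle(wsi_ids)
    n_test = max(1, int(round(len(wsi_ids) * test_fraction)))
    test_ids = set(wsi_ids[:n_test])
    train = [p for p in patches if p[0] not in test_ids]
    test = [p for p in patches if p[0] in test_ids]
    return train, test


def per_class_iou(pred, target, num_classes):
    """IoU per class over flat label sequences; None if a class is absent from both."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def average_iou(ious):
    """Unweighted mean over classes with a defined IoU (an assumed A-IoU definition)."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


if __name__ == "__main__":
    # Hypothetical data: 5 slides with 10 patches each.
    patches = [(f"wsi_{i}", j) for i in range(5) for j in range(10)]
    train, test = wsi_level_split(patches)
    print(sorted({w for w, _ in test}))  # slides held out entirely
    ious = per_class_iou([0, 1, 1, 2], [0, 1, 2, 2], num_classes=3)
    print(ious, average_iou(ious))
```

Under the patch-level split, neighbouring patches of the same slide can appear in both training and test sets, so test scores may partly reflect slide-specific appearance; the WSI-level split removes that overlap, which is why comparing the two can be informative about domain shift.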