Abstract
Background/Objectives: This study proposes CMT-BUSNet, a hybrid architecture integrating CNN, Mamba, and Transformer branches for breast ultrasound tumor segmentation with built-in explainability. Methods: CMT-BUSNet employs a CNN-anchored hierarchical parallel encoder in which Mamba and Transformer branches process CNN-derived features in parallel and are fused through an Adaptive Feature Fusion Module (AFFM), combined with a Dense Nested Decoder and a Boundary-Aware Composite Loss. Five-fold cross-validation on BUS-BRA (N = 1875) compared nine architectures under identical protocols, with nnU-Net v2 trained under its default self-configuring protocol as an additional benchmark. External evaluation used the BUSI dataset (N = 647). Results: CMT-BUSNet achieved DSC = 0.9037 ± 0.0047 on BUS-BRA with better boundary delineation than nnU-Net v2 (B-IoU: 0.611 vs. 0.557; HD95: 10.07 vs. 13.54 pixels), despite nnU-Net’s marginally higher DSC (0.9108). On BUSI, CMT-BUSNet (DSC = 0.6709) outperformed nnU-Net (DSC = 0.5579) across all metrics under zero-shot transfer, though the two methods were trained under different protocols. Training-based ablation confirmed the contribution of each component, and quantitative XAI validation demonstrated attribution faithfulness (nEAR = 2.82×) and uncertainty–error correlation (r = 0.39). Conclusions: Relative to nnU-Net (trained under a different protocol), CMT-BUSNet achieves competitive accuracy with higher boundary metrics, preliminary cross-dataset transferability, and built-in interpretability. Internal validation folds are image-disjoint but not guaranteed to be patient-disjoint, which should be considered when interpreting the reported metrics. Multicenter validation is required before clinical deployment.