Abstract
Accurate polyp segmentation is crucial for computer-aided diagnosis and early detection of colorectal cancer. Although the feature pyramid network (FPN) and its variants are widely used in polyp segmentation, FPN has inherent limitations: (1) repeated upsampling degrades fine details, reducing segmentation accuracy on small polyps, and (2) naive feature fusion (e.g., summation) inadequately captures global context, limiting performance on complex structures. To address these limitations, we propose a cascaded aggregation network (CANet) that systematically integrates multi-level features into a refined representation. CANet adopts the Pyramid Vision Transformer (PVT) as its backbone to extract robust multi-level representations and introduces a cascade aggregation module (CAM) that enriches semantic features without sacrificing spatial details. CAM adopts a top-down enhancement pathway in which high-level features progressively guide the fusion of multiscale information, enhancing semantic representation while preserving spatial details. CANet further integrates a multiscale context-aware module (MCAM) and a residual-based fusion module (RFM). MCAM applies parallel convolutions with diverse kernel sizes and dilation rates to low-level features, enabling fine-grained multiscale extraction of local details and enhancing scene understanding. RFM fuses these local features with the high-level semantics from CAM, enabling effective cross-level integration. Experiments show that CANet outperforms state-of-the-art methods on both in-distribution and out-of-distribution tests.
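To make the MCAM idea concrete, the following is a minimal NumPy sketch of parallel dilated convolutions over a single-channel feature map. The kernel (a 3x3 averaging filter) and the dilation rates are illustrative placeholders, not the paper's actual MCAM configuration.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """'Same'-padded 2D convolution of a single-channel map with a dilation rate."""
    k = kernel.shape[0]
    eff = dilation * (k - 1) + 1          # effective receptive-field size
    pad = eff // 2
    xp = np.pad(x, pad, mode="constant")  # zero padding keeps the output size
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # sample the input on a dilated grid, then weight by the kernel
            patch = xp[i:i + eff:dilation, j:j + eff:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def multiscale_context(x, dilations=(1, 2, 3)):
    """Parallel dilated convolutions stacked channel-wise (toy MCAM-style branch)."""
    kernel = np.ones((3, 3)) / 9.0        # illustrative averaging kernel
    return np.stack([dilated_conv2d(x, kernel, d) for d in dilations])
```

Each branch sees the same low-level features at a different receptive field, so stacking the branch outputs gives a multiscale local-context descriptor; a full implementation would additionally vary kernel sizes and fuse the branches with learned weights.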