Abstract
Accurate segmentation of underwater pipelines is essential for marine infrastructure inspection. However, deep learning models often struggle with extreme underwater conditions such as low light, sea snow, and sea fog, leading to poor generalization on unseen data. Existing approaches typically prioritize either accuracy or computational efficiency, leaving the balance between the two unresolved. This paper introduces a novel hybrid architecture, the Swin Transformer-EFSNet fusion network, which delivers state-of-the-art accuracy with significantly reduced computational complexity and strong generalization capability. The model employs a dual-encoder design: a lightweight Swin Transformer branch that captures contextual relationships and a modified EFSNet branch optimized for efficient local feature extraction. Their outputs are dynamically integrated by a three-head cross-attention fusion module, which prioritizes salient spatial and contextual information before the decoder produces the final segmentation mask. We also present the HOMOMO dataset, a new benchmark containing images with challenging conditions such as low light, fog, sea snow, and complex occlusions (e.g., pipelines buried under sand or covered by vegetation). Extensive experiments on HOMOMO and two public datasets demonstrate that our method outperforms strong baselines, including UNet, SwinUNet, TransUNet, Mask2Former, YOLOv5, YOLOv11, and YOLOv12. On HOMOMO, our model achieves an mIoU of 98.44% and an F-boundary of 82.01%, surpassing the best-performing baseline by 8.43% and 5.34%, respectively. Crucially, the proposed model generalizes well to unseen data, demonstrating robustness against domain shifts. By effectively balancing global and local processing, our hybrid design achieves high accuracy without imposing heavy computational costs.
These results establish a new paradigm for efficient and reliable visual perception in underwater environments, paving the way for practical autonomous inspection systems.
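As an illustration of the fusion step described above, the following is a minimal, dependency-free sketch of three-head cross-attention between two token sequences. It is not the model's implementation. The function name `cross_attention_fusion`, the choice to let local (EFSNet-like) tokens query global (Swin-like) tokens, the residual connection, and the random projection matrices standing in for learned weights are all assumptions made for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random weights; a stand-in for learned projection parameters."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention_fusion(global_tokens, local_tokens, num_heads=3, seed=0):
    """Fuse two token sequences with multi-head cross-attention.

    global_tokens: N_g vectors of size d (hypothetical Swin-branch output).
    local_tokens:  N_l vectors of size d (hypothetical EFSNet-branch output).
    Returns N_l fused vectors of size d.
    """
    dim = len(local_tokens[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    rng = random.Random(seed)
    w_q, w_k, w_v, w_o = (random_matrix(dim, dim, rng) for _ in range(4))

    # Assumption: queries come from the local branch, keys/values from the global one.
    q = matmul(local_tokens, w_q)
    k = matmul(global_tokens, w_k)
    v = matmul(global_tokens, w_v)

    head_outputs = []
    for h in range(num_heads):
        cols = slice(h * head_dim, (h + 1) * head_dim)
        q_h = [row[cols] for row in q]
        k_h = [row[cols] for row in k]
        v_h = [row[cols] for row in v]
        k_t = [list(col) for col in zip(*k_h)]  # transpose to head_dim x N_g
        scores = matmul(q_h, k_t)  # N_l x N_g similarity scores
        weights = [softmax([s / math.sqrt(head_dim) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v_h))  # N_l x head_dim

    # Concatenate the heads along the feature axis, then apply the output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(local_tokens))]
    projected = matmul(concat, w_o)

    # Assumption: a residual connection keeps the original local features.
    return [[x + y for x, y in zip(l_row, p_row)]
            for l_row, p_row in zip(local_tokens, projected)]
```

Calling `cross_attention_fusion(global_tokens, local_tokens)` with, say, four global and five local 6-dimensional tokens returns five fused 6-dimensional vectors, one per local token. In a real network the projections would be trained, and the fused features would be reshaped to a spatial map before decoding.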