Abstract
Accurate segmentation of navigable water and obstacles is critical for unmanned surface vessel navigation, yet it remains challenging in real aquatic environments characterized by complex water textures and blurred boundaries. Current models often struggle to capture long-range contextual dependencies and fine spatial details simultaneously, frequently producing fragmented segmentation results. To address these issues, we present a novel segmentation model based on the CoAtNet architecture. Our framework employs an enhanced convolutional attention encoder in which a Fused Mobile Inverted Bottleneck Convolution (Fused-MBConv) module refines boundary features and a Convolutional Block Attention Module (CBAM) strengthens channel and spatial feature attention. The model incorporates a Bi-level Former (BiFormer) to jointly model global and local features, complemented by a Multi-scale Attention Aggregation (MSAA) module that captures contextual information across scales. A U-Net-style decoder gradually restores spatial resolution through skip connections and upsampling. In our experiments, the model achieves 95.15% mIoU on a self-collected dataset and 98.48% on the public MaSTr1325 dataset, outperforming DeepLabV3+, SeaFormer, and WaSRNet. These results demonstrate the model's ability to interpret complex aquatic environments for autonomous navigation.
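To illustrate the encoder-decoder structure the abstract describes, the following is a minimal, framework-free sketch of the spatial-resolution flow: the encoder stages (where the Fused-MBConv, CBAM, and BiFormer blocks would sit) progressively halve the feature-map size, and a U-Net-style decoder upsamples step by step, fusing each skip connection at the matching encoder resolution. The number of stages and the factor-of-2 downsampling are assumptions for illustration, not values taken from the paper.

```python
def unet_shape_flow(h, w, stages=4):
    """Trace feature-map sizes through an assumed 4-stage encoder-decoder.

    Returns (encoder_shapes, decoder_shapes); the decoder ends at the
    original (h, w) input resolution.
    """
    # Encoder: each stage halves resolution; skip features are saved
    # for the decoder (this is where attention/conv blocks would apply).
    skips = []
    for _ in range(stages):
        h, w = h // 2, w // 2
        skips.append((h, w))

    # Decoder: upsample by 2 at each step and fuse the skip connection
    # whose resolution matches the upsampled feature map.
    decoded = []
    for skip in reversed(skips[:-1]):
        h, w = h * 2, w * 2
        assert (h, w) == skip  # skip resolution must match for fusion
        decoded.append((h, w))
    h, w = h * 2, w * 2  # final upsample back to the input size
    decoded.append((h, w))
    return skips, decoded
```

For a 512x512 input, the encoder produces 256x256 down to 32x32 feature maps, and the decoder walks back up to 512x512, which is the gradual restoration of spatial resolution mentioned above.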