Abstract
A detailed understanding of the spatial configuration of buildings is essential for protecting and improving the natural environment, maintaining urban safety, making efficient use of construction resources, and supporting urban planning. However, most existing models fail to fully exploit the global contextual information in remote sensing imagery. In complex urban scenes in particular, enlarging the background context to cover diverse scenarios inevitably reduces the proportion of building pixels in the image, which introduces multi-scale challenges. To address this problem, this paper proposes an Improved Attention Swin-UperNet (IASUNet) semantic segmentation model for precise building classification. First, a Convolutional Block Attention Module (CBAM) is inserted between the Swin Transformer encoder and the UperNet decoder, strengthening the model's ability to extract global contextual information from remote sensing images in both the spatial and channel dimensions. Second, Focal Cross Entropy (FCE) loss is adopted as the loss function, mitigating the multi-scale problem by increasing the weights assigned to underrepresented classes. Experimental results show that the improved Attention Swin-UperNet outperforms the original model in mean accuracy (mAcc) and mean Intersection over Union (mIoU) for building identification, demonstrating its effectiveness for precise building classification and its value for environmental protection, urban safety, construction resource utilization, and urban planning.
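The CBAM referred to above applies channel attention followed by spatial attention to a feature map. As a rough illustration only (not the paper's implementation), the sketch below uses NumPy, replaces CBAM's 7x7 spatial convolution with a simple per-map weighting for brevity, and takes the MLP weights `w1`, `w2` and spatial weights `w_spatial` as hypothetical externally supplied parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(x, w1, w2, w_spatial):
    """Simplified CBAM sketch. x: feature map of shape (C, H, W)."""
    # Channel attention: avg- and max-pooled channel descriptors
    # pass through a shared two-layer MLP (w1 reduces, w2 restores).
    avg = x.mean(axis=(1, 2))                        # (C,)
    mx = x.max(axis=(1, 2))                          # (C,)
    ca = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                 + w2 @ np.maximum(w1 @ mx, 0.0))    # (C,), values in (0, 1)
    x = x * ca[:, None, None]
    # Spatial attention: channel-wise avg and max maps, mixed by two
    # scalar weights here (real CBAM uses a 7x7 convolution instead).
    avg_map = x.mean(axis=0)                         # (H, W)
    max_map = x.max(axis=0)                          # (H, W)
    sa = sigmoid(w_spatial[0] * avg_map + w_spatial[1] * max_map)
    return x * sa[None, :, :]
```

Because both attention maps are sigmoid-gated, the module rescales rather than replaces features, so it can be inserted between an encoder and decoder without changing tensor shapes.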
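The Focal Cross Entropy loss mentioned above down-weights well-classified pixels so that scarce building pixels contribute more to training. A minimal NumPy sketch of the standard focal term, using the commonly cited defaults gamma=2 and alpha=0.25 (the paper's exact hyperparameters are not stated here), could be:

```python
import numpy as np

def focal_cross_entropy(probs, targets, gamma=2.0, alpha=0.25):
    """Focal cross entropy over already-softmaxed predictions.

    probs:   (N, C) class probabilities per pixel
    targets: (N,)   integer class labels
    """
    # Probability assigned to the true class of each pixel.
    pt = probs[np.arange(len(targets)), targets]
    # (1 - pt)^gamma shrinks the loss of confident (easy) pixels,
    # leaving hard, underrepresented pixels to dominate the average.
    return np.mean(-alpha * (1.0 - pt) ** gamma * np.log(pt))
```

Compared with plain cross entropy, a confidently classified background pixel (pt close to 1) contributes almost nothing, which counteracts the shrinking share of building pixels described in the abstract.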