Abstract
Semantic segmentation plays a critical role in understanding complex urban environments, particularly for autonomous driving applications. However, existing approaches degrade under low-light and adverse weather conditions. To address these limitations, we propose CSANet (Context Spatial Awareness Network), a novel framework that integrates the RGB and thermal infrared (TIR) modalities. CSANet employs an efficient encoder to extract complementary local and global features, and a hierarchical fusion strategy selectively integrates visual and semantic information. Specifically, the Channel-Spatial Cross-Fusion Module (CSCFM) enhances local details by fusing multi-modal features, while the Multi-Head Fusion Module (MHFM) captures global dependencies and calibrates multi-modal information. Furthermore, the Spatial Coordinate Attention Mechanism (SCAM) improves object localization accuracy in complex urban scenes. Evaluations on two benchmark datasets, MFNet and PST900, demonstrate that CSANet achieves state-of-the-art performance in RGB-T semantic segmentation.