Abstract
The fusion of data from visible (RGB) and infrared (IR) sensors is essential for robust all-day, all-weather object detection. However, existing methods often suffer from modality redundancy and noise interference. To address these challenges, we propose the Decoupled and Differentiated Attention Fusion Network (DDAF-Net). Architecturally, DDAF-Net employs a decoupled backbone in which a Siamese weight-sharing branch extracts modality-common features while parallel branches capture modality-specific features. To integrate these features effectively, we design the Differentiated Attention Fusion Module (DAFM). First, we introduce Spatial Residual Unshuffle Embedding (SRUE) to achieve lossless downsampling while preserving global semantic information. Second, differentiated attention mechanisms are applied for feature enhancement: Dual-Norm Alignment Attention (DNAA) aligns the modalities and enhances semantic consistency in the modality-common features, while Sparse Purification Attention (SPA) enables selective use of complementary information by suppressing noise and focusing on salient regions in the modality-specific features. Finally, the Adaptive Complementary Fusion Module (ACFM) integrates these components, using the modality-common features as a baseline and dynamically weighting the complementary modality-specific information. Extensive experiments on the public LLVIP and M3FD datasets demonstrate that DDAF-Net achieves state-of-the-art performance. These results validate the effectiveness of our proposed decoupling-enhancement-fusion paradigm.
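The "lossless downsampling" underlying SRUE-style embeddings is commonly realized by a pixel-unshuffle (space-to-depth) rearrangement, which halves spatial resolution by moving pixels into channels rather than discarding them. The following is a minimal NumPy sketch of that generic rearrangement only; the function names are illustrative, and the paper's SRUE module (its residual embedding in particular) is not reproduced here.

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Space-to-depth: (C, H, W) -> (C*r*r, H/r, W/r).

    Lossless downsampling: every pixel is preserved, merely
    relocated into the channel dimension.
    """
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial dims must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)       # split each spatial axis into (coarse, offset)
    x = x.transpose(0, 2, 4, 1, 3)               # bring the r*r offsets next to channels
    return x.reshape(c * r * r, h // r, w // r)  # fold offsets into channels

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Exact inverse: (C*r*r, H/r, W/r) -> (C, H, W)."""
    c2, h2, w2 = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h2, w2)
    x = x.transpose(0, 3, 1, 4, 2)               # interleave offsets back into space
    return x.reshape(c, h2 * r, w2 * r)
```

Because the inverse recovers the input exactly, no spatial information is lost at the downsampling step, unlike strided convolution or pooling.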
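The ACFM description above (modality-common features as a baseline, modality-specific features dynamically weighted in) can be sketched as a gated residual fusion. This is an illustration under assumptions, not the paper's module: the per-pixel softmax gate and the `gate_logits` input are hypothetical stand-ins for whatever learned gating head ACFM actually uses.

```python
import numpy as np

def adaptive_complementary_fusion(common: np.ndarray,
                                  specific_rgb: np.ndarray,
                                  specific_ir: np.ndarray,
                                  gate_logits: np.ndarray) -> np.ndarray:
    """Fuse features: common baseline + dynamically weighted complement.

    common, specific_rgb, specific_ir: (C, H, W) feature maps.
    gate_logits: (2, H, W) unnormalized per-pixel weights; in a real
    network these would come from a small learned gating head
    (an assumption here, not the paper's exact design).
    """
    # Per-pixel softmax over the two modality-specific streams.
    w = np.exp(gate_logits - gate_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    complementary = w[0] * specific_rgb + w[1] * specific_ir
    # Modality-common features serve as the baseline signal.
    return common + complementary
```

With equal gate logits the two specific streams contribute equally; skewed logits let the fusion lean on whichever modality is more informative at each location.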