Abstract
Object detection in visible-light (RGB) images is frequently compromised by low-illumination conditions, whereas infrared (IR) imaging typically remains robust in such environments. Multispectral fusion addresses this limitation by leveraging complementary information from both modalities; however, existing methods predominantly rely on intricate fusion modules to integrate cross-modal features, inevitably incurring significant computational overhead and architectural complexity. To mitigate this issue, we propose the Cross-modal Orthogonal Representation Enhancement Network (CORE-Net). Departing from conventional heavy-fusion paradigms, our framework adopts a dual-branch architecture integrated with a streamlined Cross-modal Concatenation Network Framework (CCNF), which achieves efficient feature integration while substantially reducing model complexity. Furthermore, CORE-Net incorporates two dedicated components, the Multiple Pooling Convolution Downsampling (MPCD) module and the Refined Integration Network (RINet), each designed to strengthen feature extraction. Extensive evaluations on the DroneVehicle and LLVIP datasets demonstrate that CORE-Net achieves state-of-the-art (SOTA) performance in both detection accuracy and computational efficiency. Ablation studies substantiate the individual and synergistic contributions of each proposed component, and deployment on edge devices further corroborates the model's practical efficiency. Additionally, qualitative visualizations confirm the model's efficacy in suppressing background noise and enhancing discriminative fine-grained features. In summary, CORE-Net offers an accurate and efficient approach to multispectral object detection.