Abstract
RGB-infrared (RGB-IR) object detection leverages complementary information from the two modalities to substantially enhance perception in complex environments, which is particularly beneficial for reliable detection under adverse imaging conditions such as low illumination and severe haze. However, RGB-IR object detection still faces several challenges arising from pronounced intra-modality and cross-modality discrepancies. On the one hand, many existing approaches rely on complex architectures to strengthen cross-modal interaction, which increases computational cost. On the other hand, symmetric dual-branch backbones with a static fusion paradigm often fail to explicitly characterize the discrepancies between the RGB and IR modalities; this prevents effective mining of complementary information and reduces the discriminability of the fused representations. To address these issues, this paper presents a lightweight RGB-IR multimodal detection network (LMDENet) with three key components: (1) an illumination-guided label selection (IGLS) strategy that integrates RGB and IR labels via cross-modal matching and illumination-aware rules to construct consistent and reliable supervision; (2) a heterogeneous backbone network (HBN) whose differentiated branches separately model RGB appearance details and IR structural information, improving modality-specific representation learning; and (3) a difference-complement enhancement module (DCEM) that explicitly decomposes cross-modal features into common and difference components and performs selective enhancement to amplify complementary information while suppressing redundant noise. We evaluate the proposed model on the multimodal remote sensing dataset DroneVehicle and conduct supplementary experiments on the LLVIP dataset to verify its generalization across scenarios. LMDENet achieves 78.9% and 93.6% mAP@0.5 on DroneVehicle and LLVIP, respectively, while containing only 3.3M parameters and requiring 8.7 GFLOPs, reflecting a favorable accuracy-efficiency balance.
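To make the difference-complement idea concrete, the following is a minimal PyTorch-style sketch of one plausible reading of the DCEM: the common component is taken as the modality mean, per-modality difference components are the residuals from it, and learned sigmoid gates selectively re-inject those residuals. All specifics here (the class name `DCEMSketch`, the mean-based common component, and the 1x1-conv gating) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DCEMSketch(nn.Module):
    """Sketch of a difference-complement enhancement step (assumptions only).

    Decomposes paired RGB/IR feature maps into a shared (common) component
    and per-modality difference components, then applies learned gates so
    complementary differences are amplified and redundant ones suppressed.
    """

    def __init__(self, channels: int):
        super().__init__()
        # One gate per modality, conditioned on the concatenated features.
        self.gate_rgb = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.Sigmoid()
        )
        self.gate_ir = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        common = 0.5 * (f_rgb + f_ir)          # shared component
        diff_rgb = f_rgb - common              # RGB-specific residual
        diff_ir = f_ir - common                # IR-specific residual
        ctx = torch.cat([f_rgb, f_ir], dim=1)  # gating context
        # Selectively re-inject modality-specific residuals into the fusion.
        return common + self.gate_rgb(ctx) * diff_rgb + self.gate_ir(ctx) * diff_ir

if __name__ == "__main__":
    m = DCEMSketch(channels=64)
    rgb, ir = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    print(m(rgb, ir).shape)  # torch.Size([1, 64, 32, 32])
```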