Abstract
Accurate yet computationally efficient object detection in remote sensing imagery is central to intelligent interpretation systems. Despite substantial recent progress, prevailing approaches remain deficient in three respects: the discriminative capacity of feature representations, the depth of semantic modeling, and the effectiveness of multi-scale information fusion. These shortcomings are most pronounced for small targets, which are easily missed or misclassified. To address them, this work introduces HyperFusion-DEIM, a cascaded detection framework designed to jointly strengthen object-level representations, enrich contextual semantic dependencies, and improve scale-aware feature integration. At its core is the Multi-Path Attention Network (MAPNet), which enhances shallow semantic cues and edge-texture sensitivity for small-object recognition by combining the Multi-Path Attention Fusion (MPAF) module with the Shallow Robust Feature Downsampling (SRFD) mechanism. Complementing MAPNet, the Scale-Aware Feature Enhancement (SAFE) encoder uses a Multi-level Feature Concentration (MFC) module for cross-layer geometric alignment, while the integration of Transformer layers with HyperACE captures long-range semantic correlations without compromising spatial fidelity. Experiments on the SIMD and VEDAI benchmarks show that HyperFusion-DEIM outperforms state-of-the-art lightweight detectors in both accuracy and robustness. On SIMD, the model attains 64.5% AP, outperforming RT-DETR and DEIM by 4.8% and 4.6%, respectively, while sustaining a peak inference throughput of 296.33 FPS.
On VEDAI, HyperFusion-DEIM surpasses YOLOv12 and YOLOv13 by 4.9% and 8.0%, and RT-DETRv2 and DEIM by 2.5% and 8.5%, respectively, while maintaining real-time operation at 79.7 FPS. These results indicate that HyperFusion-DEIM is practical for real-time detection, particularly in resource-constrained environments where both speed and accuracy are critical.