Abstract
Small object detection, an important research topic in computer vision, is widely applied in aerial vision tasks such as remote sensing and UAV imagery. However, because objects in aerial scenes are small, vary greatly in scale, and appear against complex backgrounds, existing detection models often struggle to capture fine-grained semantics and high-resolution texture information, which limits their performance. To address these issues, this paper proposes an efficient aerial small object detection model, EABI-DETR (Efficient Attention and Bi-level Integration DETR), based on the RT-DETR framework. The proposed model makes three improvements: (1) A lightweight backbone network, C2f-EMA, is developed by integrating the C2f structure with an efficient multi-scale attention (EMA) mechanism. This design jointly models channel semantics and spatial details with minimal computational overhead, thereby strengthening the perception of small objects. (2) A bi-directional multi-scale fusion module, P2-BiFPN, is designed to incorporate shallow high-resolution features. Through top-down and bottom-up feature interactions, this module enhances cross-scale information flow and preserves the fine details and textures of small objects. (3) To improve localization robustness, a Focaler-MPDIoU loss function is introduced to better handle hard samples during bounding box regression. Experiments on the VisDrone2019 dataset show that EABI-DETR achieves 53.4% mAP@0.5 and 34.1% mAP@0.5:0.95, outperforming RT-DETR by 6.2 and 5.1 percentage points, respectively, while maintaining high inference efficiency. These results confirm the effectiveness of integrating lightweight attention mechanisms and shallow feature fusion for aerial small object detection, offering a new paradigm for efficient UAV-based visual perception.
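As an illustration of the third component, the following is a minimal sketch of how a Focaler-MPDIoU loss could be computed for a single pair of boxes. It assumes the commonly cited definitions of MPDIoU (IoU minus the squared distances between matching corners, normalized by the squared image diagonal) and Focaler-IoU (a linear remapping of IoU onto an interval [d, u]); the exact formulation used in EABI-DETR may differ. The function names, the (x1, y1, x2, y2) box format, and the default values of d and u are illustrative assumptions, not taken from the paper.

```python
def iou_xyxy(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def focaler_mpdiou_loss(pred, target, img_w, img_h, d=0.0, u=0.95):
    """Sketch of a Focaler-MPDIoU loss for one predicted/target box pair.

    d and u are the assumed lower and upper bounds of the Focaler
    remapping interval; the defaults here are illustrative only.
    """
    iou = iou_xyxy(pred, target)

    # MPDIoU: subtract the squared top-left and bottom-right corner
    # distances, each normalized by the squared image diagonal.
    norm = img_w ** 2 + img_h ** 2
    d1 = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    d2 = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2
    mpdiou = iou - d1 / norm - d2 / norm

    # Focaler-IoU: linearly rescale IoU on [d, u] and clamp to [0, 1].
    iou_focaler = min(1.0, max(0.0, (iou - d) / (u - d)))

    # Combined loss: MPDIoU loss plus the Focaler correction term.
    return (1.0 - mpdiou) + iou - iou_focaler
```

Under these assumptions, a perfectly aligned prediction gives a loss of zero, while a prediction that does not overlap the target is penalized by the corner-distance terms even though its IoU is zero, so the loss still distinguishes near misses from distant ones.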