Abstract
In UAV-based downstream tasks, the intelligent interpretation of UAV imagery demands both high real-time performance and high accuracy. However, achieving high-precision, real-time object detection in UAV images is challenging due to the prevalence of small objects (e.g., persons and bicycles), uneven target distribution, occlusion, and other factors. Current UAV object detection algorithms lack comprehensive solutions to the multifaceted challenges encountered in real-world deployment, resulting in suboptimal performance. Moreover, directly applying mainstream real-time detectors such as the YOLO series to UAV images leads to a significant performance drop. To address these issues, this paper presents an enhanced real-time object detection network named YOLO-UD, built upon the YOLO11 architecture. Our approach achieves superior feature representation through the effective integration of contextual information and adaptive multi-scale fusion. Specifically, we introduce a novel C3kHR module that employs dilated convolutions with varying rates to capture contextual information at multiple levels of granularity, yielding richer multi-scale feature representations. In addition, we design an efficient adaptive feature fusion network (EAFN) that filters and prioritizes key information from multi-scale feature layers and flexibly supplies the detection head with the information it needs. A small object detection layer (SMDL) is also introduced to strengthen the detection of small objects by providing rich information about small targets. Finally, extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that YOLO-UD achieves an excellent balance between accuracy and inference speed, validating its effectiveness.
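The multi-rate dilated-convolution idea behind modules like C3kHR can be illustrated in isolation. The following is a hypothetical pure-Python sketch (not the paper's actual C3kHR implementation): the same small kernel is applied with increasing dilation rates, so each branch covers a progressively wider context window, and the per-position branch responses are concatenated into a multi-scale feature.

```python
def dilated_conv1d(signal, kernel, dilation):
    """1-D convolution with a dilated kernel, zero-padded so the
    output has the same length as the input."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for i in range(len(signal)):
        # Taps are spaced `dilation` apart: a 3-tap kernel at rate r
        # spans a window of width 2*r + 1 around position i.
        acc = 0.0
        for j in range(k):
            acc += kernel[j] * padded[i + j * dilation]
        out.append(acc)
    return out

def multi_rate_features(signal, kernel, rates=(1, 2, 3)):
    """Run one branch per dilation rate and concatenate the responses
    position-wise, mimicking multi-granularity context capture."""
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    return [tuple(b[i] for b in branches) for i in range(len(signal))]

# An impulse at index 2 spreads to indices 2 ± r in the rate-r branch,
# showing how larger rates aggregate wider context with the same kernel.
signal = [0, 0, 1, 0, 0, 0, 0]
kernel = [1.0, 1.0, 1.0]
feats = multi_rate_features(signal, kernel)
```

In the 2-D detection setting the same principle applies per spatial location; the paper's module presumably also fuses the branches with learned weights, which this sketch omits.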