Abstract
Drone-based object detection faces critical challenges, including tiny objects, complex urban backgrounds, dramatic scale variations, and the loss of high-frequency detail during feature propagation. Current detection methods struggle to address these challenges effectively while maintaining computational efficiency. We propose the Scale-Frequency Detection Transformer (SF-DETR), a novel end-to-end framework for drone-view scenarios. SF-DETR introduces a lightweight ScaleFormerNet backbone with Dual Scale Vision Transformer modules, a Bilateral Interactive Feature Enhancement Module, and a Multi-Scale Frequency-Fused Feature Enhancement Network. Extensive experiments on the VisDrone2019 dataset demonstrate SF-DETR's superior performance: it achieves 51.0% mAP50 and 31.8% mAP50:95, surpassing state-of-the-art methods such as YOLOv9m and RTDETR-r18 by 6.2% and 4.0%, respectively. Further validation on the HIT-UAV dataset confirms the model's generalization capability. Our work establishes a new benchmark for drone-view object detection and provides a lightweight architecture suitable for embedded device deployment in real-world aerial surveillance applications.