Abstract
Object detection in high-resolution aerial imagery is challenging because of scale variation, occlusion, background clutter, and the scarcity of annotated datasets. Although CNN-based detectors such as YOLO and Faster R-CNN have advanced considerably, they do not capture long-range dependencies effectively. We propose a CNN-augmented detection transformer, which we call CATR. We compare the proposed framework with the transformer-based DETR and with state-of-the-art CNN detectors on the DOTA dataset. DETR, with its end-to-end transformer and direct set prediction, streamlines the detection pipeline by removing anchor boxes and non-maximum suppression, which improves robustness in cluttered aerial scenes. In our experiments, DETR achieves 72% mAP@0.5, outperforming the CNN baselines by up to 13%. However, DETR incurs a higher computational cost (86.3 GFLOPs) and a lower inference speed (12 FPS). The proposed hybrid CNN-transformer architecture balances accuracy and speed: it combines CNN features with global attention to improve small-object detection, and it is further augmented by CNN-based segmentation. This study indicates that transformer models, especially when combined with CNNs, are highly promising for complex aerial environments and offer a strong alternative to traditional CNN detectors because they model context and occlusion globally. Although efficiency improvements are still ongoing, this research provides a useful path for future geospatial applications, including remote sensing and disaster response.