Abstract
Object detection, a cornerstone of computer vision, aims to localize and classify objects within images. This survey reviews modern object detection methods, focusing on the two dominant paradigms: Convolutional Neural Networks (CNNs) and Transformer-based architectures. We provide a structured comparison of the two paradigms, highlighting their complementary strengths and trade-offs: CNNs offer advantages in local feature extraction and computational efficiency, whereas Transformers excel at capturing global context through self-attention. We also analyze multi-modal fusion techniques that integrate Red-Green-Blue (RGB) imagery, Light Detection and Ranging (LiDAR) point clouds, and language embeddings. Benchmark results from representative models include: the Real-Time Detection Transformer (RT-DETR) achieves 53.1% mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.5:0.95, You Only Look Once version 8 (YOLOv8) achieves 50.2% mAP at the same thresholds, real-time detectors exceed 100 frames per second (FPS) with competitive accuracy, and specialized infrared methods achieve a 92.45% F-measure on the NUAA-SIRST dataset. This work introduces a novel taxonomy of multi-modal fusion strategies, documents field-wide and review-specific limitations, and synthesizes recent 2024 to 2025 benchmarks across diverse datasets. Despite these advances, significant challenges remain in handling scale variation, occlusion, and domain adaptation. We outline these persistent obstacles and promising research directions, providing a structured reference for researchers and practitioners.