The evolution of object detection from CNNs to transformers and multi-modal fusion



Abstract

Object detection, a cornerstone of computer vision, aims to localize and classify objects within images. This survey reviews modern object detection methods, focusing on the two dominant paradigms: Convolutional Neural Networks (CNNs) and Transformer-based architectures. We provide a structured comparison of the two paradigms, highlighting their complementary strengths and trade-offs: CNNs offer strong local feature extraction and computational efficiency, whereas Transformers excel at capturing global context through self-attention mechanisms. We also analyze multi-modal fusion techniques that integrate Red-Green-Blue (RGB) imagery, Light Detection and Ranging (LiDAR) point clouds, and language embeddings. Benchmark results from representative models include: the Real-Time Detection Transformer (RT-DETR) achieves 53.1% mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.50:0.95; You Only Look Once version 8 (YOLOv8) achieves 50.2% mAP under the same metric; real-time detectors exceed 100 frames per second (FPS) with competitive accuracy; and specialized infrared methods achieve a 92.45% F-measure on the NUAA-SIRST dataset. The survey introduces a novel taxonomy of multi-modal fusion strategies, documents field-wide and review-specific limitations, and synthesizes recent 2024 to 2025 benchmarks across diverse datasets. Despite these advances, significant challenges remain in handling scale variation, occlusion, and domain adaptation. We outline these persistent obstacles and promising research directions, providing a structured reference for researchers and practitioners.
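The abstract's central contrast — convolutions aggregate only a local neighborhood, while self-attention lets every position weight every other position — can be illustrated with a minimal NumPy sketch. This is our own simplification for intuition, not code from any model surveyed: a real Transformer detection layer applies learned query/key/value projections and multiple heads, which are omitted here.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of feature vectors.

    Each output row is a softmax-weighted mix of ALL input rows — the
    'global context' property contrasted with the local receptive
    fields of CNN convolutions. For clarity the input itself serves as
    queries, keys, and values (learned projections are omitted).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ x                               # every output attends globally

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))  # e.g. 5 image patches with 8-dim features
out = self_attention(tokens)
print(out.shape)  # (5, 8)
```

Because the `(n, n)` weight matrix is dense, every patch can influence every other patch in a single layer, whereas a convolution would need many stacked layers to achieve the same effective receptive field — at the cost of the quadratic compute that motivates efficient variants such as RT-DETR.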
