Abstract
As urbanization accelerates and road traffic grows, emergency vehicles frequently encounter congestion while responding to urgent calls, and a failure of surrounding traffic to yield promptly can cost critical rescue time. This study therefore develops RT-DETR-EVD, a lightweight, high-precision emergency vehicle detection model intended to strengthen urban emergency response. The proposed model replaces the ResNet backbone with a lightweight CSPDarknet backbone and integrates a hybrid C2f-MogaBlock architecture. A multi-order gated aggregation mechanism dynamically fuses multi-scale features, improving spatial-channel feature representation while reducing the number of parameters. In addition, an Attention-based Intra-scale Feature Interaction Dynamic Position Bias (AIDPB) module is designed, which replaces fixed positional encoding with a learnable dynamic position bias (DPB) to improve feature discrimination in complex scenes. Experimental results show that, under the same training conditions, RT-DETR-EVD outperforms the baseline RT-DETR-r18 model in emergency vehicle detection. Specifically, RT-DETR-EVD reduces the parameter count to 14.5 M (a 27.1% reduction), lowers floating-point operations (FLOPs) to 49.5 G (a 13.2% reduction), and improves precision by 0.5%. Recall and mean average precision at an IoU threshold of 0.5 (mAP50) each increase by 0.6%, with mAP50 reaching 88.3%. RT-DETR-EVD thus achieves a favorable balance between accuracy, efficiency, and scene adaptability: its lightweight design improves detection accuracy while reducing model size and accelerating inference. The model provides an efficient and reliable solution for smart city emergency response systems and shows strong potential for deployment in real-world engineering applications.
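As a quick sanity check on the efficiency figures above, the reported values and percentage reductions can be inverted to recover the baseline they imply. The sketch below is illustrative only: the helper name `implied_baseline` is hypothetical, and the baseline numbers it produces are derived from the abstract's figures rather than stated in it.

```python
def implied_baseline(reduced_value: float, reduction_pct: float) -> float:
    """Recover the pre-reduction value from a reduced value and its percent reduction.

    Assumes reduction_pct is measured relative to the baseline, i.e.
    reduced_value = baseline * (1 - reduction_pct / 100).
    """
    return reduced_value / (1.0 - reduction_pct / 100.0)


# Figures reported for RT-DETR-EVD in the abstract.
evd_params_m = 14.5   # parameters, in millions
evd_flops_g = 49.5    # FLOPs, in G

# Baseline values implied by the stated reductions (derived, not reported).
baseline_params_m = implied_baseline(evd_params_m, 27.1)
baseline_flops_g = implied_baseline(evd_flops_g, 13.2)

print(f"implied baseline parameters: {baseline_params_m:.1f} M")
print(f"implied baseline FLOPs: {baseline_flops_g:.1f} G")
```

Under this reading, the stated reductions correspond to a baseline of roughly 19.9 M parameters and 57.0 G FLOPs for RT-DETR-r18.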