Abstract
The Transformer-based object detection model DETR has powerful feature extraction and recognition capabilities, but its high computational and storage requirements limit its deployment on resource-constrained devices. To address this problem, we first replace the ResNet-50 backbone in DETR with Swin-T, unifying the backbone with the Transformer encoder and decoder under a single Transformer processing paradigm. On this basis, we propose a fully integer-only quantized inference scheme that compresses the model to reduce memory footprint and computational complexity. Unlike previous approaches that quantize only the linear layers of DETR, we further apply integer approximations to all non-linear operators (e.g., Sigmoid, Softmax, LayerNorm, GELU), so that the entire inference pipeline executes in the integer domain. Experimental results show that our method reduces computation and storage to 6.3% and 25% of the original model, respectively, while average precision decreases by only 1.1%, validating the method as an efficient, hardware-friendly solution for object detection.
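To make the integer-only idea concrete, the sketch below shows one way a non-linear operator such as Softmax can be approximated without floating-point arithmetic, in the spirit of shift-based integer-only schemes (e.g., I-BERT). This is an illustrative example, not the paper's exact formulation: the function name `int_softmax`, the linear interpolation of the fractional power of two, and the parameter names `scale` and `out_bits` are all assumptions made here for clarity.

```python
import numpy as np

def int_softmax(q_logits: np.ndarray, scale: float, out_bits: int = 8) -> np.ndarray:
    """Integer-only softmax sketch (illustrative, not the paper's scheme).

    q_logits : int32/int64 quantized logits with quantization scale `scale`.
    Returns probabilities quantized to `out_bits` fractional bits.
    Only the constant `steps` uses float math, and it is folded offline.
    """
    q = q_logits.astype(np.int64)
    # Subtract the row max so every input is <= 0 (standard stable softmax).
    q = q - q.max(axis=-1, keepdims=True)

    # exp(scale * q) = 2^(q / steps), where steps = ln(2) / scale is an
    # integer constant computed offline; everything below is integer math.
    steps = max(1, int(round(np.log(2) / scale)))
    t = -q                                # non-negative exponent magnitudes
    z = t // steps                        # whole octaves -> right shifts
    r = t % steps                         # fractional part of the exponent
    one = 1 << out_bits                   # fixed-point representation of 1.0
    # Linear interpolation of 2^(-r/steps) between 1 (r=0) and 1/2 (r=steps).
    frac = ((2 * steps - r) * one) // (2 * steps)
    exp_q = frac >> np.minimum(z, 62)     # apply 2^(-z) via right shift

    # Integer division normalizes to quantized probabilities in [0, 2^out_bits].
    denom = exp_q.sum(axis=-1, keepdims=True)
    return ((exp_q * one) // denom).astype(np.int32)
```

For logits `[0, -16]` with `scale = ln(2)/16`, this returns roughly `[170, 85]` out of 256, close to the true softmax `[0.667, 0.333]`; in a real deployment the linear interpolation would typically be replaced by a higher-order polynomial for tighter error bounds.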