Abstract
Person re-identification aims to match images of the same individual across non-overlapping cameras by analyzing personal characteristics. Transformer-based models have recently demonstrated excellent capability and achieved breakthrough progress on this task; however, their high computational cost and limited capacity to capture fine-grained local features constrain re-identification performance. To address these challenges, this paper proposes Toward Efficient Transformer-based Person Re-identification (TE-TransReID), a novel framework. Specifically, the framework retains only the first L layers of a pretrained Vision Transformer (ViT) for global feature extraction and combines them with local features extracted by a pretrained CNN, thus striking a trade-off between high accuracy and a lightweight network. Additionally, we propose a dual efficient feature-fusion strategy to integrate global and local features for accurate person re-identification: the Efficient Token-based Feature-Fusion Module (ETFFM) employs a gate-based network to learn fused token-wise features, while the Efficient Patch-based Feature-Fusion Module (EPFFM) uses a lightweight Transformer to aggregate patch-level features. TE-TransReID achieves rank-1 accuracies of 94.8%, 88.3%, and 85.7% on Market1501, DukeMTMC, and MSMT17, respectively, with only 27.5 M parameters. Compared with existing CNN-Transformer hybrid models, TE-TransReID maintains comparable recognition accuracy while drastically reducing model parameters, striking a favorable balance between recognition accuracy and computational efficiency.
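The abstract mentions that ETFFM fuses the global (ViT) and local (CNN) token features through a gate-based network. The following is only a minimal NumPy sketch of that general idea, under the assumption that the gate is a sigmoid over the concatenated features and the output is a channel-wise convex combination; all names, shapes, and weights here are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


d = 8  # illustrative feature dimension, not the paper's setting
g_vit = rng.standard_normal(d)  # hypothetical global token from the truncated ViT
l_cnn = rng.standard_normal(d)  # hypothetical local feature from the CNN branch

# Hypothetical gate parameters; these would be learned in practice,
# random values are used here only to make the sketch runnable.
W = rng.standard_normal((d, 2 * d))
b = np.zeros(d)

# Element-wise gate in (0, 1) computed from both features
gate = sigmoid(W @ np.concatenate([g_vit, l_cnn]) + b)

# Channel-wise convex combination of the two feature vectors
fused = gate * g_vit + (1.0 - gate) * l_cnn

print(fused.shape)  # (8,)
```

Because the gate lies in (0, 1) per channel, each fused channel interpolates between the ViT and CNN responses, which is one simple way a learned gate can weigh global against local evidence.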