Abstract
Transformer-based human pose estimation methods have made encouraging progress in improving performance. However, the excellent performance of pose networks often comes at the cost of heavy computation and large network scale. To address this problem, this paper proposes a High-accuracy and Efficient Vision Transformer for Human Pose Estimation (HEViTPose). First, the concept of Patch Embedded Overlap Width (PEOW) is introduced to help understand the relationship between the amount of overlap and local continuity. By explicitly adjusting the PEOW value, the model's capacity to capture local continuity information is enhanced. Second, a Cascaded Group Spatial Reduction Multi-Head Attention (CGSR-MHA) is proposed, which improves memory efficiency through feature grouping, reduces computational cost through spatial reduction, and improves network performance by retaining multiple low-dimensional attention heads. Finally, comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the HEViTPose model performs on par with state-of-the-art models while being more lightweight and offering higher inference speed. Specifically, compared with HRNet, which has similar performance and inference speed, the proposed model reduces the number of parameters by 62.1% and the amount of computation by 43.4%. Compared with HRFormer, which has similar performance and network size, the inference speed is about 2.6 times faster. Code and models are available at https://github.com/T1sweet/HEViTPose.