Abstract
Head pose estimation is a fundamental task in computer vision, serving as an effective means of roughly determining a person's gaze direction. However, accurate head pose estimation remains highly challenging due to occlusion and low resolution. To address this challenge, this paper proposes a novel framework that combines the classification and regression paradigms for head pose estimation. First, we design a novel soft-label generation strategy for classification. This strategy generates 3D facial models at different angles and then measures the similarity between poses using the displacements of 3D key points across views. In addition, we introduce the Stacked Dual Attention Module (SDAM), which comprises the Multi-Receptive Attention Module (MRAM) and the Channel-wise Self-Attention Module (CSAM). MRAM uses convolution kernels of different sizes and explores multiple contextual semantics to perceive key features. CSAM employs a self-attention mechanism to adaptively model inter-channel dependencies, achieving effective channel attention. The design of SDAM accounts for the characteristics of the task itself, enabling it to extract more representative features and to be easily deployed in mainstream network architectures (e.g., ResNet). Extensive experiments on popular datasets demonstrate the competitiveness of our method. Furthermore, we apply the proposed head pose estimation method to approximate students' gaze points in large classroom scenarios.