Abstract
Driver distraction is a major contributing factor in traffic accidents. However, existing computer vision-based methods for driver attention state recognition face two limitations: monocular camera-based approaches often suffer from low accuracy, while multi-sensor data fusion techniques are compromised by poor real-time performance. To address these limitations, this paper proposes a Real-time Driver Attention State Recognition method (RT-DASR). RT-DASR comprises two core components: Binocular Vision Depth-Compensated Head Pose Estimation (BV-DHPE) and Multi-source Temporal Bidirectional Long Short-Term Memory (MSTBi-LSTM). BV-DHPE employs binocular cameras and YOLO11n-Pose (You Only Look Once) to locate facial landmarks, computing spatial distances from binocular disparity to compensate for the depth information unavailable to a monocular camera and thereby estimate head pose accurately. MSTBi-LSTM uses a lightweight Bidirectional Long Short-Term Memory (Bi-LSTM) network to fuse head pose angles, real-time vehicle speed, and gaze region semantics, extracting temporal features bidirectionally for continuous attention state discrimination. Evaluated under challenging conditions (e.g., illumination changes and occlusion), BV-DHPE achieved a 44.7% reduction in head pose Mean Absolute Error (MAE) compared with monocular vision methods. Deployed on an NVIDIA Jetson Orin, RT-DASR achieved 90.4% attention recognition accuracy with an average latency of 21.5 ms. Tests in real-world driving scenarios confirm that the proposed method provides a high-precision, low-latency attention state recognition solution for enhancing the safety of mining vehicle drivers. RT-DASR can be integrated into advanced driver assistance systems to enable proactive accident prevention.