Abstract
Recognizing human physical activity, especially in sports such as boxing, is an intricate problem that has been addressed mainly by traditional video-based models that ignore psychological dynamics. Yet mental states such as anxiety, confidence, and focus have been shown to strongly influence performance, a signal that remains underexploited in current deep learning frameworks. This study proposes a multimodal deep learning framework that combines psychological profiling with video-based boxing action recognition. The approach is designed to overcome a shortcoming of existing visual analysis models, which struggle to disambiguate mechanically similar actions whose execution differs only in psychological and situational context. The proposed framework combines a 3D-ResNet for spatiotemporal feature extraction from boxing videos with a BERT-based encoder for athlete psychological profiles, and the resulting representations are fused at the feature level for classification. Experiments were conducted on the HMDB51-Boxing subset and the newly constructed PsyBox-20 dataset, which links psychological states to action instances through standardized self-report scales. Results show that the multimodal model achieves an accuracy of 91.2% and an F1-score of 90.9%, outperforming video-only and psychology-only baselines as well as several state-of-the-art unimodal methods. Further analysis shows that psychological features are especially valuable for distinguishing visually similar actions, e.g., the jab and the hook, where context and cognitive state play a key role in execution. The current framework does not yet support real-time deployment; extending it in this direction is left for future work. Nevertheless, the results validate the hypothesis that psychological profiling improves recognition accuracy and offers useful insights for AI-driven sports analytics and coaching practice.
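The feature-level fusion described above can be sketched in a few lines. The snippet below is a minimal, illustrative stand-in: it assumes a 512-d spatiotemporal embedding (as a 3D-ResNet backbone might produce) and a 768-d text embedding (as a BERT-base encoder might produce), concatenates them, and applies a toy linear classification head. The dimensions, action labels, and random weights are hypothetical; the actual backbones are not run here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical modality embeddings (stand-ins for real backbone outputs):
# 512-d spatiotemporal feature from a 3D-ResNet video encoder,
# 768-d psychological-profile feature from a BERT-style text encoder.
video_feat = rng.standard_normal(512)
psych_feat = rng.standard_normal(768)

# Feature-level fusion: concatenate the two modality embeddings.
fused = np.concatenate([video_feat, psych_feat])  # shape (1280,)

# Illustrative linear classification head over example boxing actions.
actions = ["jab", "hook", "uppercut", "cross"]
W = rng.standard_normal((len(actions), fused.size)) * 0.01
logits = W @ fused

# Softmax over the action logits.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted = actions[int(np.argmax(probs))]
print(fused.shape, predicted)
```

In practice the fused vector would feed a trainable classifier (e.g., an MLP) optimized jointly with, or on top of, the two encoders; concatenation is the simplest of several possible fusion strategies.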