Abstract
Video emotion recognition (VER), situated at the convergence of affective computing and computer vision, aims to predict the primary emotion that video content evokes in most viewers, with extensive applications in video recommendation, human-computer interaction, and intelligent education. This paper commences with an analysis of the psychological models that constitute the theoretical foundation of VER. It then elaborates on the datasets and evaluation metrics commonly used in VER, reviews VER algorithms by category, and compares and analyzes the experimental results of classic methods on four datasets. Based on this comprehensive analysis, the paper identifies the prevailing challenges in the VER field, including the gap between emotional representations and labels, the scarcity of large-scale, high-quality VER datasets, and the efficient integration of multiple modalities. Furthermore, this study proposes potential research directions to address these challenges, e.g., advanced neural network architectures, efficient multimodal fusion strategies, high-quality emotional representation, and robust active learning strategies.