Abstract
As mental health problems among university students become increasingly prominent, rapid and objective identification of psychological states has become a research focus. This study proposes a mental health assessment model for college students that integrates facial expression recognition with deep learning, combining dynamic and static facial expression information to improve recognition accuracy and efficiency. The model adopts a dual-branch structure: a hierarchical sliding-window Transformer takes optical-flow maps as input to extract dynamic features, while a MobileViT network, which fuses convolution and attention, extracts static peak-frame features. A cross-attention module enables feature interaction and fusion across the two branches. Experimental results show that the proposed model achieves an accuracy of 0.945, an F1 score of 0.934, and an AUC of 0.963. Meanwhile, the model's average response time is only 2.2 s and its training time is 2.6 s, offering good computational efficiency while maintaining accuracy. The proposed method shows clear advantages in abnormal-state recognition accuracy, training stability, and generalization ability. These results indicate that the fusion model can serve as an effective supplement to university mental health early-warning systems, providing intelligent technical support for emotion-recognition-driven psychological interventions.
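The cross-branch fusion described above can be illustrated with a minimal sketch. The code below is not the paper's implementation: it uses identity projections instead of learned query/key/value weights, and the token counts and dimensions are arbitrary assumptions. It only shows the core idea of scaled dot-product cross-attention, where each branch's features attend to the other branch's features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    # Scaled dot-product attention: queries come from one branch,
    # keys/values from the other. Learned projections are omitted
    # here (identity) purely for illustration.
    scores = queries @ context.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ context

rng = np.random.default_rng(0)
# Hypothetical token sets: 8 dynamic-branch tokens (optical-flow /
# Transformer path) and 4 static-branch tokens (MobileViT path).
dyn = rng.standard_normal((8, 64))
sta = rng.standard_normal((4, 64))

# Bidirectional fusion: each branch attends to the other.
fused_dyn = cross_attention(dyn, sta, 64)
fused_sta = cross_attention(sta, dyn, 64)
print(fused_dyn.shape, fused_sta.shape)  # (8, 64) (4, 64)
```

In a trained model the queries, keys, and values would pass through learned linear projections, and the fused features of both branches would be concatenated or pooled before the classification head.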