Spatiotemporal multimodal emotion recognition using temporal video sequences and pose features for child emotion classification



Abstract

Identifying children's emotional cues has recently received considerable attention in developmental psychology and affective computing. In this study, a novel Spatio-Temporal Multimodal Emotion Recognition Network (ST-MERN) for child emotion classification is proposed. The study uses dense feature embeddings from the EmoReact dataset together with temporal video sequences. The proposed method processes 115 consecutive frames per visual-signal instance, extracting features such as rotational-translational vectors, facial keypoints, and pose predictions. Per-frame detection is stable, with a mean confidence of 0.967, ensuring high detection fidelity. To track subtle emotional changes, the method captures dynamic cues such as scale variation and frame-to-frame changes in head pose (r(x), r(y), r(z), t(x), t(y)). Latent features (p24–p33) provide a richer representation of emotional states. By combining these features, the model preserves spatiotemporal consistency and improves emotion recognition. The system classifies children's emotional states into nine categories: curiosity, uncertainty, excitement, happiness, surprise, disgust, fear, frustration, and valence. Preliminary results show that the system effectively captures expressive nuances, with stable pose data and low feature variability across sequences. The BiLSTM-based architecture surpassed LSTM and TCN baselines in generalization, achieving a validation accuracy of 93.6% and a test accuracy of 94.3%, and demonstrated strong classification capacity across emotional states with an F1-score of 0.92. The TCN model, though slightly behind the BiLSTM, recorded a competitive test accuracy of 91.7% with fast inference times of ~0.8 s per clip, making it well suited to real-time deployment.
With an F1-score of 0.89 and a test accuracy of 90.2%, the LSTM model also performed robustly; it trained faster than the BiLSTM and TCN, although its accuracy was slightly lower. By providing strong, interpretable classification that is sensitive to the dynamic nature of children's emotional displays, this technique improves emotion detection in children. Our work lays the foundation for socially sensitive systems, therapeutic interventions, and affect-aware educational materials.
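The pipeline described above, per-frame pose and latent features fed through a bidirectional LSTM over 115-frame sequences and classified into nine emotion categories, can be sketched as a minimal forward pass. This is an illustrative sketch only: the feature dimension, hidden size, and random weights are assumptions, not the paper's actual parameters or trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(x, params, reverse=False):
    """Run one LSTM direction over x of shape (T, D); return the final hidden state."""
    W, U, b = params  # W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,)
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    steps = reversed(range(len(x))) if reverse else range(len(x))
    for t in steps:
        z = W @ x[t] + U @ h + b
        i, f, g, o = np.split(z, 4)            # input, forget, cell, output gates
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
    return h

def bilstm_classify(x, fwd, bwd, W_out, b_out):
    """Concatenate forward/backward final states, project to class logits, softmax."""
    h = np.concatenate([lstm_pass(x, fwd), lstm_pass(x, bwd, reverse=True)])
    logits = W_out @ h + b_out
    e = np.exp(logits - logits.max())
    return e / e.sum()  # probability distribution over the 9 emotion categories

rng = np.random.default_rng(0)
# T = 115 frames per clip (from the paper); D, H are assumed values for illustration.
T, D, H, C = 115, 15, 32, 9
make = lambda *s: rng.standard_normal(s) * 0.1
fwd = (make(4 * H, D), make(4 * H, H), make(4 * H))
bwd = (make(4 * H, D), make(4 * H, H), make(4 * H))
clip = make(T, D)  # stand-in for head-pose + latent features per frame
probs = bilstm_classify(clip, fwd, bwd, make(C, 2 * H), make(C))
print(probs.shape, round(float(probs.sum()), 6))
```

In a trained system the weights would come from supervised learning on EmoReact labels; here random weights simply demonstrate the shape of the computation that a BiLSTM over pose-feature sequences performs.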
