Abstract
Recognizing emotions accurately in virtual reality (VR) enables adaptive and personalized experiences across gaming, therapy, and other domains. However, most existing facial emotion recognition models rely on acted expressions collected in controlled settings, which differ substantially from the spontaneous and subtle emotions that arise during real VR experiences. To address this challenge, this study develops and evaluates generalizable emotion recognition models that learn jointly from acted and natural facial expressions in virtual reality. We integrate two complementary datasets collected with the Meta Quest Pro headset, one capturing natural emotional reactions and the other containing acted expressions. We evaluate multiple model architectures, including convolutional and domain-adversarial networks, as well as a mixture-of-experts model that separates natural and acted expressions. Our experiments show that models trained jointly on acted and natural data achieve stronger cross-domain generalization than models trained on either domain alone. In particular, the domain-adversarial and mixture-of-experts configurations yield the highest accuracy on natural and mixed-emotion evaluations. Analysis of facial action units (AUs) reveals that natural and acted emotions rely on partially distinct AU patterns, whereas generalizable models learn a shared representation that integrates salient AUs from both domains. These findings demonstrate that bridging the acted and natural expression domains can enable more accurate and robust emotion recognition systems for VR.