Abstract
Emotional and workload-related states unfold dynamically during immersive virtual reality (VR) experiences, yet reliable physiological modeling in such environments remains challenging. We investigated whether multi-channel facial electromyography (fEMG), combined with spatio-temporal deep learning, can (i) accurately classify calibrated facial expressions across participants and (ii) transfer to spontaneous, task-elicited behavior in immersive VR. Twelve adults completed a calibration phase involving four intentional expressions (smile, frown, raised eyebrow, neutral), followed by VR scenes designed to elicit emotional, cognitive, physical, and dual-task demands. After participant-level physiological normalization, a single shared Convolutional Neural Network-Temporal Convolutional Network (CNN-TCN) model was trained and evaluated using leave-one-participant-out (LOPO) validation. The model achieved strong cross-participant performance (Macro-F1 = 0.88 ± 0.13; ROC-AUC = 0.95 ± 0.06). When applied to unlabeled fEMG recordings of spontaneous, task-elicited behavior in VR, the trained model produced continuous expression-class predictions. Static and temporal features derived from these predictions showed scene-dependent modulation and associations that survived False Discovery Rate (FDR) correction, primarily with perceived physical demand (NASA-TLX). The observed muscle activation patterns were physiologically plausible and consistent with Facial Action Coding System (FACS)-based interpretations of the underlying muscle activity. These findings demonstrate that end-to-end spatio-temporal modeling of raw fEMG enables facial expression sensing in immersive VR with a single shared model after physiological normalization. The proposed framework bridges calibrated expression learning and spontaneous task-elicited behavior, supporting privacy-preserving, continuous, and physiologically grounded monitoring in human-centered VR applications.
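To make the pipeline named in the abstract concrete, the sketch below illustrates, in PyTorch, the three ingredients it describes: participant-level normalization of multi-channel fEMG, a compact CNN-TCN classifier, and a LOPO split. This is a minimal illustration, not the authors' implementation; the channel count, window length, layer sizes, and dilation schedule are all assumptions.

import numpy as np
import torch
import torch.nn as nn

N_CHANNELS = 8   # assumed number of fEMG channels
WIN_LEN = 500    # assumed samples per analysis window
N_CLASSES = 4    # smile, frown, raised eyebrow, neutral

def normalize_per_participant(windows: np.ndarray) -> np.ndarray:
    """Z-score each channel using statistics pooled over one participant's data.

    windows: (n_windows, n_channels, win_len) array from a single participant.
    """
    mu = windows.mean(axis=(0, 2), keepdims=True)
    sd = windows.std(axis=(0, 2), keepdims=True) + 1e-8
    return (windows - mu) / sd

class CNNTCN(nn.Module):
    """Spatial 1D-conv front end followed by a stack of dilated temporal convolutions."""

    def __init__(self, n_channels: int = N_CHANNELS, n_classes: int = N_CLASSES):
        super().__init__()
        self.spatial = nn.Sequential(  # mixes channels with short kernels
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=7, padding=3),
            nn.ReLU(),
        )
        blocks = []  # TCN: exponentially increasing dilation widens the receptive field
        for d in (1, 2, 4, 8):
            blocks += [nn.Conv1d(32, 32, kernel_size=3, padding=d, dilation=d),
                       nn.ReLU()]
        self.temporal = nn.Sequential(*blocks)
        self.head = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.temporal(self.spatial(x))  # (batch, 32, win_len)
        return self.head(h.mean(dim=2))     # global average pooling -> class logits

def lopo_splits(participant_ids: np.ndarray):
    """Yield (train_mask, test_mask) pairs, holding out each participant in turn."""
    for pid in np.unique(participant_ids):
        yield participant_ids != pid, participant_ids == pid

Because one shared model is evaluated under LOPO, every test fold contains only data from a participant the model never saw during training, which is what makes the reported Macro-F1 and ROC-AUC cross-participant estimates.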