Abstract
Multimodal physiological emotion recognition is challenged by modality heterogeneity and inter-subject variability, both of which hinder model generalization and robustness. To address these issues, this paper proposes a new framework, the Cross-modal Transformer with Enhanced Learning-Classifying Adversarial Network (CT-ELCAN). The core idea of CT-ELCAN is to shift the focus from conventional signal fusion to the alignment of modality- and subject-invariant emotional representations. To this end, a cross-modal Transformer is combined with ELCAN, an adversarially trained feature alignment module, so that the learned representations remain invariant across both modalities and subjects. Experimental results on the public DEAP and WESAD datasets show that CT-ELCAN improves accuracy by approximately 7% and 5%, respectively, over state-of-the-art models, while also exhibiting greater robustness.