Abstract
Background/Objectives: Emotion classification based on eye-movement features has become a widely adopted approach due to the simplicity of data acquisition and the strong association between ocular responses and emotional states. However, existing emotion recognition methods still face several challenges, including the relatively weak correlation between eye-movement features and emotion labels and the insufficient prominence of key features. Methods: To address the above limitations, this study proposes an improved CNN-Transformer combined with an enhanced deep canonical correlation analysis network (ICTD). The proposed method first preprocesses and reconstructs raw eye-movement signals to extract informative features. Subsequently, convolutional neural networks (CNNs) and Transformer architectures are employed to capture local and global features, respectively. In addition, an incremental feature feedforward network is incorporated to enhance the Transformer, enabling the model to assign higher importance to salient features. Finally, the extracted representations are processed by deep canonical correlation analysis based on cosine similarity to generate the classification outcomes. Results: Experiments conducted on the SEED-IV, SEED-V, and eSEE-d datasets demonstrate that the proposed ICTD framework consistently outperforms baseline approaches and attains the best classification results: (1) on the eSEE-d dataset, the three-class arousal and valence classification accuracies reach 81.8% and 85.2%, respectively; (2) on the SEED-IV dataset, the four-class emotion classification accuracy reaches 91.2%; (3) on the SEED-V dataset, the five-class emotion classification accuracy reaches 85.1%. Conclusions: The proposed ICTD framework effectively improves feature representation and classification performance, showing strong potential for practical emotion recognition and physiological signal analysis.
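The abstract refers to deep canonical correlation analysis based on cosine similarity. As a minimal illustrative sketch (not the authors' implementation: the function names and the exact loss form are assumptions), the alignment between two paired feature views can be scored by row-wise cosine similarity, and its negative mean used as a correlation-maximizing objective:

```python
import numpy as np


def cosine_similarity_rows(X, Y, eps=1e-8):
    """Row-wise cosine similarity between two paired feature views.

    X, Y: (n_samples, dim) arrays of matched representations.
    Returns a length-n_samples vector of per-sample similarities in [-1, 1].
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Yn = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + eps)
    return np.sum(Xn * Yn, axis=1)


def correlation_loss(X, Y):
    """Negative mean cosine similarity (hypothetical objective):
    minimizing it drives the two views toward aligned, i.e. maximally
    correlated, representations."""
    return -float(np.mean(cosine_similarity_rows(X, Y)))
```

For example, two identical views yield a loss near -1 (perfect alignment), while orthogonal views yield a loss near 0; a training loop would backpropagate such a loss through both feature extractors.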