Abstract
In research on multimodal emotion recognition, the validity of the source data, the interaction between modalities, and the dependence among different tasks are three elements essential to task completion. However, few models account for all of these factors within a single consistent framework. To fill this gap, we propose a multimodal data-enhanced interaction graph model (IGEM). For speech data enhancement, Distil-DCCRN is employed to help the model learn more robust acoustic features. For text data enhancement, unsupervised data is introduced, increasing the diversity of the dataset while preserving the original meaning. For video data enhancement, a small densely connected network combines image and time-series characteristics, enabling a more complex and diversified enhancement strategy. For deep fusion of data from different modalities, we introduce a cross-modal data encoding interaction graph, in which the data of each modality are treated as nodes connected through inter-modal interaction relationships. Finally, accurate emotion classification is performed on the deeply fused representation produced by the cross-modal data encoding interaction graph. Experiments on the IEMOCAP and MELD benchmark datasets achieve accuracies of 72% and 47.5%, respectively, demonstrating the superiority of the proposed model.