Abstract
Emotion recognition in conversation (ERC) is an important research direction in human-computer interaction (HCI): it recognizes emotions by analyzing utterance signals to enhance user experience and plays an important role in several domains. However, existing ERC research mainly constructs graph networks that directly model interactions on multimodal fused features, which cannot adequately capture the complex dialog dependencies arising from time, speaker, modality, and other factors. In addition, existing multi-task learning frameworks for ERC do not systematically investigate how and where gender information should be injected into the model to optimize ERC performance. To address these problems, this paper proposes Hierarchical Graph Fusion for ERC with a Mid-Late Gender-aware Strategy (HGF-MiLaG). HGF-MiLaG uses a hierarchical fusion graph to adequately capture intra-modal and inter-modal speaker and temporal dependencies. In addition, HGF-MiLaG explores how the location of gender information injection affects ERC performance, and ultimately employs a Mid-Late multilevel gender-aware strategy that allows the hierarchical graph network to determine the proportions of emotion and gender information in the classifier. Empirical results on two public multimodal datasets (IEMOCAP and MELD) demonstrate that HGF-MiLaG outperforms existing methods.