Abstract
Artificial intelligence-driven educational systems have largely prioritised cognitive adaptation, often neglecting the critical role of learners' emotional states in shaping engagement and learning outcomes. To address this limitation, this study proposes a multimodal, emotion-aware deep learning framework that embeds emotional intelligence in intelligent learning environments. The framework jointly analyses facial expressions, speech characteristics, and textual responses to infer learners' emotional states, and models the interdependencies among these modalities through a graph-based fusion mechanism. The approach is evaluated on the benchmark emotion datasets AffectNet and IEMOCAP to assess its ability to recognise emotional patterns and support adaptive feedback during learning interactions. Experimental results show that incorporating emotional awareness yields substantial improvements in learner engagement, emotional regulation, and task persistence compared with conventional cognition-focused systems. The framework achieves consistently high emotion recognition performance, particularly for positive and neutral affective states, and generalises robustly across emotion categories. User study outcomes further suggest that learners perceive the system as more supportive and responsive because of its emotional adaptability. Beyond performance evaluation, the study discusses key ethical considerations for emotion-aware educational technologies, including data privacy, informed consent, and responsible deployment. Overall, the findings underscore the potential of multimodal emotional intelligence to enable more empathetic, adaptive, and effective artificial intelligence-based educational systems.