Abstract
Facial Expression Recognition (FER) is highly valuable in practical scenarios such as intelligent human-computer interaction. However, conventional FER methods often struggle to balance performance and efficiency in resource-constrained environments. Specifically, CNN-based methods cannot effectively capture global dependencies due to their limited receptive fields, while Transformer-based methods suffer from the quadratic computational complexity of self-attention. To address these challenges, we propose a lightweight and efficient framework termed FERMam. The proposed model integrates dual-source and multi-scale features through an image fusion encoder, a facial landmark branch, and a pyramid fusion structure. The image fusion encoder combines a CNN with Mamba-based selective state-space modeling to capture local structural information and global dependencies, respectively. The facial landmark branch enhances geometry-aware feature representation, and the pyramid fusion structure incorporates an Adaptive State-space Feature Refinement (ASFR) module to enable cross-source and cross-scale interactions with minimal computational overhead. Extensive experiments are conducted on three benchmark datasets: RAF-DB, AffectNet, and FERPlus. The results show that FERMam uses 62.81M fewer parameters and 9.73G fewer floating point operations (FLOPs) than POSTER, and 16.7M fewer parameters and 2.43G fewer FLOPs than POSTER++, while achieving nearly the same accuracy on all three datasets. These results indicate that FERMam is well suited for deployment in resource-constrained environments. The code is available at https://github.com/jxcsglr/FERMam.