Abstract
Micro-expression recognition (MER) is challenged by brief duration, low intensity, and heterogeneous spatial-frequency patterns. This study introduces a novel MER architecture that reduces computational cost by fine-tuning a large feature-extraction model with LoRA, while integrating frequency-domain transformation and graph-based temporal modeling to minimize preprocessing requirements. A Spatial Frequency Adaptive (SFA) module decomposes high- and low-frequency information with dynamic weighting to enhance sensitivity to subtle facial texture variations. A Dynamic Graph Attention Temporal (DGAT) network models video frames as a graph, combining Graph Attention Networks and LSTM with frequency-guided attention for temporal feature fusion. Experiments on the SAMM, CASME II, and SMIC datasets demonstrate superior performance over existing methods. On the SAMM 5-class setting, the proposed approach achieves an unweighted F1 score (UF1) of 81.16% and an unweighted average recall (UAR) of 85.37%, exceeding the next best method by 0.96 and 2.27 percentage points, respectively.
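As a rough illustration of the frequency decomposition underlying the SFA module, the sketch below splits an image into low- and high-frequency components with a radial FFT mask and recombines them with scalar weights. The `cutoff` threshold and the fixed fusion weights are assumptions for illustration only; in the described module the weighting is predicted dynamically per sample.

```python
import numpy as np

def spatial_frequency_decompose(img, cutoff=0.1):
    """Split a grayscale image into low- and high-frequency parts via a 2-D FFT.
    `cutoff` is the fraction of the spectrum radius treated as "low frequency"."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_mask = radius <= cutoff * min(h, w)
    # Inverse-transform each half of the spectrum back to the spatial domain.
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)))
    high = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * (~low_mask))))
    return low, high

def adaptive_fuse(low, high, w_low, w_high):
    """Weighted recombination; the real module would learn these weights."""
    return w_low * low + w_high * high

img = np.random.default_rng(0).standard_normal((32, 32))
low, high = spatial_frequency_decompose(img)
fused = adaptive_fuse(low, high, 0.7, 0.3)
# The two bands partition the spectrum, so their sum reconstructs the input.
assert np.allclose(low + high, img)
```

Because the low and high masks partition the spectrum exactly, `low + high` recovers the original image up to floating-point error; the dynamic weighting then lets the model emphasize whichever band carries the subtle texture cues.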