Abstract
Steady-state visual evoked potentials (SSVEPs) have emerged as an efficient means of interaction in brain-computer interfaces (BCIs), enabling efficient, bioinspired language output for individuals with aphasia. Existing transformer-based deep learning methods underutilize the frequency information in SSVEPs and incur redundant computation. To address these limitations, this paper analyzes the signals in both the time and frequency domains and proposes a stacked encoder-decoder (SED) network architecture based on the xLSTM model and a spatial attention mechanism, termed SED-xLSTM, which is the first to apply xLSTM to SSVEP spellers. The model takes a low-channel spectrogram as input and employs the filter bank technique to make full use of harmonic information. By leveraging a gating mechanism, SED-xLSTM effectively extracts and fuses high-dimensional spatial-channel semantic features from SSVEP signals. Experimental results on three public datasets demonstrate that SED-xLSTM achieves superior classification accuracy and information transfer rate, in particular outperforming existing methods under cross-validation across various temporal scales.
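
For readers unfamiliar with the information transfer rate (ITR) metric mentioned above, the following is a minimal sketch of the conventional Wolpaw definition widely used for BCI spellers, assuming N equiprobable targets, classification accuracy P, and a selection time of T seconds per target. The function name, argument names, and example values are illustrative assumptions and are not taken from this paper; the exact ITR computation used in the experiments (for example, whether gaze-shift time is included in T) is not stated in the abstract.

```python
import math


def itr_bits_per_min(n_targets, accuracy, selection_time_s):
    """Wolpaw information transfer rate in bits per minute.

    Hypothetical helper for illustration only:
      n_targets        -- number of equiprobable targets N (N >= 2)
      accuracy         -- classification accuracy P in [0, 1]
      selection_time_s -- time per selection T in seconds (T > 0)
    """
    if n_targets < 2:
        raise ValueError("n_targets must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if selection_time_s <= 0:
        raise ValueError("selection_time_s must be positive")

    p = accuracy
    bits = math.log2(n_targets)
    # The terms P*log2(P) and (1-P)*log2((1-P)/(N-1)) tend to 0 as P -> 0
    # and P -> 1 respectively, so each is skipped at its boundary.
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_targets - 1))
    # Convert bits per selection to bits per minute.
    return bits * 60.0 / selection_time_s


# Example: 4 targets, perfect accuracy, 2 s per selection.
print(itr_bits_per_min(4, 1.0, 2.0))  # 60.0
```

Under this definition, ITR rises with accuracy and falls with longer selection times, and it equals zero at chance-level accuracy (P = 1/N).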