Abstract
Background/Objectives: Multimodal neuroimaging, particularly the integration of electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS), has emerged as a key methodology for investigating brain function and classifying neural activity. However, the efficient fusion of these two signals remains a formidable challenge due to their significant spatio-temporal heterogeneity. This paper presents BiSTF-Net, which integrates decoupled, bi-directional spatio-temporal fusion mechanisms to improve cognitive task recognition. Methods: In BiSTF-Net, the spatial features of EEG and fNIRS are mutually guided and enhanced through an efficient bi-directional cross-modal guidance (Bi-CMG) module. The temporal latencies of the fNIRS signals are then aligned in a data-driven manner by an adaptive temporal alignment (ATA) module. Finally, the aligned features are deeply fused into a modality-invariant, discriminative representation by a symmetric cross-attention fusion (SCAF) module. Results: Evaluated on mental arithmetic (MA), motor imagery (MI), and word generation (WG) tasks, BiSTF-Net achieves average accuracies of 83.33%, 82.09%, and 84.99%, respectively. Conclusions: BiSTF-Net outperforms existing methods, offers a robust and interpretable solution for multimodal EEG-fNIRS cognitive task classification, and provides a methodological foundation for future extensions to other multimodal data and broader real-world clinical applications.
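To make the fusion step concrete, the following is a minimal PyTorch sketch of a symmetric cross-attention fusion block in the spirit of the SCAF module described above: each modality's features attend to the other's, and the two enhanced streams are pooled and merged into a single shared representation. The class name, layer sizes, two-stream multi-head attention layout, and mean-pooling step are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SymmetricCrossAttentionFusion(nn.Module):
    """Illustrative sketch of a symmetric cross-attention fusion block.

    Assumed design: EEG queries attend to fNIRS keys/values and vice
    versa, followed by residual connections, layer norm, temporal
    pooling, and a linear fusion of the two streams.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.eeg_to_fnirs = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fnirs_to_eeg = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_eeg = nn.LayerNorm(dim)
        self.norm_fnirs = nn.LayerNorm(dim)
        # Merge the two enhanced streams into one modality-invariant vector.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, eeg: torch.Tensor, fnirs: torch.Tensor) -> torch.Tensor:
        # eeg, fnirs: (batch, time steps, dim) feature sequences per modality.
        eeg_enh, _ = self.eeg_to_fnirs(eeg, fnirs, fnirs)   # EEG queries fNIRS
        fnirs_enh, _ = self.fnirs_to_eeg(fnirs, eeg, eeg)   # fNIRS queries EEG
        eeg_enh = self.norm_eeg(eeg + eeg_enh)              # residual + norm
        fnirs_enh = self.norm_fnirs(fnirs + fnirs_enh)
        # Pool over time, concatenate the streams, and fuse.
        pooled = torch.cat([eeg_enh.mean(dim=1), fnirs_enh.mean(dim=1)], dim=-1)
        return self.fuse(pooled)                            # (batch, dim)

# Usage with toy inputs (EEG is sampled faster than fNIRS, hence more steps):
eeg = torch.randn(8, 200, 64)   # (batch, EEG time steps, feature dim)
fnirs = torch.randn(8, 40, 64)  # (batch, fNIRS time steps, feature dim)
fused = SymmetricCrossAttentionFusion(dim=64)(eeg, fnirs)
print(fused.shape)              # torch.Size([8, 64])
```

Note that cross-attention handles the differing sequence lengths of the two modalities directly, since queries from one stream can attend over keys and values of any length from the other; in the full pipeline, the ATA module would align fNIRS latencies before this fusion step.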