Abstract
To address the limitations of single-modal approaches in bearing fault diagnosis under complex operating conditions, this study proposes SCBM-Net, a novel deep learning model based on a dual-channel multimodal fusion architecture. The model combines Continuous Wavelet Transform (CWT) and Variational Mode Decomposition (VMD) to extract complementary features from time-frequency images and temporal signals, respectively. Specifically, the first channel employs a Swin Transformer to model both local and global representations of CWT-based images through a hierarchical window-based attention mechanism. The second channel adopts a CNN-BiGRU-Attention network to dynamically capture temporal dependencies from the intrinsic mode functions produced by VMD. Features from both channels are deeply fused using a Multimodal Compact Bilinear pooling (MCB) module, enhancing fault feature representation and overall model robustness. Experimental results on the CWRU dataset show that SCBM-Net achieves an accuracy of 99.83% under clean conditions. Even in a few-shot setting with only 60 training samples per class, the model maintains a recognition accuracy of 98.64%, demonstrating strong generalization in low-data scenarios. On an imbalanced dataset, SCBM-Net exhibits stable performance across both majority and minority classes, achieving an average accuracy of 97.33%. In a generalization test on the SEU bearing dataset, the model achieves an accuracy of 98.33%, further validating its cross-platform and cross-domain robustness and transferability. Moreover, under severe noise interference at -10 dB, SCBM-Net retains a fault recognition accuracy of 80.67%, outperforming comparable models and demonstrating excellent noise robustness and practical applicability.
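The MCB fusion step named above is typically realized with the Tensor Sketch trick: each channel's feature vector is count-sketched into a compact space, and the two sketches are combined by circular convolution, which approximates a bilinear (outer-product) interaction. A minimal, dependency-free sketch of this idea is shown below; the sketch dimension, vector sizes, and random hashing here are illustrative assumptions, not the paper's actual implementation.

```python
import random

random.seed(0)  # fixed seed so the random hash functions are reproducible

def count_sketch(x, h, s, d):
    """Project vector x into a d-dim sketch using hash indices h and signs s."""
    y = [0.0] * d
    for xi, hi, si in zip(x, h, s):
        y[hi] += si * xi
    return y

def circ_conv(a, b):
    """Circular convolution of two equal-length vectors.

    This equals an element-wise product in the FFT domain, which is how
    MCB is usually implemented for efficiency; the O(d^2) form is used
    here only to keep the example self-contained.
    """
    d = len(a)
    return [sum(a[j] * b[(k - j) % d] for j in range(d)) for k in range(d)]

def mcb_fuse(x1, x2, d=16):
    """Fuse two feature vectors via compact bilinear pooling (Tensor Sketch).

    In a real model the hashes h/s are sampled once and reused for every
    input; they are drawn per call here purely for brevity.
    """
    h1 = [random.randrange(d) for _ in x1]
    s1 = [random.choice([-1, 1]) for _ in x1]
    h2 = [random.randrange(d) for _ in x2]
    s2 = [random.choice([-1, 1]) for _ in x2]
    return circ_conv(count_sketch(x1, h1, s1, d),
                     count_sketch(x2, h2, s2, d))

# Hypothetical per-channel features (e.g. Swin and CNN-BiGRU outputs)
fused = mcb_fuse([1.0] * 8, [0.5] * 8, d=16)
```

The appeal of this fusion over simple concatenation is that it captures multiplicative interactions between every pair of features from the two channels while keeping the fused dimension fixed at `d`.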