Abstract
Accurately capturing the evolving temporal correlations between unstructured textual features and multimodal parameter data is pivotal for robust equipment health assessment. Conventional multimodal fusion methods typically fail to capture temporal variations across modalities: textual records describe stage-specific symptoms, such as slight abnormal noise in the early fault stage and severe vibration in the late degradation stage, while parameter data contains latent temporal patterns, such as wavelet energy accumulation in specific frequency bands during the fault precursor period. Attention mechanisms offer a promising way to address this issue. This study proposes a dynamic attention-driven multimodal feature fusion method for equipment health status assessment. The method couples a hybrid time-frequency encoding framework, built on wavelet packet decomposition (WPD), the fast Fourier transform (FFT), and the discrete Fourier transform (DFT), with textual feature extraction based on bidirectional encoder representations from transformers (BERT). On the Case Western Reserve University (CWRU) bearing fault dataset, the proposed method improves classification accuracy by 7.2% over conventional symmetric attention models and achieves an AUC-ROC of 0.951. By capturing the evolving correlations across modalities, the method enables more accurate and interpretable health assessment, providing practical technical support for real-time equipment monitoring and preventive maintenance.
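As a minimal illustration of the fusion idea summarized above, the PyTorch sketch below shows how a pooled text embedding can act as the query in a cross-attention layer over per-band time-frequency features; this is not the authors' implementation, and all layer sizes, variable names, and the two-class head are illustrative assumptions.

```python
# Minimal sketch of attention-driven fusion of time-frequency features
# (e.g., WPD/FFT band energies) with a BERT-style text embedding.
# Dimensions and names here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DynamicAttentionFusion(nn.Module):
    def __init__(self, tf_dim: int, text_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.tf_proj = nn.Linear(tf_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Cross-attention: the text embedding queries the time-frequency tokens,
        # so attention weights can shift as the degradation stage evolves.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                                batch_first=True)
        self.classifier = nn.Linear(hidden_dim * 2, 2)  # e.g., healthy vs. faulty

    def forward(self, tf_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # tf_feats: (batch, n_bands, tf_dim) per-band time-frequency features
        # text_emb: (batch, text_dim) pooled BERT sentence embedding
        tf_tokens = self.tf_proj(tf_feats)                  # (B, n_bands, H)
        text_query = self.text_proj(text_emb).unsqueeze(1)  # (B, 1, H)
        fused, _ = self.cross_attn(text_query, tf_tokens, tf_tokens)
        # Concatenate the attended time-frequency summary with the text query.
        joint = torch.cat([fused.squeeze(1), text_query.squeeze(1)], dim=-1)
        return self.classifier(joint)

# Usage with random stand-in features; real inputs would come from WPD/FFT
# band energies and a BERT encoder over maintenance logs.
model = DynamicAttentionFusion(tf_dim=16, text_dim=768)
logits = model(torch.randn(4, 8, 16), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```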