Abstract
Wind turbines operate under harsh conditions, which heightens the risk of failure in their rotating bearings. Fault diagnosis from acoustic or vibration signals is feasible, but single-modal methods are highly vulnerable to environmental noise and system uncertainty, which reduces diagnostic accuracy. Existing multi-modal approaches also struggle with noise interference and do not explore causal features, which limits fusion performance and generalization. To address these issues, this paper proposes CAVF-Net, a novel framework that integrates bidirectional cross-attention (BCA) and causal inference (CI). It uses BCA to enhance the Mel-frequency cepstral coefficient (MFCC) features of the acoustic signal and the short-time Fourier transform (STFT) features of the vibration signal, and employs CI to derive adaptive fusion weights, preserving causal relationships and achieving robust cross-modal integration. The fused features are then classified for fault diagnosis under real-world conditions. Experiments show that CAVF-Net attains 99.2% accuracy in few iterations on clean data and maintains 95.42% accuracy in high-entropy multi-noise environments, outperforming the single-modal acoustic and vibration methods by 16.32% and 8.86%, respectively, while significantly reducing information uncertainty in downstream classification.
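To make the fusion idea concrete, the following is a minimal sketch of bidirectional cross-attention followed by a weighted combination. It is not the CAVF-Net implementation. It assumes both modalities have already been mapped to the same feature dimension, omits the learned query/key/value projections a real model would use, and replaces the CI-derived fusion weight with a fixed, hypothetical `acoustic_weight` argument. All function names and the toy feature values are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Let each query frame attend over all context frames.

    Uses scaled dot-product attention without learned projections
    (an assumption made to keep the sketch short).
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        mixed = [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        # Residual connection: keep the original frame and add the attended context.
        attended.append([qi + mi for qi, mi in zip(q, mixed)])
    return attended


def mean_pool(frames):
    """Average a sequence of frames into a single feature vector."""
    n = len(frames)
    return [sum(f[j] for f in frames) / n for j in range(len(frames[0]))]


def fuse(acoustic, vibration, acoustic_weight):
    """Apply cross-attention in both directions, then combine the pooled features.

    acoustic_weight is a hypothetical stand-in for the adaptive weight
    that the paper derives through causal inference.
    """
    acoustic_enh = cross_attend(acoustic, vibration)   # acoustic queries vibration
    vibration_enh = cross_attend(vibration, acoustic)  # vibration queries acoustic
    a = mean_pool(acoustic_enh)
    v = mean_pool(vibration_enh)
    return [acoustic_weight * ai + (1.0 - acoustic_weight) * vi for ai, vi in zip(a, v)]


# Toy inputs: 3 acoustic frames and 2 vibration frames, each 4-dimensional.
acoustic = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.5, 0.4, 0.3, 0.2]]
vibration = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
fused = fuse(acoustic, vibration, acoustic_weight=0.6)
```

The two modalities may have different numbers of frames, because each direction of attention produces one output frame per query frame. The fused vector would then be passed to a classifier.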