Abstract
This paper addresses two difficulties in groove welding: the fusion width at the bottom of the weld cannot be measured directly, and the penetration state is hard to monitor in real time. It investigates online penetration state monitoring based on multimodal signals, namely sound and images, acquired during the welding process. The proposed multimodal network, SIMNet, first applies the short-time Fourier transform (STFT) to convert the raw sound signal into the time-frequency domain for preliminary feature extraction. Second, an attention-based visual feature extractor is used to extract image features. In addition, a cosine similarity loss function is introduced to align the features of the two modalities in the semantic space before fusion. Finally, a cross-attention mechanism performs the interaction and fusion of the features. Experimental results show that SIMNet outperforms other mainstream algorithms, achieving the best mean squared error (MSE) of 0.1141 mm. Furthermore, inference with multimodal input runs at 60 frames per second (FPS), enabling quantitative, real-time, intelligent monitoring of the penetration state through multimodal fusion.
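As an illustration only, the three generic building blocks named above (STFT of the sound signal, a cosine similarity alignment loss, and cross-attention fusion) can be sketched in plain NumPy. This is a minimal sketch, not the SIMNet implementation: the function names, the frame length and hop size, the single-head attention without learned projections, and the "1 minus mean cosine similarity" form of the loss are all assumptions made for the example.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed STFT (assumed frame/hop sizes)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One row per time frame, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

def cosine_alignment_loss(a, b, eps=1e-8):
    """Assumed loss form: 1 - mean cosine similarity between paired feature rows."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return float(1.0 - np.mean(np.sum(a_n * b_n, axis=1)))

def cross_attention(query, key, value):
    """Single-head scaled dot-product attention; queries from one modality,
    keys and values from the other. Learned projections are omitted."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ value

# Hypothetical usage with random stand-ins for sound and image features.
rng = np.random.default_rng(0)
sound_feat = rng.normal(size=(4, 8))   # e.g. 4 sound tokens, 8-dim features
image_feat = rng.normal(size=(6, 8))   # e.g. 6 image tokens, 8-dim features
fused = cross_attention(sound_feat, image_feat, image_feat)
```

In this sketch the sound features act as queries over the image features; a real model would typically add learned projections, multiple heads, and attention in both directions, none of which the abstract specifies.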