Abstract
Detecting estrus in sows is essential for improving the reproductive performance of pigs and the production efficiency of pig farms. Traditional estrus detection methods rely on subjective human observation and are prone to error, making it difficult to meet the demands of modern farming. This study developed a multimodal feature fusion method that combines audio and thermal infrared image data to enhance the accuracy and robustness of estrus monitoring in breeding pigs. We designed the Adaptive-PIG-OESTUS-CNN-ViT model, which takes thermal infrared images and audio as inputs. By integrating a Vision Transformer with convolutional neural networks, the model extracts and fuses features from the multimodal data. An adaptive cross-attention mechanism automatically learns feature vectors representing the combined thermal infrared and audio data, which are then fed into an improved DenseNet network to classify estrus and non-estrus states in breeding pigs. The model achieved an accuracy of 98.92%, a recall of 95.83%, and an F1-score of 97.35%, enabling effective non-invasive estrus detection in breeding pigs. Compared with traditional estrus detection methods, this approach more accurately integrates data from different modalities to distinguish the estrus state of breeding pigs, providing an efficient, objective, and non-invasive means of sow estrus detection.
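The abstract outlines a pipeline of ViT/CNN feature extraction, adaptive cross-attention fusion, and DenseNet classification without implementation detail. Below is a minimal PyTorch sketch of how such a fusion stage could be wired; all dimensions, the module names (`CrossAttentionFusion`, `EstrusClassifier`), the gating scheme, and the linear head standing in for the improved DenseNet classifier are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of the described pipeline. All sizes and module choices
# are assumptions for illustration; the paper's Adaptive-PIG-OESTUS-CNN-ViT
# model is defined elsewhere in the full text.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Adaptive cross-attention: thermal tokens attend to audio tokens
    (and vice versa), then a learned gate blends the two pooled views."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.img_to_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aud_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat: torch.Tensor, aud_feat: torch.Tensor) -> torch.Tensor:
        # img_feat, aud_feat: (batch, tokens, dim)
        img_ctx, _ = self.img_to_aud(img_feat, aud_feat, aud_feat)
        aud_ctx, _ = self.aud_to_img(aud_feat, img_feat, img_feat)
        img_vec = img_ctx.mean(dim=1)             # pool tokens -> (batch, dim)
        aud_vec = aud_ctx.mean(dim=1)
        g = self.gate(torch.cat([img_vec, aud_vec], dim=-1))  # adaptive weight
        return g * img_vec + (1 - g) * aud_vec    # fused multimodal vector

class EstrusClassifier(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Stand-ins for the ViT (thermal image) and CNN (audio) branches;
        # in practice these projections would follow full backbones.
        self.vit_proj = nn.Linear(768, dim)   # e.g. ViT patch embeddings -> dim
        self.cnn_proj = nn.Linear(512, dim)   # e.g. CNN feature maps -> dim
        self.fusion = CrossAttentionFusion(dim)
        self.head = nn.Linear(dim, 2)         # estrus vs. non-estrus logits

    def forward(self, vit_tokens: torch.Tensor, cnn_tokens: torch.Tensor) -> torch.Tensor:
        img = self.vit_proj(vit_tokens)       # (B, N_img, dim)
        aud = self.cnn_proj(cnn_tokens)       # (B, N_aud, dim)
        return self.head(self.fusion(img, aud))

# Example with dummy token sequences from the two backbones.
model = EstrusClassifier()
logits = model(torch.randn(4, 196, 768), torch.randn(4, 64, 512))
print(logits.shape)  # torch.Size([4, 2])
```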