Abstract
Intelligent bird species recognition is vital for biodiversity monitoring and ecological conservation. This study addresses the decline in recognition accuracy caused by occlusion and imaging noise in complex natural environments. Focusing on ten representative bird species from the Dongting Lake Wetland, we propose MSFN-YOLO11, an improved YOLOv11n-based model that incorporates multi-scale feature fusion. We first select YOLOv11n as the baseline after comparing it with YOLOv8n, the most stable earlier version, and then enhance its backbone with an MSFN module. This module strengthens global and local feature extraction through parallel dilated convolutions and a channel attention mechanism. Experiments are conducted on a self-built dataset of 4540 images covering the ten species, with 6824 samples in total. To simulate real-world conditions, 25% of the samples are augmented with random occlusion, Gaussian noise (σ = 0.2, 0.3, 0.4), and Poisson noise. The improved model achieves a mAP@50 of 96.4% and a mAP@50-95 of 83.2% on the test set. Although its mAP@50 is only 0.3% higher than that of the original YOLOv11, the model reduces training time by 18%. It also proves practical for dynamic video processing, attaining an average accuracy of 63.1% at 1920 × 1080 and 72 fps on an NVIDIA Tesla V100 SXM2 32 GB GPU. The proposed model provides robust technical support for real-time bird monitoring in wetlands and strengthens conservation efforts for endangered species.
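The augmentation step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline. It assumes images are float arrays scaled to [0, 1], that σ is expressed on that scale, and that each selected image receives exactly one of the three perturbations. The function names, the `peak` photon-count scale, and the `max_frac` occlusion limit are hypothetical choices made for illustration.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma, then clip to [0, 1]."""
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_poisson_noise(img, rng, peak=255.0):
    """Treat intensities as photon counts up to `peak` (assumed), resample, rescale."""
    noisy = rng.poisson(img * peak) / peak
    return np.clip(noisy, 0.0, 1.0)

def random_occlusion(img, rng, max_frac=0.3):
    """Black out one random rectangle spanning at most max_frac of each side (assumed)."""
    h, w = img.shape[:2]
    oh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    ow = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    top = int(rng.integers(0, h - oh + 1))
    left = int(rng.integers(0, w - ow + 1))
    out = img.copy()
    out[top:top + oh, left:left + ow] = 0.0
    return out

def augment_subset(images, fraction=0.25, sigmas=(0.2, 0.3, 0.4), seed=0):
    """Perturb a random `fraction` of the images; return the new list and chosen indices."""
    rng = np.random.default_rng(seed)
    n_aug = int(round(fraction * len(images)))
    chosen = rng.choice(len(images), size=n_aug, replace=False)
    out = [img.copy() for img in images]
    for i in chosen:
        kind = int(rng.integers(0, 3))  # assumption: one perturbation type per image
        if kind == 0:
            out[i] = random_occlusion(out[i], rng)
        elif kind == 1:
            out[i] = add_gaussian_noise(out[i], float(rng.choice(sigmas)), rng)
        else:
            out[i] = add_poisson_noise(out[i], rng)
    return out, sorted(int(i) for i in chosen)

# Usage: eight synthetic mid-gray images stand in for real photographs.
images = [np.full((32, 32, 3), 0.5) for _ in range(8)]
augmented, chosen = augment_subset(images)
```

None of the three perturbations moves objects within the frame, so under these assumptions the existing bounding-box annotations can be reused unchanged for the augmented copies.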