Abstract
Monitoring the safety of underground work environments is fundamental to safe industrial production. Detecting whether workers are wearing safety helmets is a critical task in this setting, but it faces several challenges: interference from complex backgrounds, low-light conditions, and low detection accuracy for small targets. Existing methods are limited in multiscale feature fusion, foreground localization accuracy, and dynamic context modeling. Traditional feature pyramids struggle to resolve cross-scale feature conflicts, whereas fixed dilated convolutions lack adaptability in complex scenes. To address these issues, this paper proposes YOLOv11-SRA, a model based on the YOLOv11 architecture that integrates a three-stage optimization strategy. First, the SAConv module dynamically adjusts dilation rates to capture multiscale contextual information, enhancing the robustness of small-target detection. Second, the RCM employs rectangular self-calibrated attention to refine the foreground region, improving the model's boundary localization. Third, the ASFF module fuses multiscale features through adaptive spatial weighting to alleviate feature conflicts. The algorithm was validated on CUMT-HelmeT, a publicly available underground safety helmet dataset, where it achieved a mean average precision at an IoU threshold of 0.50 (mAP50) of 84.2% and a recall of 79.9%, significantly outperforming mainstream models.
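To make the idea of adaptive spatial weighting concrete, the following is a minimal NumPy sketch of ASFF-style fusion. It is illustrative only and is not the implementation used in YOLOv11-SRA. It assumes the feature maps from the different scales have already been resized to a common shape, and it stands in for the learned 1x1 convolutions with hypothetical per-level weight vectors; the function name `asff_fuse` and its parameters are likewise assumptions made for this example.

```python
import numpy as np

def asff_fuse(features, weight_vectors):
    """Fuse same-shaped feature maps with per-pixel softmax weights.

    features: list of arrays, each of shape (C, H, W), assumed to be
        already resized to a common scale.
    weight_vectors: list of arrays, each of shape (C,). These are
        hypothetical stand-ins for learned 1x1-conv kernels that map
        each level to a single-channel logit map.
    """
    # A 1x1 convolution is a per-pixel dot product over channels,
    # so each level yields an (H, W) logit map; stacked: (L, H, W).
    logits = np.stack([np.tensordot(w, f, axes=([0], [0]))
                       for f, w in zip(features, weight_vectors)])
    # Softmax across levels, so the weights at each pixel sum to 1.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Weighted sum: (L, 1, H, W) weights broadcast over (L, C, H, W).
    fused = (weights[:, None, :, :] * np.stack(features)).sum(axis=0)
    return fused, weights

# Usage: three levels with 4 channels on a 5x6 grid (random placeholders).
rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 5, 6)) for _ in range(3)]
kernels = [rng.normal(size=4) for _ in range(3)]
fused, weights = asff_fuse(feats, kernels)
```

Because the weights are computed per pixel, a location where one scale carries the most reliable evidence can draw mostly on that scale, which is the sense in which the fusion can reduce conflicts between levels.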