Abstract
Strawberries are an important cash crop. Traditional manual picking is costly and inefficient, while automated harvesting robots are hindered by field conditions such as stem and leaf occlusion, overlapping fruit, and variations in appearance and apparent maturity caused by lighting and viewing angle. To support accurate fruit identification across maturity stages together with keypoint detection, this study constructed MSRBerry, a strawberry image dataset covering multiple varieties, ripening stages, and complex ridge-cultivation field conditions. Based on the YOLO11-pose framework, we propose DHN-YOLO, which introduces three key improvements: (1) replacing the original C2PSA with the CDC module to enhance the capture of subtle features and the adaptability to irregular shapes; (2) substituting C3K2 with C3H to strengthen multi-scale feature extraction and robustness to lighting-induced variations in apparent maturity and color; and (3) upgrading the neck into a New-Neck via CA and dual-path fusion to reduce feature loss and improve the perception of critical regions. These modifications improve feature quality while reducing the parameter count and accelerating inference. Experimental results show that DHN-YOLO achieves 87.3% precision, 88% recall, and 78.6% mAP@50:95 for strawberry detection (0.9%, 1.6%, and 5% higher than YOLO11-pose, respectively); the corresponding values for keypoint detection are 83%, 87.5%, and 83.6% (improvements of 1.9%, 2.1%, and 4.6%, respectively). It also reaches 71.6 FPS with 15 ms single-image inference. The overall performance of DHN-YOLO also surpasses that of other mainstream models, including YOLO13, YOLO10, and DETR. These results indicate that DHN-YOLO meets practical needs for robust strawberry and picking-point detection in complex agricultural environments.
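For readers less familiar with the reported metrics, the following minimal sketch shows how precision, recall, and mAP@50:95 are conventionally computed. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names and all counts and AP values are hypothetical, and mAP@50:95 is taken in its usual COCO-style sense as the mean of the average precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05.

```python
from typing import Sequence, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from counts at a fixed confidence/IoU threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def map_50_95(ap_per_iou: Sequence[float]) -> float:
    """Average AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    if len(ap_per_iou) != 10:
        raise ValueError("expected one AP value per IoU threshold (10 in total)")
    return sum(ap_per_iou) / len(ap_per_iou)


# Hypothetical counts and AP values, chosen only for illustration
# (they are NOT taken from the experiments reported in this paper).
p, r = precision_recall(tp=88, fp=12, fn=12)
ap_values = [0.95, 0.93, 0.91, 0.88, 0.85, 0.80, 0.74, 0.65, 0.50, 0.29]
m = map_50_95(ap_values)
```

Because mAP@50:95 averages over progressively stricter IoU thresholds, it penalizes loosely localized boxes and keypoints more heavily than precision or recall at a single threshold, which is why it is typically the lowest of the three reported figures.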