Abstract
Strip steel surface defect detection remains a challenging task due to the diverse scales and uneven spatial distribution of defects, which often lead to incomplete feature representation and missed detections in sparsely distributed regions. To address these challenges, we propose a novel cross-scale spatial-semantic feature aggregation network (CSSFAN) that achieves fine-grained and semantically consistent feature fusion across multiple scales. Specifically, CSSFAN adopts a bottom-up feature aggregation strategy equipped with a series of cross-scale spatial-semantic aggregation modules (CSSAMs). Each CSSAM first establishes a mapping relationship between high-level feature points and low-level feature regions and then introduces a cross-scale attention mechanism that adaptively injects spatial details from low-level features into high-level semantic representations. This aggregation strategy bridges the gap between spatial precision and semantic abstraction, enabling the network to capture subtle and irregular defect patterns. Furthermore, we introduce an adaptive region proposal network (ARPN) to cope with the uneven spatial distribution of defects. ARPN dynamically adjusts the number of region proposals according to the local feature complexity, so that regions with dense or subtle defects receive more proposals while sparse or background regions are adaptively suppressed, thereby enhancing the model's sensitivity to defect-prone areas. Extensive experiments on two strip steel surface defect datasets demonstrate that our method significantly improves detection performance, validating its effectiveness and robustness.
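To make the point-to-region idea concrete, the following is a minimal illustrative sketch, not the CSSAM implementation itself. It rests on several assumptions that the abstract does not state: each high-level feature point maps to a fixed stride x stride block of the low-level map, attention is plain scaled dot-product with no learned projections, and the attended detail is added back residually. The function name `cross_scale_aggregate` and the `stride` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_scale_aggregate(high, low, stride=2):
    """Inject low-level detail into each high-level feature point.

    high: H x W x C nested lists (coarse, semantic features)
    low:  (H*stride) x (W*stride) x C nested lists (fine, spatial features)
    Assumption: point (i, j) maps to the stride x stride low-level block
    starting at (i*stride, j*stride); no learned projections are used.
    """
    out = []
    for i, row in enumerate(high):
        out_row = []
        for j, query in enumerate(row):
            # Collect the low-level region mapped to this high-level point.
            region = [low[i * stride + di][j * stride + dj]
                      for di in range(stride) for dj in range(stride)]
            # Scaled dot-product score between the query and each region vector.
            scale = math.sqrt(len(query))
            scores = [sum(q * k for q, k in zip(query, key)) / scale
                      for key in region]
            weights = softmax(scores)
            # Attention-weighted sum of region vectors, added residually.
            detail = [sum(w * v[c] for w, v in zip(weights, region))
                      for c in range(len(query))]
            out_row.append([q + d for q, d in zip(query, detail)])
        out.append(out_row)
    return out


# Toy example: one high-level point attending to a 2 x 2 low-level region.
high = [[[1.0, 0.0]]]
low = [[[1.0, 0.0], [0.0, 1.0]],
       [[0.0, 1.0], [0.0, 1.0]]]
fused = cross_scale_aggregate(high, low)
```

In this sketch the output keeps the resolution of the high-level map, so each coarse feature point is enriched only by the fine-grained vectors of its own region. A learned variant would typically add query, key, and value projections, which are omitted here for brevity.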