Abstract
Object detection in remote sensing imagery (RSI) is critical to many applications, yet mainstream detectors analyse only spatial features and, because of spectral bias, fail to learn high-frequency information adequately. This results in performance bottlenecks under cluttered backgrounds, distractors, and multi-scale targets, especially small ones. To overcome these limitations, we propose MSRS-DETR, an end-to-end framework that deeply fuses spatial and frequency cues. The approach introduces three key innovations: (1) C2f(FAT)NET, a frequency-attention-enhanced lightweight residual backbone that provides richer dual-domain features with fewer parameters; (2) an Entanglement Transformer Block (ETB) in the encoder that refines deep semantics via cross-domain frequency-spatial interaction and suppresses background interference; and (3) S2-CCFF, a shallow-feature-extended bidirectional fusion path that markedly improves the retention and utilisation of fine details for small objects. Experiments on HRSC2016 and ShipRSImageNet demonstrate the effectiveness and generalisation of this spatial-frequency paradigm: relative to the baseline, MSRS-DETR reduces parameters by 29.1% and, on the two datasets respectively, boosts inference speed by 12.4% and 8.4% and raises mAP(50-95) by 1.69% and 2.16%.