Abstract
Hyperspectral target detection remains a fundamental and challenging task in remote sensing due to high spectral dimensionality, significant background variability, scarcity of target samples, and severe class imbalance. To address these issues, we propose a novel framework named Self and Cross Attention Feature Fusion (SCAFF). The model follows a patch-based design and integrates two complementary branches: (i) an attention-driven branch that captures both intra-patch and inter-patch dependencies, and (ii) a convolutional branch that extracts local spatial-spectral patterns. By combining global contextual modeling with fine-grained local representation, the proposed architecture balances global and local feature analysis. The key novelty lies in the joint use of Self-Attention and Cross-Attention to adaptively guide the network's focus toward potential target regions, mitigating class imbalance and enhancing discriminability. Furthermore, an enhanced version of the guided filter is proposed to refine the spatial consistency of detection maps during post-processing. Extensive experiments on four widely used benchmark datasets demonstrate that SCAFF achieves superior accuracy and stability compared to state-of-the-art methods, with up to 15% performance improvement in some cases. These results confirm the effectiveness and robustness of the proposed method for practical hyperspectral target detection.
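As a rough illustration of how self-attention and cross-attention can be applied jointly to patch tokens, the following minimal sketch may help. It is not the SCAFF implementation. It assumes single-head scaled dot-product attention with identity projections (no learned weights), that self-attention models intra-patch dependencies while cross-attention models inter-patch dependencies, and that the two outputs are fused by per-token concatenation. The names `softmax`, `attention`, and `fuse_self_and_cross` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors.

    Assumption: identity projections, i.e. no learned Q/K/V weights.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse_self_and_cross(patch_tokens, other_tokens):
    """Hypothetical fusion: concatenate self- and cross-attention outputs."""
    # Self-attention: tokens of a patch attend to each other (intra-patch).
    self_out = attention(patch_tokens, patch_tokens, patch_tokens)
    # Cross-attention: tokens of a patch attend to another patch (inter-patch).
    cross_out = attention(patch_tokens, other_tokens, other_tokens)
    return [s + c for s, c in zip(self_out, cross_out)]


# Toy example: 2 tokens in the centre patch, 3 in a neighbouring patch,
# each token being a 3-band spectral vector (made-up values).
center = [[1.0, 0.0, 0.5], [0.2, 0.3, 0.1]]
neighbor = [[0.9, 0.1, 0.4], [0.0, 1.0, 0.0], [0.3, 0.3, 0.3]]
fused = fuse_self_and_cross(center, neighbor)
print(len(fused), len(fused[0]))
```

Each fused token has twice the input dimensionality, because the intra-patch and inter-patch summaries are kept side by side. A learned model would typically add projection matrices, multiple heads, and a trainable fusion layer in place of plain concatenation.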