Abstract
High-resolution remote sensing image scene classification (HRRSI-SC) is crucial for obtaining accurate Earth surface information. However, the task remains challenging due to significant background interference, high intra-class variation, and subtle inter-class similarities. Convolutional neural networks (CNNs) are constrained by their local receptive fields, which limits their ability to capture long-range spatial dependencies. On the other hand, Vision Transformers (e.g., ViT-B-16) excel at global feature extraction but often suffer from high computational complexity and may lack the inherent inductive biases for local feature modeling that CNNs possess. To address these limitations, this paper proposes a cross-level feature complementary classification framework based on Lie Group manifold space, termed CBCAM-LGM. Within the proposed CBCAM-LGM framework, multi-granularity features are first distilled via a global average pooling layer to suppress redundant information. The core of our approach, the cross-level bidirectional complementary attention module (CBCAM), then enables the adaptive fusion of features from both branches through a cross-query attention mechanism. Furthermore, by employing parallel dilated convolutions and a parameter-sharing strategy, the model captures multi-scale contextual information by sharing a single set of convolutional weights, which reduces the computational complexity to merely 1.21 GMACs while preserving multi-scale representation with minimal parameter overhead. Extensive experiments on challenging benchmarks demonstrate the model's efficacy, as it achieves a state-of-the-art classification accuracy of 97.81% on the AID, surpassing the ViT-B-16 baseline by 1.63%, while containing only 11.237 million parameters (an 87% reduction). These results collectively affirm that our model presents an efficient solution characterized by high accuracy and low complexity.