Abstract
Remote sensing images are essential in many fields, but acquiring them at high resolution (HR) is often limited by sensor resolution and cost. To address this challenge, we propose the Multi-image Remote Sensing Super-Resolution with Enhanced Spatio-temporal Feature Interaction Fusion Network ([Formula: see text]N), an end-to-end deep neural network. The main innovations of [Formula: see text]N are as follows. First, the Attention-Based Feature Encoder (ABFE) module extracts the spatial features of the low-resolution (LR) images. It is combined with the Channel Attention Block (CAB) module, which provides global information guidance and weighting for the input features and thereby strengthens the spatial feature extraction capability of ABFE. Second, for temporal feature modeling, we design the Residual Temporal Attention Block (RTAB). This module weights k LR images of the same location captured at different times via a global residual temporal connection mechanism, exploiting their similarities and temporal dependencies and enhancing cross-layer information transmission. Building on the ABFE features, the ConvGRU-RTAB Fusion Module (CRFM) uses RTAB to capture temporal features and fuses them with the spatial features. Finally, the Decoder module upsamples the fused features to reconstruct a high-quality super-resolution image. Comparative experiments show that our model achieves notable improvements in the cPSNR metric, reaching 49.69 dB and 51.57 dB in the NIR and RED bands of the PROBA-V dataset, respectively. The visual quality of the reconstructed images also surpasses that of state-of-the-art methods, including TR-MISR and MAST.
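For readers unfamiliar with the cPSNR metric reported above, the following is a minimal sketch of how it is commonly computed on the PROBA-V benchmark, not the evaluation code used in this work. It assumes intensities normalized to [0, 1], a boolean clear-pixel mask for the HR target, and a search over sub-image shifts of up to `max_shift` pixels; the function name `cpsnr` and its parameters are illustrative choices.

```python
import math


def cpsnr(sr, hr, mask, max_shift=6):
    """Simplified clear-pixel, bias-corrected PSNR (illustrative sketch).

    sr, hr : 2D lists of floats in [0, 1] with identical shape.
    mask   : 2D list of bools, True where the HR pixel is clear (assumed input).
    Returns the best score over all (u, v) shifts of the HR crop.
    """
    h, w = len(hr), len(hr[0])
    border = max_shift // 2                  # SR is cropped centrally
    ch, cw = h - max_shift, w - max_shift    # size of the compared crop
    best = -math.inf
    for u in range(max_shift + 1):
        for v in range(max_shift + 1):
            # Differences HR - SR over clear pixels of the shifted HR crop.
            diffs = [
                hr[u + i][v + j] - sr[border + i][border + j]
                for i in range(ch)
                for j in range(cw)
                if mask[u + i][v + j]
            ]
            if not diffs:
                continue                     # no clear pixels for this shift
            bias = sum(diffs) / len(diffs)   # brightness bias correction
            cmse = sum((d - bias) ** 2 for d in diffs) / len(diffs)
            if cmse == 0.0:
                return math.inf              # perfect match after correction
            best = max(best, -10.0 * math.log10(cmse))
    return best
```

Because the brightness bias is subtracted and the best shift is kept, a reconstruction that differs from the target only by a constant offset or a small translation is not penalized, which is why cPSNR values are not directly comparable to plain PSNR.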