Abstract
In recent years, many studies have established that attention mechanisms can substantially improve the performance of Convolutional Neural Networks (CNNs) on image classification tasks. Combining channel and spatial attention modules is one such mechanism, inspired by the visual perception of the human brain. However, no previous work has examined both the parallel and the sequential arrangement of combined channel-spatial attention modules under a comprehensive, like-for-like comparison, so it remains unclear which arrangement offers the better balance between accuracy and computational complexity. In this paper, we introduce two new channel-spatial attention modules, the Parallel Channel-Spatial Attention Module (PCSAM) and the Sequential Channel-Spatial Attention Module (SCSAM), which can be embedded in the architecture of any CNN. Each proposed module is composed of a channel and a spatial attention sub-module. The Channel Attention Module (CAM) and the Spatial Attention Module (SAM) help the network extract, respectively, the channels related to the structure of the Region of Interest (RoI) and its location in the input feature maps. We increase the representational power of the attention-based networks by extracting features with both Global Average Pooling (GAP) and Global Maximum Pooling (GMP) in the CAM and the SAM. In addition, a Dilated Convolution (DC) layer is employed in the SAM instead of a standard convolution to focus more effectively on the RoI in the feature maps. The PCSAM and SCSAM are implemented in the ResNet18 and MobileNetv4 architectures to produce ResNet18PCSAM, ResNet18SCSAM, MobileNetv4PCSAM, and MobileNetv4SCSAM. All networks are trained and evaluated on three general image classification datasets, CIFAR-10, CIFAR-100, and Tiny-ImageNet, under identical experimental conditions for 50 epochs.
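To make the parallel versus sequential distinction concrete, the following is a minimal NumPy sketch of the two arrangements. It is an illustration only, not the paper's implementation: the shared two-layer MLP in the channel branch, the sigmoid gating, the element-wise sum used to merge the parallel branches, and the omission of the SAM's dilated convolution (replaced here by a simple fusion of the channel-wise mean and max maps) are all assumptions made for brevity.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    # x: (C, H, W). GAP and GMP over the spatial dims give two (C,) descriptors.
    gap = x.mean(axis=(1, 2))
    gmp = x.max(axis=(1, 2))
    # Shared two-layer MLP applied to both pooled vectors (an assumption,
    # following the common CBAM-style design).
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    att = _sigmoid(mlp(gap) + mlp(gmp))          # per-channel weights in (0, 1)
    return x * att[:, None, None]                # reweight channels

def spatial_attention(x):
    # Channel-wise mean and max pooling give two (H, W) maps; the paper passes
    # them through a dilated convolution, which this sketch replaces with a
    # plain average for self-containedness.
    maps = np.stack([x.mean(axis=0), x.max(axis=0)])
    att = _sigmoid(maps.mean(axis=0))            # per-location weights in (0, 1)
    return x * att[None, :, :]                   # reweight spatial positions

def pcsam(x, w1, w2):
    # Parallel arrangement: both sub-modules see the same input; their outputs
    # are merged (sum is an assumption for illustration).
    return channel_attention(x, w1, w2) + spatial_attention(x)

def scsam(x, w1, w2):
    # Sequential arrangement: channel attention first, then spatial attention.
    return spatial_attention(channel_attention(x, w1, w2))
```

Both functions preserve the input's `(C, H, W)` shape, so either module can be inserted between convolutional stages of a backbone such as ResNet18 without altering the surrounding layers.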
The classification results on the test sets show that MobileNetv4SCSAM is more efficient than the other architectures on all three datasets. It also achieves higher performance than previously proposed channel-spatial attention modules.