Abstract
While attention mechanisms significantly enhance feature representation in Convolutional Neural Networks (CNNs), existing approaches often suffer from limited receptive fields, insufficient directional modeling, and static fusion strategies that treat the channel and spatial domains in isolation. To address these challenges, we propose the Dynamic Multi-Scale Channel-Spatial Attention (DMSCA) mechanism, a plug-and-play module that integrates six cohesive components to achieve deep feature coupling. Specifically, DMSCA introduces Temperature-controlled Channel Attention (TCA), which dynamically regulates the sharpness of the attention distribution via a learnable temperature parameter, and a Direction-aware Multi-scale Spatial Context Encoder (MSCE), which captures fine-grained details across varying kernel sizes while preserving precise positional cues through orthogonal interaction. Crucially, unlike fixed-structure methods such as CBAM, our Dynamic Feature Fusion (DFF) employs a learnable gating mechanism that adaptively weights and fuses channel-spatial information according to pixel-wise input content. Extensive experiments on CIFAR-10/100 and ImageNet demonstrate that DMSCA consistently outperforms state-of-the-art attention mechanisms; notably, it achieves a 1.52% Top-1 accuracy gain on ImageNet with a ResNet-50 backbone. Detailed analysis further confirms that DMSCA offers superior robustness to image degradation and stronger generalization, at a modest computational cost (11.3% more parameters and 2.4% more FLOPs).
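To make the two key ideas concrete, the following is a minimal PyTorch sketch, assuming one plausible realization of the temperature-controlled channel attention (TCA) and the learnable gated fusion (DFF) described above; the class names, layer choices, and reduction ratio are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: temperature-scaled channel attention and pixel-wise
# gated fusion. Structure and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class TCASketch(nn.Module):
    """Channel attention whose softmax sharpness is set by a learnable temperature."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Learnable log-temperature; exp() keeps it positive. Smaller temperature
        # sharpens the attention distribution, larger smooths it.
        self.log_temp = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        logits = self.mlp(self.pool(x).flatten(1))        # (B, C) channel logits
        weights = torch.softmax(logits / self.log_temp.exp(), dim=1)
        return x * weights.view(b, c, 1, 1)               # reweight channels


class GatedFusionSketch(nn.Module):
    """Pixel-wise learnable gate blending channel- and spatial-refined features."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv over the concatenated features produces a per-pixel gate.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_channel: torch.Tensor, f_spatial: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([f_channel, f_spatial], dim=1)))
        return g * f_channel + (1 - g) * f_spatial        # content-adaptive fusion


x = torch.randn(2, 64, 8, 8)
refined = TCASketch(64)(x)
fused = GatedFusionSketch(64)(refined, x)
print(refined.shape, fused.shape)
```

In this sketch the gate is recomputed from the input features at every position, which is what distinguishes the dynamic fusion from the fixed sequential channel-then-spatial ordering used by CBAM.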