Abstract
In this study, we propose a novel keyword spotting method that integrates a dynamic convolution model with a cross-frontend mutual learning strategy. The dynamic convolution model adaptively captures diverse, time-varying acoustic patterns, while the mutual learning strategy leverages complementary features extracted from multiple audio frontends to improve the model's generalization across input conditions. Experimental results on the public Google Speech Commands dataset demonstrate that the proposed method achieves 97% accuracy with only 62K parameters and 6.11M FLOPs. Furthermore, the method remains robust in noisy environments, maintaining reliable recognition even at low signal-to-noise ratios. These findings highlight the efficiency and robustness of the proposed approach, making it a promising solution for real-world keyword spotting applications.
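The core dynamic convolution idea referenced above can be sketched minimally: attention weights computed from the input mix several candidate kernels into one aggregated kernel, which is then applied as an ordinary convolution. The sketch below is an illustrative assumption in NumPy (toy 1-D case, global-average-pooling attention), not the paper's actual architecture; all names, shapes, and the pooling choice are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_conv1d(x, kernels, attn_w):
    """Input-dependent mixing of K candidate kernels, then one 1-D conv.

    x       : (T,)   toy feature sequence
    kernels : (K, W) candidate kernels
    attn_w  : (K,)   weights of a tiny (hypothetical) attention head
    """
    ctx = x.mean()                               # global average pooling of the input
    alpha = softmax(attn_w * ctx)                # attention over the K kernels
    k = (alpha[:, None] * kernels).sum(axis=0)   # aggregate into a single kernel
    return np.convolve(x, k, mode="valid")       # apply as a plain convolution

rng = np.random.default_rng(0)
x = rng.standard_normal(16)                      # T = 16
kernels = rng.standard_normal((4, 3))            # K = 4 kernels of width 3
attn_w = rng.standard_normal(4)
y = dynamic_conv1d(x, kernels, attn_w)
print(y.shape)  # (14,)
```

Because the attention depends on the input, different utterances effectively see different convolution kernels at negligible extra parameter cost, which is what lets such models stay small (here, the paper reports 62K parameters).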