Abstract
Audio sensors, essential for automatic speaker verification (ASV) systems, face growing threats from spoofed audio generated by advanced speech synthesis techniques. Traditional spoof detection methods, such as those based on computationally intensive Multi-Head Attention (MHA), suffer from quadratic complexity ($O(T^2)$) and high memory demands, making them impractical for deployment on resource-constrained audio sensors. To address these limitations, we propose a novel Dynamic Learnable Sparse Attention (DLSA) framework that integrates Mel-Frequency Cepstral Coefficient (MFCC), Constant-Q Transform (CQT), and raw-waveform modalities for spoof detection. The DLSA module introduces a learnable attention mechanism that dynamically selects key spectral and temporal features from the MFCC and CQT representations for cross-modal fusion, while a ResNet backbone extracts features from the raw waveform. We also introduce a hybrid loss function combining cross-entropy loss ($\mathcal{L}_{CE}$) and center loss ($\mathcal{L}_{C}$), jointly optimizing intra-class compactness and inter-class separability. Compared to MHA-based methods, our approach reduces computational cost by 80%. Experimental results on the ASVspoof 2019 Logical Access (LA) dataset demonstrate a significant performance gain, achieving an Equal Error Rate (EER) of 0.68% and a minimum tandem Detection Cost Function (min t-DCF) of 0.0173, corresponding to a 33.6% relative reduction in EER over existing methods. This approach provides an efficient and robust solution for spoof detection in resource-constrained ASV systems.
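To make the two core ideas of the abstract concrete, the following is a minimal PyTorch-style sketch of (a) a top-k sparse cross-attention block in the spirit of DLSA and (b) the hybrid cross-entropy plus center loss. The module names, the fixed `keep_k` selection rule, and the weighting factor `lambda_c` are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: names such as SparseCrossAttention, CenterLoss,
# keep_k, and lambda_c are assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseCrossAttention(nn.Module):
    """Cross-attention that keeps only the top-k key frames per query."""

    def __init__(self, dim: int, keep_k: int = 32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.keep_k = keep_k
        self.scale = dim ** -0.5

    def forward(self, query_feats, key_feats):
        # query_feats: (B, Tq, D), e.g. MFCC frames
        # key_feats:   (B, Tk, D), e.g. CQT frames
        q, k, v = self.q(query_feats), self.k(key_feats), self.v(key_feats)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, Tq, Tk)

        # Keep only the k highest-scoring keys for each query and mask the rest,
        # so every query attends to a sparse subset of frames. (A deployed version
        # would gather only the selected keys to avoid forming the full score
        # matrix; the dense mask here is purely for clarity.)
        keep_k = min(self.keep_k, scores.size(-1))
        topk_vals, _ = scores.topk(keep_k, dim=-1)
        threshold = topk_vals[..., -1:]                     # k-th largest score
        sparse_scores = scores.masked_fill(scores < threshold, float("-inf"))

        attn = F.softmax(sparse_scores, dim=-1)
        return torch.matmul(attn, v)                        # (B, Tq, D)


class CenterLoss(nn.Module):
    """Pulls embeddings toward a learnable per-class center (intra-class compactness)."""

    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()


def hybrid_loss(logits, feats, labels, center_loss, lambda_c=0.01):
    # L = L_CE + lambda_c * L_C : cross-entropy drives inter-class separability,
    # the center term drives intra-class compactness.
    return F.cross_entropy(logits, labels) + lambda_c * center_loss(feats, labels)
```

In this sketch, sparsity comes from retaining only `keep_k` key frames per query before the softmax; the actual DLSA module learns which spectral and temporal positions to keep, whereas the fixed top-k rule here simply illustrates the selection-and-fusion pattern.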