Spoof detection with dynamic learnable sparse attention and tri-modal fusion in resource-constrained audio systems


Abstract

Audio sensors, essential for automatic speaker verification (ASV) systems, face growing threats from spoofed audio generated by advanced speech synthesis techniques. Traditional spoof detection methods, such as those based on computationally intensive Multi-Head Attention (MHA), suffer from quadratic complexity (O(T²)) and high memory demands, making them impractical for deployment on resource-constrained audio sensors. To address these limitations, we propose a novel Dynamic Learnable Sparse Attention (DLSA) framework that integrates Mel-Frequency Cepstral Coefficients (MFCC), Constant-Q Transform (CQT), and raw waveform modalities for spoof detection. The DLSA module introduces a learnable attention mechanism that dynamically selects key spectral and temporal features from MFCC and CQT for cross-modal fusion. A ResNet backbone is used to extract features from the raw waveform. We also introduce a hybrid loss function combining cross-entropy loss (L_CE) and center loss (L_C), optimizing intra-class compactness and inter-class separability. Compared to MHA-based methods, our approach reduces computational costs by 80%. Experimental results on the ASVspoof 2019 Logical Access (LA) dataset demonstrate a significant performance boost, achieving an Equal Error Rate (EER) of 0.68% and a minimum tandem Detection Cost Function (t-DCF) of 0.0173, reducing EER by 33.6% relative to existing methods. This approach provides an efficient and robust solution for spoof detection in resource-constrained ASV systems.
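The hybrid objective described above can be sketched as the standard combination L = L_CE + λ·L_C, where the center loss pulls each embedding toward its class center. The sketch below is a minimal NumPy illustration, not the authors' implementation; the weighting factor `lam` and the function names are assumptions for illustration only.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(feats, labels, centers):
    # Mean squared distance of each embedding to its class center
    # (encourages intra-class compactness).
    return 0.5 * ((feats - centers[labels]) ** 2).sum(axis=1).mean()

def hybrid_loss(logits, feats, labels, centers, lam=0.01):
    # L = L_CE + lam * L_C; `lam` is a hypothetical balancing weight,
    # not a value reported in the paper.
    return cross_entropy(logits, labels) + lam * center_loss(feats, labels, centers)
```

In training, the class centers would themselves be updated (e.g., by a running average of each class's embeddings), so that the center term and the classifier are optimized jointly.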
