Enhancing dysarthric speech recognition through SepFormer and hierarchical attention network models with multistage transfer learning


Abstract

Dysarthria, a motor speech disorder that impairs articulation and speech clarity, poses significant challenges for Automatic Speech Recognition (ASR) systems. This study proposes a novel approach to improving the accuracy of Dysarthric Speech Recognition (DSR). A primary innovation lies in the integration of the SepFormer-Speech Enhancement Generative Adversarial Network (S-SEGAN), an advanced generative adversarial network tailored for Dysarthric Speech Enhancement (DSE), as a front-end processing stage for DSR systems. The S-SEGAN combines SEGAN's adversarial learning with the SepFormer's speech separation capabilities, yielding significant performance gains. Furthermore, a multistage transfer learning approach is employed to train and evaluate the DSR models at both the word and sentence levels: the models are first trained on a large speech corpus (LibriSpeech) and then fine-tuned on dysarthric speech data (both isolated and augmented). Evaluations demonstrate significant improvements in DSR accuracy with DSE integration. The dysarthric speech (DS) baseline models (without DSE), Transformer and Conformer, achieved Word Recognition Accuracies (WRAs) of 68.60% and 69.87%, respectively. Introducing a Hierarchical Attention Network (HAN) into the Transformer and Conformer architectures improved performance, with T-HAN achieving a WRA of 71.07% and C-HAN reaching 73%. With DSE + DSR on isolated words, the Transformer model achieved a WRA of 73.40% and the Conformer model 74.33%. Notably, the T-HAN and C-HAN models with DSE + DSR demonstrated even more substantial gains, with WRAs of 75.73% and 76.87%, respectively. Word-level data augmentation further boosted performance, with the Transformer and Conformer models achieving WRAs of 76.47% and 79.20%, respectively. Remarkably, the T-HAN and C-HAN models with DSE + DSR and augmented words reached WRAs of 82.13% and 84.07%, respectively, with C-HAN performing best among all proposed models.
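The pipeline described above (a DSE front-end feeding a DSR back-end, with the recognizer adapted in a later transfer learning stage) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes PyTorch, and the classes `EnhancerStub` and `RecognizerStub` are hypothetical stand-ins for the S-SEGAN generator and the Transformer/Conformer recognizer; all shapes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class EnhancerStub(nn.Module):
    """Hypothetical stand-in for the S-SEGAN generator: maps a raw
    dysarthric waveform to an enhanced waveform of the same shape."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=15, padding=7),
        )

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.net(wav)

class RecognizerStub(nn.Module):
    """Hypothetical stand-in for the Transformer/Conformer DSR model:
    maps an enhanced waveform to per-frame token logits."""
    def __init__(self, n_tokens=32):
        super().__init__()
        # Crude learnable framing: 25 ms windows, 10 ms hop at 16 kHz.
        self.frontend = nn.Conv1d(1, 64, kernel_size=400, stride=160)
        enc_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(64, n_tokens)

    def forward(self, wav):
        feats = self.frontend(wav).transpose(1, 2)   # (batch, frames, 64)
        return self.head(self.encoder(feats))        # (batch, frames, n_tokens)

# DSE + DSR composition: the enhancer acts as a fixed front-end
# while the recognizer is fine-tuned.
enhancer, recognizer = EnhancerStub(), RecognizerStub()
wav = torch.randn(2, 1, 16000)                       # 1 s of 16 kHz audio
with torch.no_grad():
    enhanced = enhancer(wav)                         # enhancement stage
logits = recognizer(enhanced)                        # recognition stage
print(logits.shape)                                  # torch.Size([2, 98, 32])

# Stage-2 fine-tuning sketch: freeze the enhancer and adapt the recognizer
# on dysarthric data. The abstract's multistage transfer learning would
# first pretrain the recognizer on LibriSpeech; only the fine-tune step is
# shown here, with random tensors standing in for real labeled batches.
for p in enhancer.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(recognizer.parameters(), lr=1e-4)
targets = torch.randint(0, 32, (2, 98))              # dummy per-frame labels
loss = nn.functional.cross_entropy(logits.reshape(-1, 32), targets.reshape(-1))
loss.backward()
optimizer.step()
```

Freezing the front-end during recognizer fine-tuning mirrors the staged structure the abstract describes, in which enhancement and recognition are trained separately before being composed.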
