High-fidelity zero-shot speaker adaptation in text-to-speech synthesis with denoising diffusion GAN


Abstract

Zero-shot speaker adaptation seeks to enable the cloning of voices for previously unseen speakers by leveraging only a few seconds of their speech samples. Nevertheless, existing zero-shot multi-speaker text-to-speech (TTS) systems continue to exhibit significant disparities in the synthesized speech quality and speaker similarity when comparing unseen to seen speakers. To address these challenges and improve synthesized speech quality and speaker similarity for unseen speakers, this study introduces an efficient zero-shot speaker-adaptive TTS model, DiffGAN-ZSTTS. The model is constructed on the FastSpeech2 framework and utilizes a diffusion-based decoder to enhance the model's generalization ability for unseen speaker samples in zero-shot settings. We present the SE-Res2FFT module, which refines the encoder's FFT block by incorporating SE-Res2Net modules in parallel with the multi-head self-attention mechanism, thereby achieving a balanced extraction of local and global features. Furthermore, we introduce the MHSE module, which employs multi-head attention mechanisms to augment the model's capability in representing speaker reference audio features. The model was trained and evaluated using both the AISHELL3 and LibriTTS datasets, providing a comprehensive evaluation of speech synthesis performance across both seen and unseen speaker conditions in Chinese and English. Experimental results indicate that DiffGAN-ZSTTS substantially improves both the synthesized speech quality and speaker similarity. Additionally, we assessed the model's performance on the Baker and VCTK datasets, which are outside the training domain, and the results reveal that the model can successfully perform zero-shot speech synthesis for unseen speakers with only a few seconds of speech, outperforming state-of-the-art models in both speaker similarity and audio quality.
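The abstract describes the SE-Res2FFT idea as placing a squeeze-and-excitation-style branch in parallel with the FFT block's multi-head self-attention, so that channel-gated local features and attention-based global features are fused. The sketch below is a hypothetical NumPy toy illustrating only that parallel-fusion pattern: the function names, the fixed toy weights, and the use of a plain SE gate (without Res2Net's multi-scale splitting) are simplifications not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, num_heads=2):
    # Toy multi-head self-attention over x: (seq_len, d_model).
    # Q = K = V = the head's slice of x; projections omitted for brevity.
    seq_len, d = x.shape
    dh = d // num_heads
    out = np.zeros_like(x)
    for h in range(num_heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]
        scores = softmax(q @ k.T / np.sqrt(dh))
        out[:, h * dh:(h + 1) * dh] = scores @ v
    return out

def se_gate(x, reduction=4):
    # Squeeze-and-excitation-style channel gating:
    # squeeze by averaging over time, excite through two small
    # linear maps, then sigmoid-gate each channel of x.
    d = x.shape[1]
    z = x.mean(axis=0)                       # squeeze: (d,)
    w1 = np.ones((d, d // reduction)) / d    # fixed toy weights, not learned
    w2 = np.ones((d // reduction, d)) / (d // reduction)
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0.0) @ w2)))
    return x * s                             # reweight channels

def parallel_fft_block(x):
    # Parallel fusion of the global (attention) branch and the
    # local channel-gated (SE) branch, summed element-wise.
    return self_attention(x) + se_gate(x)

x = np.random.default_rng(0).normal(size=(6, 8))
y = parallel_fft_block(x)
print(y.shape)  # same shape as the input: (6, 8)
```

The key design point the abstract motivates is that the two branches run side by side rather than in sequence, so neither the attention path nor the local-feature path dominates the representation.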
