Tibetan-Chinese speech-to-speech translation based on discrete units


Abstract

Speech-to-speech translation (S2ST) has evolved from cascade systems that chain Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS) into end-to-end models, driven by advances in model performance and the growth of cross-lingual speech datasets. Research on Tibetan speech translation remains scarce; this paper tackles direct Tibetan-to-Chinese speech-to-speech translation within a multi-task learning framework, combining self-supervised learning (SSL) with sequence-to-sequence training. We use a HuBERT model to extract discrete units from the target speech and build a speech-to-unit translation (S2UT) model with an encoder-decoder architecture, whose output is synthesized into speech by a unit-based vocoder. By employing SSL and using discrete representations as training targets, our approach effectively captures the linguistic differences between the two languages and enables direct translation. We evaluate the HuBERT model under various configurations and select the optimal setup based on Phone-unit Normalized Mutual Information (PNMI). After fine-tuning the chosen HuBERT model on the target corpora, we introduce auxiliary tasks to improve translation performance, underscoring the pivotal role of multi-task learning in overall model efficacy. Experimental results validate the feasibility of Tibetan-to-Chinese S2ST, demonstrating promising translation quality and preservation of semantic content despite limited data availability.
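The abstract selects HuBERT configurations by Phone-unit Normalized Mutual Information. As a minimal sketch of that metric (the abstract does not give its exact computation, so the frame-aligned inputs and the function name here are assumptions), PNMI can be computed as the mutual information between frame-level phone labels and discrete unit assignments, normalized by the phone entropy; a value near 1 means the units align closely with phones:

```python
from collections import Counter
import math

def pnmi(phones, units):
    """Phone-unit Normalized Mutual Information: I(phone; unit) / H(phone).

    phones, units: equal-length sequences of frame-level labels, where
    phones[i] is the phone label and units[i] the discrete-unit ID of frame i.
    (Illustrative helper; not from the paper.)
    """
    assert len(phones) == len(units)
    n = len(phones)
    joint = Counter(zip(phones, units))   # joint counts over (phone, unit)
    p_cnt = Counter(phones)               # marginal phone counts
    u_cnt = Counter(units)                # marginal unit counts

    # I(phone; unit) = sum p(x,y) * log( p(x,y) / (p(x) p(y)) )
    mi = sum((c / n) * math.log(c * n / (p_cnt[p] * u_cnt[u]))
             for (p, u), c in joint.items())

    # H(phone) = -sum p(x) log p(x)
    h_p = -sum((c / n) * math.log(c / n) for c in p_cnt.values())
    return mi / h_p if h_p > 0 else 0.0
```

If every unit maps to exactly one phone, PNMI is 1.0; if units are independent of phones, it is 0.0, which is what makes it a convenient scalar for comparing HuBERT layer and cluster-count configurations.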
