Speech recognition datasets for low-resource Congolese languages

针对资源匮乏的刚果语言的语音识别数据集

阅读:1

Abstract

Large pre-trained Automatic Speech Recognition (ASR) models have shown improved performance in low-resource languages due to the increased availability of benchmark corpora and the advantages of transfer learning. However, only a limited number of languages possess ample resources to fully leverage transfer learning. In such contexts, benchmark corpora become crucial for advancing methods. In this article, we introduce two new benchmark corpora designed for low-resource languages spoken in the Democratic Republic of the Congo: the Lingala Read Speech Corpus, with 4 h of labelled audio, and the Congolese Speech Radio Corpus, which offers 741 h of unlabelled audio spanning four significant low-resource languages of the region. During data collection, Lingala Read Speech recordings of thirty-two distinct adult speakers, each with a unique context under various settings with different accents, were recorded. Concurrently, Congolese Speech Radio raw data were taken from the archive of broadcast station, followed by a designed curation process. During data preparation, numerous strategies have been utilised for pre-processing the data. The datasets, which have been made freely accessible to all researchers, serve as a valuable resource for not only investigating and developing monolingual methods and approaches that employ linguistically distant languages but also multilingual approaches with linguistically similar languages. Using techniques such as supervised learning and self-supervised learning, they are able to develop inaugural benchmarking of speech recognition systems for Lingala and mark the first instance of a multilingual model tailored for four Congolese languages spoken by an aggregated population of 95 million. Moreover, two models were applied to this dataset. The first is supervised learning modelling and the second is for self-supervised pre-training.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。