CAFE: Spontaneous code-switching speech dataset in Algerian dialect, French and English

Abstract

Publicly available datasets capturing spontaneous multilingual speech, especially those involving code-switching between Algerian Arabic, French, and English, are critically scarce. This lack of resources hinders the development of automatic speech recognition (ASR) and multilingual NLP systems for low-resource languages and under-represented Arabic dialects. We introduce CAFE, a novel dataset comprising approximately 37 h of spontaneous, in vivo human-human conversations among more than 100 speakers across Algeria. The dialogues cover diverse everyday topics such as sports, science, and technology, and exhibit a rich range of natural conversational phenomena, including explicit code-switching, overlapping speech, non-lexical vocalizations (e.g., laughter, fillers, ambient noise), and dialectal variation reflecting Algeria's sociolinguistic landscape.

CAFE is released in two tiers:

CAFE-small (2 h 36 m): a fully human-annotated subset with high-quality transcriptions, vocal event labels, and linguistic annotations, supporting ASR evaluation, NLP tasks, and code-switching analysis.

CAFE-large (approximately 34 h 35 m): the remainder of the corpus, automatically labeled and suitable for pretraining and semi-supervised learning.

To support controlled experiments, CAFE-small includes two curated subsets:

(i) CAFE-small-clean (2 h 18 m): contains utterances with no overlapping speech.

(ii) CAFE-small-overlap (17 m): contains 23 files with overlap segments and timestamps.

The dataset also provides rich metadata, including audio chunk IDs, dialect labels, and both raw and linguistically processed transcripts. CAFE offers a valuable resource for advancing ASR, dialect identification, and sociolinguistic analysis in multilingual and low-resource settings.
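The durations quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the figures stated in the abstract; the dictionary layout and the helper names (`total_minutes`, `fmt`) are illustrative assumptions and not part of the dataset release.

```python
from datetime import timedelta

# Durations exactly as quoted in the abstract. The tier names come from the
# abstract; this dictionary layout is only an illustrative assumption.
TIERS = {
    "CAFE-small": timedelta(hours=2, minutes=36),
    "CAFE-large": timedelta(hours=34, minutes=35),
}
SMALL_SUBSETS = {
    "CAFE-small-clean": timedelta(hours=2, minutes=18),
    "CAFE-small-overlap": timedelta(minutes=17),
}


def total_minutes(durations):
    """Sum a mapping of name -> timedelta and return whole minutes."""
    total = sum(durations.values(), timedelta())
    return int(total.total_seconds() // 60)


def fmt(minutes):
    """Render a minute count in the 'H h MM m' style used in the abstract."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins:02d} m"


corpus_minutes = total_minutes(TIERS)
subset_minutes = total_minutes(SMALL_SUBSETS)

print("Both tiers:", fmt(corpus_minutes))          # 37 h 11 m
print("CAFE-small subsets:", fmt(subset_minutes))  # 2 h 35 m
```

The two tiers sum to 37 h 11 m, consistent with the "approximately 37 h" total. The two curated subsets sum to 2 h 35 m, one minute short of the 2 h 36 m quoted for CAFE-small, which is presumably a rounding effect in the reported figures.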
