Abstract
Publicly available datasets of spontaneous multilingual speech, especially speech involving code-switching between Algerian Arabic, French, and English, are critically scarce. This lack of resources hinders the development of automatic speech recognition (ASR) and multilingual NLP systems for low-resource languages and under-represented Arabic dialects. We introduce CAFE, a novel dataset comprising approximately 37 h of spontaneous, in vivo human-human conversations among 100+ speakers across Algeria. The dialogues cover diverse everyday topics such as sports, science, and technology, and exhibit a rich range of natural conversational phenomena, including explicit code-switching, overlapping speech, non-lexical sounds (e.g., laughter, fillers, ambient noise), and dialectal variation reflecting Algeria's sociolinguistic landscape. CAFE is released in two tiers. CAFE-small (2 h 36 min) is a fully human-annotated subset with high-quality transcriptions, vocal event labels, and linguistic annotations, supporting ASR evaluation, NLP tasks, and code-switching analysis. CAFE-large (approximately 34 h 35 min) is the remainder of the corpus, automatically labeled and suitable for pretraining and semi-supervised learning. To support controlled experiments, CAFE-small includes two curated subsets: (i) CAFE-small-clean (2 h 18 min), which contains utterances with no overlapping speech, and (ii) CAFE-small-overlap (17 min), which contains 23 files with overlapping-speech segments and their timestamps. The dataset also provides rich metadata, including audio chunk IDs, dialect labels, and both raw and linguistically processed transcripts. CAFE offers a valuable resource for advancing ASR, dialect identification, and sociolinguistic analysis in multilingual and low-resource settings.