A dataset for translating local Bangla (Sylheti) dialects into standard Bangla

Abstract

Sylheti is spoken by about 11 million people worldwide, mainly in northeastern Bangladesh and southern Assam, India, and by diaspora communities that originated in these regions. Translating Sylheti into Standard Bangla is essential for effective communication across the country and internationally. This article introduces a parallel corpus of sentence pairs, each consisting of a sentence in the Sylheti dialect and its Standard Bangla equivalent, created to improve Neural Machine Translation (NMT) between the two. Sylheti has a rich cultural heritage and is known for its distinctive vocabulary, music, and folklore, yet it has been largely absent from formal written materials and digital resources, leaving a gap in its linguistic representation. To bridge this gap, 5,002 sentence pairs were carefully collected from various sources, including Bangladeshi newspapers, social media platforms, and voluntary comments and contributions from native Sylheti speakers. The dataset, collected between December 2023 and March 2025, comprises 10,004 sentences, 10,340 clauses (5,069 Sylheti and 5,271 Standard Bangla), and 21,132 unique words (9,729 Sylheti and 11,403 Standard Bangla). Beyond machine translation, the collection supports other natural language processing tasks such as text classification, named entity recognition, and sentiment analysis, and it enables the development of technologies for Sylheti such as text-to-speech systems and language models. This resource is a significant step toward better understanding and using the Sylheti language in the digital world.
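The reported totals are sums over the two sides of the corpus: 10,004 sentences is twice the 5,002 pairs, and 21,132 unique words is 9,729 plus 11,403. As a rough illustration of how such counts could be derived from a list of sentence pairs, here is a minimal Python sketch. It assumes whitespace tokenization and an in-memory list of (Sylheti, Standard Bangla) string pairs; the function name `corpus_stats` and the placeholder tokens are hypothetical, and the abstract does not describe the authors' actual counting procedure.

```python
from typing import Dict, Iterable, Tuple


def corpus_stats(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count pairs, sentences, and unique whitespace tokens per side.

    Assumption: words are separated by whitespace. Real tokenization of
    Bangla-script text (punctuation, clitics) may differ.
    """
    source_vocab = set()  # unique tokens on the Sylheti side
    target_vocab = set()  # unique tokens on the Standard Bangla side
    n_pairs = 0

    for source, target in pairs:
        n_pairs += 1
        source_vocab.update(source.split())
        target_vocab.update(target.split())

    return {
        "pairs": n_pairs,
        # One sentence per side, so sentences = 2 x pairs.
        "sentences": 2 * n_pairs,
        "unique_source_words": len(source_vocab),
        "unique_target_words": len(target_vocab),
        # The total is the sum of the per-side vocabularies,
        # matching how the abstract's figures add up.
        "unique_words_total": len(source_vocab) + len(target_vocab),
    }


if __name__ == "__main__":
    # Placeholder tokens only, not real Sylheti or Bangla data.
    sample = [("s1 s2 s3", "b1 b2"), ("s1 s4", "b2 b3 b4")]
    print(corpus_stats(sample))
```

Counting vocabulary per side and then adding the two figures keeps the result comparable with the numbers in the abstract; a word that happens to be spelled identically in both varieties would be counted once on each side under this scheme.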
