Abstract
Sylheti is a language spoken by about 11 million people worldwide, mainly in northeastern Bangladesh and southern Assam, India, and by diaspora communities originating from these regions. It has a rich cultural heritage and is known for its distinctive vocabulary, music, and folklore, yet it has been largely absent from formal written materials and digital resources, leaving a gap in its linguistic representation. Translating Sylheti into Standard Bangla is essential for effective communication both within Bangladesh and internationally. This article introduces a parallel corpus of sentence pairs, each consisting of a Sylheti sentence and its Standard Bangla equivalent, created to improve Neural Machine Translation (NMT) between the two. To build it, 5,002 sentence pairs were carefully collected from a variety of sources, including Bangladeshi newspapers, social media platforms, and voluntary comments and contributions from native Sylheti speakers. The dataset, collected between December 2023 and March 2025, contains 21,132 unique words (9,729 Sylheti and 11,403 Standard Bangla), 10,340 clauses (5,069 Sylheti and 5,271 Standard Bangla), and 10,004 sentences in total. Beyond machine translation, the collection supports other natural language processing tasks, such as text classification, named entity recognition, and sentiment analysis. It also enables the development of further technologies for Sylheti, such as text-to-speech systems and language models. This resource is a significant step towards better understanding and use of the Sylheti language in the digital world.