BanglaTense: A large-scale dataset of Bangla sentences categorized by tense: Past, present, and future


Abstract

Bengali (Bangla), an Indo-Aryan language, has a complex tense system, and handling it correctly matters for natural language processing (NLP) applications such as text classification, machine translation, and sentiment analysis. The BanglaTense dataset is a large-scale collection of Bangla sentences categorized by tense: past, present, and future. Addressing the resource gap in NLP for Bangla, it provides 17,819 annotated sentences: 5,629 in the past tense, 6,101 in the present tense, and 6,089 in the future tense. The dataset serves as a benchmark for evaluating NLP models on Bangla sentence classification, promoting linguistic diversity and inclusive language models, with roughly balanced representation across the three categories. Preprocessing steps, including anonymization and duplicate removal, were applied to improve data quality. Three native Bangla speakers independently assessed the tense labels of the sentences, supporting the dataset's reliability. BanglaTense is designed to advance NLP research and development for Bangla, with applications in tense detection, text classification, language modeling, and educational tools. By providing a foundation for temporal analysis of Bangla sentences, it supports linguistic study and the development of more precise, context-aware NLP models. The dataset is openly available for academic and research purposes, promoting collaboration and innovation within the Bangla NLP community.
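The class sizes reported in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the three counts stated above. The dictionary keys and the largest-to-smallest ratio are illustrative choices for this example, not a balance metric the dataset authors use.

```python
# Tense counts as stated in the abstract.
counts = {"past": 5629, "present": 6101, "future": 6089}

total = sum(counts.values())
assert total == 17819  # matches the reported dataset size

# Share of each tense class in the full dataset.
shares = {tense: n / total for tense, n in counts.items()}

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
# (Illustrative measure chosen for this sketch.)
imbalance_ratio = max(counts.values()) / min(counts.values())

for tense, share in shares.items():
    print(f"{tense}: {counts[tense]} sentences ({share:.1%})")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The three counts sum to the stated total, and each class holds roughly a third of the sentences, with the past tense slightly underrepresented relative to the other two.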
