Tackling toxicity in Arabic social media through advanced detection techniques

Abstract

Online social networks (OSNs) are currently the most widely used interactive media for interpersonal communication, emotional expression, and information sharing. Alongside helpful and engaging content, inappropriate or abusive content, such as toxicity, hate speech, and insults, is sometimes shared on these networks. Toxic content covers any kind of online abuse, including but not limited to cyberbullying, discrimination, abusive language, profanity, flaming, hate speech, and harassment. Most toxicity detection efforts have focused on English text, while little work has addressed Arabic. In this work, we constructed a standard Arabic dataset that can be used for toxicity and abuse detection on OSNs. The dataset was annotated by five experts who are native, fluent Arabic speakers and linguists. To evaluate the dataset, we conducted a series of experiments comparing sixteen machine learning algorithms, the FastText model, and seven transfer learning architectures. We also used four text representation techniques: bag of words (BOW), term frequency-inverse document frequency (TF-IDF), FastText, and bidirectional encoder representations from transformers (BERT). Our experimental results show that the fine-tuned MARBERTv2 model with BERT embeddings outperforms the other models, achieving an F1-score of 92.43% and an accuracy of 92.21%. This study highlights the importance of addressing toxicity on social media platforms across diverse languages and cultures, and the result marks a significant breakthrough in the classification of toxic Arabic tweets.
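For readers unfamiliar with the simpler representations and metrics named in the abstract, the following is a minimal sketch of bag-of-words counts, TF-IDF weights, accuracy, and F1-score in plain Python. It is not the authors' pipeline. The function names, the toy tokens, and the labels are hypothetical placeholders, the TF-IDF variant shown (tf = count / document length, idf = ln(N / df)) is one common textbook definition, and the F1 shown is the binary F1 of a single positive class; the abstract does not state which variants the study used.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocab):
    """Raw term counts for one tokenized document, ordered by vocab."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tf_idf(docs):
    """TF-IDF vectors with tf = count / len(doc) and idf = ln(N / df).

    This is one textbook variant, assumed here for illustration only.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero on empty documents
        vectors.append(
            [(counts[term] / length) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vocab, vectors


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three tokenized texts and made-up labels
# (1 = toxic, 0 = non-toxic). None of this comes from the paper's dataset.
toy_docs = [["you", "are", "kind"], ["you", "are", "awful"], ["have", "a", "nice", "day"]]
toy_vocab, toy_vectors = tf_idf(toy_docs)
print(toy_vocab)
print(bag_of_words(toy_docs[1], toy_vocab))
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

In practice these sparse vectors would feed a classical classifier, whereas FastText and BERT produce dense learned representations instead; library implementations typically add smoothing and normalization that this sketch omits.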
