ViClickbait-2025: A comprehensive dataset for Vietnamese clickbait detection



Abstract

ViClickbait-2025 is a curated Vietnamese-language dataset developed to facilitate research on automatic clickbait detection. It comprises 3,414 headlines collected by web scraping from eight major Vietnamese online news platforms between 2023 and 2025. Each headline is annotated as either clickbait or non-clickbait, with 31.2% labeled as clickbait. The dataset includes nine key attributes, covering headline text, metadata, article summaries, and simulated engagement indicators. A preprocessing pipeline was applied to remove HTML noise, eliminate duplicates, and normalize the data. Annotation was carried out by three independent reviewers following a standardized guideline, with inter-annotator agreement reaching a Cohen's kappa of 0.822. Disagreements were resolved by a fourth annotator, and inconclusive cases were excluded. The final dataset spans 13 news categories and is released in JSONL and CSV formats under a CC BY 4.0 license.
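
To make the agreement figure concrete, the following is a minimal sketch of how Cohen's kappa can be computed for binary clickbait labels. Cohen's kappa is defined for two annotators, and the abstract does not say how the three reviewers' labels were combined into a single value; averaging over all reviewer pairs is an assumption made here for illustration only. The function names and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(annotations):
    """Average kappa over all annotator pairs (an assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy labels invented for illustration: 1 = clickbait, 0 = non-clickbait.
    reviewer_1 = [1, 0, 0, 1, 0, 1, 0, 0]
    reviewer_2 = [1, 0, 0, 1, 0, 0, 0, 0]
    reviewer_3 = [1, 0, 1, 1, 0, 1, 0, 0]
    print(round(mean_pairwise_kappa([reviewer_1, reviewer_2, reviewer_3]), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for a dataset like this one where the two classes are imbalanced (31.2% clickbait). For several annotators, Fleiss' kappa is a common alternative to pairwise averaging.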
