Covid-19 vaccine hesitancy: Text mining, sentiment analysis and machine learning on COVID-19 vaccination Twitter dataset



Abstract

In 2019, an outbreak of coronavirus disease, known as COVID-19, began and grew into a pandemic. Many scientists believe that the pandemic originated in Wuhan, China, before spreading to other parts of the globe. To reduce the spread of the disease, decision makers encouraged measures such as hand washing, face masking, and social distancing. In early 2021, some countries, including the United States, began administering COVID-19 vaccines. Vaccination brought relief to the public, but it also generated considerable debate between anti-vaccine and pro-vaccine groups. The controversy surrounding COVID-19 vaccines influenced many people's decisions to accept or reject vaccination. Because other data are limited, social media data, collected by live-streaming public tweets through an Application Programming Interface (API) search, are considered a viable and reliable resource for studying public opinion on COVID-19 vaccine hesitancy. This study therefore examines three sentiment computation methods (Azure Machine Learning, VADER, and TextBlob) to analyze COVID-19 vaccine hesitancy. Five learning algorithms (Random Forest, Logistic Regression, Decision Tree, LinearSVC, and Naïve Bayes) were deployed with different combinations of three vectorization methods (Doc2Vec, CountVectorizer, and TF-IDF). Vocabulary normalization was threefold: Porter stemming, lemmatization, and Porter stemming combined with lemmatization. For each vocabulary normalization strategy, we designed, developed, and evaluated 42 models. The study shows that COVID-19 vaccine hesitancy slowly decreases over time, suggesting that the public gradually feels warmer and more optimistic about COVID-19 vaccination. Moreover, combining Porter stemming and lemmatization increased model performance. Finally, our experiments show that TextBlob + TF-IDF + LinearSVC performs best in classifying public sentiment as positive, neutral, or negative, with an accuracy, precision, recall, and F1 score of 0.96752, 0.96921, 0.92807, and 0.94702, respectively. That is, the best performance was achieved using TextBlob sentiment scores with TF-IDF vectorization and a LinearSVC classification model. We also found that combining two vectorizations (CountVectorizer and TF-IDF) decreases model accuracy.
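To make the best-performing combination concrete, the following is a minimal sketch of a TF-IDF + LinearSVC classifier in scikit-learn, trained on labels derived from sentiment polarity scores. It is an illustration under stated assumptions, not the authors' implementation: the toy tweets and their polarity values are hypothetical, the helper `polarity_to_label` and its zero threshold are assumptions (the abstract does not state how scores were mapped to classes), and the stemming and lemmatization steps are omitted. In practice, the polarity values would come from a tool such as TextBlob, whose polarity score ranges from -1 to 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a three-way sentiment label.

    The threshold is an assumption made for this sketch.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical toy tweets paired with made-up polarity scores,
# standing in for scores produced by a sentiment tool.
toy_data = [
    ("got my vaccine today and feeling great", 0.8),
    ("so happy my parents are finally vaccinated", 0.7),
    ("the vaccine rollout is wonderful news", 0.9),
    ("i do not trust this vaccine at all", -0.6),
    ("terrible side effects after the shot", -0.9),
    ("worried the vaccine was rushed and unsafe", -0.5),
    ("vaccine appointments open on monday", 0.0),
    ("the clinic is located on main street", 0.0),
    ("second dose is scheduled in three weeks", 0.0),
]

texts = [text for text, _ in toy_data]
labels = [polarity_to_label(score) for _, score in toy_data]

# TF-IDF turns each tweet into a weighted term vector;
# LinearSVC then learns a linear decision boundary per class.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(texts, labels)

# Classify a new, equally hypothetical tweet.
predictions = model.predict(["feeling great after my second dose"])
print(predictions[0])
```

With a real dataset, the same pipeline would be fitted on a training split and scored on a held-out split, for example with `sklearn.metrics.classification_report`, to obtain accuracy, precision, recall, and F1 figures comparable to those reported above.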
