An integrated clustering and BERT framework for improved topic modeling

一种用于改进主题建模的集成聚类和BERT框架

阅读:1

Abstract

Topic modelling is a machine learning technique that is extensively used in Natural Language Processing (NLP) applications to infer topics within unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most used topic modeling techniques that can automatically detect topics from a huge collection of text documents. However, the LDA-based topic models alone do not always provide promising results. Clustering is one of the effective unsupervised machine learning algorithms that are extensively used in applications including extracting information from unstructured textual data and topic modeling. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and Latent Dirichlet Allocation (LDA) in topic modeling with clustering based on dimensionality reduction have been studied in detail. As the clustering algorithms are computationally complex, the complexity increases with the higher number of features, the PCA, t-SNE and UMAP based dimensionality reduction methods are also performed. Finally, a unified clustering-based framework using BERT and LDA is proposed as part of this study for mining a set of meaningful topics from the massive text corpora. The experiments are conducted to demonstrate the effectiveness of the cluster-informed topic modeling framework using BERT and LDA by simulating user input on benchmark datasets. The experimental results show that clustering with dimensionality reduction would help infer more coherent topics and hence this unified clustering and BERT-LDA based approach can be effectively utilized for building topic modeling applications.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。