Topic modeling revisited: New evidence on algorithm performance and quality metrics



Abstract

Topic modeling is a popular technique for exploring large document collections. It has proven useful for this task, but its application poses a number of challenges. First, comparing the available algorithms is anything but simple, as researchers use many different datasets and criteria for their evaluation. A second challenge is the choice of a suitable metric for evaluating the calculated results. The metrics used so far provide a mixed picture, making it difficult to verify the accuracy of topic modeling outputs. Altogether, the choice of an appropriate algorithm and the evaluation of the results remain unresolved issues. Although many studies have reported promising performance by various topic models, prior research has not yet systematically investigated the validity of the outcomes in a comprehensive manner, that is, using more than a small number of the available algorithms and metrics. Consequently, our study has two main objectives. First, we compare all commonly used, non-application-specific topic modeling algorithms and assess their relative performance. The comparison is made against a known clustering and thus enables an unbiased evaluation of results. Our findings show a clear ranking of the algorithms in terms of accuracy. Second, we analyze the relationship between existing metrics and the known clustering, and thus objectively determine under what conditions these algorithms may be utilized effectively. In this way, we enable readers to gain a deeper understanding of the performance of topic modeling techniques and the interplay of performance and evaluation metrics.
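The evaluation idea the abstract describes, scoring a model's topic assignments against a known clustering, is typically operationalized with an external agreement metric. The sketch below computes normalized mutual information (NMI), one standard such metric, in plain Python; the toy labels and the choice of NMI are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: comparing topic assignments to a known clustering
# via normalized mutual information (NMI). The data here is a toy example,
# not the paper's evaluation setup.
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """NMI between two labelings, using the sqrt(H*H) normalization."""
    n = len(labels_true)
    assert n == len(labels_pred) and n > 0

    # Cluster sizes and the joint contingency counts.
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Entropies of the two labelings (natural log).
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())

    # Mutual information from the contingency table.
    mi = 0.0
    for (t, p), c in joint.items():
        p_joint = c / n
        p_t = count_true[t] / n
        p_p = count_pred[p] / n
        mi += p_joint * math.log(p_joint / (p_t * p_p))

    denom = math.sqrt(h_true * h_pred)
    return mi / denom if denom > 0 else 1.0


# Known clustering (e.g. human-assigned document categories) versus
# the dominant topic a hypothetical model assigned to each document.
known = [0, 0, 0, 1, 1, 1]
assigned = [0, 0, 1, 1, 1, 1]
print(f"NMI = {nmi(known, assigned):.3f}")
```

NMI is 1.0 for a perfect match (even under a relabeling of the topics) and approaches 0 for independent assignments, which makes it convenient for ranking algorithms against a fixed ground-truth clustering.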
