Fast2Vec, a modified model of FastText that enhances semantic analysis in topic evolution

Abstract

BACKGROUND: Topic modeling approaches, such as latent Dirichlet allocation (LDA) and its successor, the dynamic topic model (DTM), are widely used to identify specific topics by extracting words with similar frequencies from documents. However, these topics often require manual interpretation, which makes it difficult to construct a semantic account of topic evolution, particularly when topics contain negations, synonyms, or rare terms. Neural network-based word embeddings, such as Word2Vec and FastText, have advanced semantic understanding but have limitations of their own: Word2Vec struggles with out-of-vocabulary (OOV) words, and FastText generates suboptimal embeddings for infrequent terms.
METHODS: This study introduces Fast2Vec, a novel model that integrates the semantic capabilities of Word2Vec with the subword analysis strength of FastText to enhance semantic analysis in topic modeling. The model was evaluated using research abstracts from the Science and Technology Index (SINTA) journal database and validated using twelve public word similarity benchmarks covering diverse semantic and syntactic dimensions. Spearman and Pearson correlation coefficients were used to assess alignment with human judgments.
RESULTS: Experimental findings demonstrated that Fast2Vec outperforms or closely matches Word2Vec and FastText across most benchmark datasets, particularly in tasks requiring fine-grained semantic similarity. In OOV scenarios, Fast2Vec improved semantic similarity by 39.64% compared to Word2Vec and by 6.18% compared to FastText. Even in scenarios without OOV terms, Fast2Vec achieved a 7.82% improvement over FastText and a marginal 0.087% improvement over Word2Vec. Additionally, the model effectively categorized topics into four distinct evolution patterns (diffusion, shifting, moderate fluctuations, and stability), enabling a deeper understanding of evolving topic interests and their dynamic characteristics.
CONCLUSION: Fast2Vec presents a robust and generalizable word embedding framework for semantic-based topic modeling. By combining the contextual sensitivity of Word2Vec with the subword flexibility of FastText, Fast2Vec effectively addresses prior limitations in handling OOV terms and semantic variation and demonstrates strong potential for broader applications in natural language processing tasks.
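
The word similarity evaluation described in METHODS can be illustrated with a minimal sketch. It scores each word pair by the cosine similarity of its embeddings, then compares those scores with human ratings using Pearson and Spearman correlation. This is not the authors' implementation: the embedding vectors, the benchmark pairs, and all function names below are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy embeddings (not taken from any trained model).
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.4],
}

# Hypothetical benchmark rows: (word1, word2, human similarity rating).
benchmark = [
    ("cat", "dog", 8.5),
    ("car", "truck", 8.0),
    ("cat", "car", 2.0),
    ("dog", "truck", 1.5),
]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in benchmark]
human_scores = [rating for _, _, rating in benchmark]

rho = spearman(model_scores, human_scores)
r = pearson(model_scores, human_scores)
print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")
```

Spearman correlation depends only on how the pairs are ordered, so it rewards a model that ranks pairs as humans do even when its similarity scale differs from the rating scale. Pearson correlation additionally measures how linear that relationship is. Note that a model without subword information has no vector for an OOV word, so a pair containing one cannot be scored this way unless it is skipped or assigned a fallback vector.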
