TopicForest: embedding-driven hierarchical clustering and labeling for biomedical literature

TopicForest:基于嵌入驱动的生物医学文献层次聚类和标注

阅读:1

Abstract

OBJECTIVE: The rapid expansion of biomedical literature necessitates effective approaches for organizing and interpreting complex research topics. Existing embedding-based topic modeling techniques provide flat clusters at single granularities, which ignores the reality of complex hierarchies of subjects. Our objective is to instead create a forest of topic trees, each of which start from a broad area and drill down to narrow specialties. METHODS: We propose TopicForest, a new embedding-driven hierarchical clustering and labeling framework that involves: (1) embedding biomedical abstracts within a high-dimensional semantic space using contrastively trained LLMs, (2) manifold learning to reduce dimensionality for visual interpretation, (3) hierarchical clustering via binary partitioning and multi-level dendrogram cutting, and (4) recursive LLM-based topic summarization to efficiently generate concise and coherent labels from the smallest clusters up to broad subjects covering thousands of publications. We construct a corpus comprising 24,366 biomedical abstracts from Scientific Reports, leveraging its human-curated topic hierarchy as gold-standard for evaluation. We evaluate clustering performance using Adjusted Mutual Information (AMI) and Dasgupta's cost, while labeling quality is evaluated based on diversity and hierarchical affinity. RESULTS: TopicForest's dendrogram cutting achieves AMI scores comparable to or better than flat embedding-based clustering methods such as BERTopic (with K-means or HDBSCAN) across multiple dimension-reduction strategies (t-SNE and UMAP), while uniquely providing multi-scale topic granularity. It also outperforms the deep hierarchical topic model HyperMiner, yielding higher AMI scores and comparable Dasgupta's costs. For labeling, the proposed LLM recursive labeling method surpasses both c-TF-IDF and HyperMiner, achieving higher label diversity and hierarchical affinity, while maintaining efficient token usage. Furthermore, TopicForest maintains stable clustering quality across different embedding models, demonstrating robustness and generalizability in hierarchical topic discovery. CONCLUSION: Through novel integration of LLMs, dimension reduction, and advanced hierarchical clustering techniques, TopicForest provides effective and interpretable hierarchical topic modeling for biomedical literature, facilitating multi-scale exploration and visualization of literature corpora.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。