Abstract
OBJECTIVE: The rapid expansion of biomedical literature necessitates effective approaches for organizing and interpreting complex research topics. Existing embedding-based topic modeling techniques produce flat clusters at a single granularity, ignoring the hierarchical structure of research subjects. Our objective is instead to build a forest of topic trees, each of which starts from a broad area and drills down to narrow specialties. METHODS: We propose TopicForest, a new embedding-driven hierarchical clustering and labeling framework that comprises four steps: (1) embedding biomedical abstracts in a high-dimensional semantic space using contrastively trained LLMs, (2) manifold learning to reduce dimensionality for visual interpretation, (3) hierarchical clustering via binary partitioning and multi-level dendrogram cutting, and (4) recursive LLM-based topic summarization to efficiently generate concise and coherent labels, from the smallest clusters up to broad subjects covering thousands of publications. We construct a corpus of 24,366 biomedical abstracts from Scientific Reports, using its human-curated topic hierarchy as the gold standard for evaluation. We evaluate clustering performance using Adjusted Mutual Information (AMI) and Dasgupta's cost, and labeling quality using diversity and hierarchical affinity. RESULTS: TopicForest's dendrogram cutting achieves AMI scores comparable to or better than those of flat embedding-based clustering methods such as BERTopic (with K-means or HDBSCAN) across multiple dimension reduction strategies (t-SNE and UMAP), while uniquely providing multi-scale topic granularity. It also outperforms the deep hierarchical topic model HyperMiner, yielding higher AMI scores and a comparable Dasgupta's cost. For labeling, the proposed recursive LLM labeling method surpasses both c-TF-IDF and HyperMiner, achieving higher label diversity and hierarchical affinity while maintaining efficient token usage.
Furthermore, TopicForest maintains stable clustering quality across different embedding models, demonstrating robustness and generalizability in hierarchical topic discovery. CONCLUSION: Through a novel integration of LLMs, dimension reduction, and advanced hierarchical clustering techniques, TopicForest provides effective and interpretable hierarchical topic modeling for biomedical literature, facilitating multi-scale exploration and visualization of literature corpora.
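
For readers unfamiliar with Dasgupta's cost, one of the two clustering metrics named above, the following is a minimal sketch of how it can be computed for a binary hierarchy. It is an illustration only, not the TopicForest implementation: the nested-tuple tree encoding, the function names `leaves` and `dasgupta_cost`, and the toy similarity matrix are all assumptions introduced here. In the similarity-based formulation, each pair of documents contributes its similarity multiplied by the number of leaves under the pair's lowest common ancestor, so a hierarchy that separates similar documents only near the bottom of the tree receives a lower cost.

```python
def leaves(tree):
    """Return the leaf ids under a nested-tuple binary tree (hypothetical encoding)."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def dasgupta_cost(tree, sim):
    """Sum over leaf pairs of sim[i][j] times the leaf count under their lowest common ancestor."""
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    left_ids, right_ids = leaves(left), leaves(right)
    n_leaves = len(left_ids) + len(right_ids)
    # Pairs separated at this node have this node as their lowest common ancestor.
    split_weight = sum(sim[i][j] for i in left_ids for j in right_ids)
    return (n_leaves * split_weight
            + dasgupta_cost(left, sim)
            + dasgupta_cost(right, sim))


# Toy similarity matrix (made-up values): documents 0 and 1 are similar, as are 2 and 3.
sim = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]

good_tree = ((0, 1), (2, 3))  # keeps similar documents together until the last split
bad_tree = ((0, 2), (1, 3))   # separates similar documents at the root

good_cost = dasgupta_cost(good_tree, sim)
bad_cost = dasgupta_cost(bad_tree, sim)
print(good_cost < bad_cost)
```

In this toy case the tree that groups the similar pairs first has the lower cost, which is the sense in which the metric rewards hierarchies that match the similarity structure of the corpus.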