Spectral clustering methods are known for their ability to represent clusters of diverse shapes and densities. However, the results of such algorithms, when applied to text documents for example, are hard to explain to the user, mainly because the embedding in the spectral space has no obvious relation to document contents. There is therefore an urgent need for methods that explain the outcome of the clustering. In this paper we construct a theoretical bridge linking the clusters produced by Graph Spectral Clustering to the actual document content, given that similarities between documents are computed as cosine measures in the tf or tfidf representation. This link makes it possible to explain cluster membership in the clusters produced by Graph Spectral Clustering. We propose a way of explaining the results of graph spectral clustering based on both the combinatorial and the normalized Laplacian. For this purpose, we show the (approximate) equivalence of the combinatorial Laplacian embedding, the K-embedding (proposed in this paper), and the term vector space embedding. An experimental study shows that the K-embedding approximates the Laplacian embedding well under favourable block matrix conditions, and that the approximation is good enough under other conditions. We also show the perfect equivalence of the normalized Laplacian embedding, the [Formula: see text]-embedding (proposed in this paper), and the (weighted) term vector space embedding. Hence a bridge is constructed between the textual contents and the clustering results for both combinatorial and normalized Laplacian based Graph Spectral Clustering methods. We provide a theoretical background for our approach. An initial version of this paper is available at arXiv (Starosta B 2023). Readers may refer to that text for the formal aspects of our method and a detailed overview of the motivation.
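To make the setting of the abstract concrete, the following is a minimal sketch of generic graph spectral clustering on cosine similarities, followed by a naive term-level summary of each cluster. It is not the K-embedding or the explanation method proposed in the paper. The toy term-frequency matrix, the function names (`cosine_similarity`, `fiedler_labels`, `top_terms`), and the "mean inside minus mean outside" term score are all assumptions made for illustration only.

```python
import numpy as np

# Toy term-frequency data (hypothetical): rows are documents, columns are terms.
# The shared term "data" keeps the similarity graph connected.
terms = ["graph", "cluster", "laplacian", "goal", "match", "team", "data"]
tf = np.array([
    [2, 1, 1, 0, 0, 0, 1],
    [1, 2, 0, 0, 0, 0, 1],
    [1, 1, 2, 0, 0, 0, 1],
    [0, 0, 0, 2, 1, 1, 1],
    [0, 0, 0, 1, 2, 1, 1],
    [0, 0, 0, 1, 1, 2, 1],
], dtype=float)


def cosine_similarity(x):
    """Pairwise cosine similarity between rows, with the diagonal set to zero."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    return sim


def fiedler_labels(sim):
    """Split into two clusters by the sign of the Fiedler vector of L = D - S."""
    laplacian = np.diag(sim.sum(axis=1)) - sim
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    return (eigvecs[:, 1] > 0).astype(int)


def top_terms(x, labels, cluster, k=3):
    """Illustrative score only: mean term frequency inside minus outside the cluster."""
    inside = x[labels == cluster].mean(axis=0)
    outside = x[labels != cluster].mean(axis=0)
    order = np.argsort(outside - inside)[:k]  # largest contrast first
    return [terms[i] for i in order]


labels = fiedler_labels(cosine_similarity(tf))
explanations = {int(c): top_terms(tf, labels, c) for c in np.unique(labels)}
print(labels, explanations)
```

Which cluster receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping of documents is meaningful. The term summary here works directly in the term vector space; the paper's contribution is to show formally when such a term-space view agrees with the spectral embedding.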
Explainable Graph Spectral Clustering of text documents.
Authors: Starosta Bartłomiej, Kłopotek Mieczysław A, Wierzchoń Sławomir T, Czerski Dariusz, Sydow Marcin, Borkowski Piotr
| Journal: | PLoS One | Impact factor: | 2.600 |
| Year: | 2025 | Citation: | 2025 Feb 4; 20(2):e0313238 |
| doi: | 10.1371/journal.pone.0313238 | ||
