A Hyperparameter-Free, Fast and Efficient Framework to Detect Clusters From Limited Samples Based on Ultra High-Dimensional Features

Abstract

Clustering is a challenging problem in machine learning in which one attempts to group N objects into K_0 groups based on P features measured on each object. In this article, we examine the case where N ≪ P and K_0 is not known. Clustering in such high-dimensional, small-sample-size settings has numerous applications in biology, medicine, the social sciences, clinical trials, and other scientific and experimental fields. Whereas most existing clustering algorithms either require the number of clusters to be known a priori or are sensitive to the choice of tuning parameters, our method does not require the prior specification of K_0 or any tuning parameters. This represents an important advantage for our method because training data are not available in the applications we consider (i.e., in unsupervised learning problems). Without training data, estimating K_0 and other hyperparameters, and thus applying alternative clustering algorithms, can be difficult and can lead to inaccurate results. Our method is based on a simple transformation of the Gram matrix and application of the strong law of large numbers to the transformed matrix. If the correlation between features decays as the number of features grows, we show that the transformed feature vectors concentrate tightly around their respective cluster expectations in a low-dimensional space. This result simplifies the detection and visualization of the unknown cluster configuration. We illustrate the algorithm by applying it to 32 benchmarked microarray datasets, each containing thousands of genomic features measured on a relatively small number of tissue samples. Compared to 21 other commonly used clustering methods, we find that the proposed algorithm is faster and twice as accurate in determining the "best" cluster configuration.
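
The concentration idea described in the abstract can be illustrated with a small simulation. The abstract does not state the exact transformation of the Gram matrix, so the sketch below assumes the simplest candidate, the Gram matrix divided by P, and should not be read as the authors' algorithm. The function names (`scaled_gram`, `simulate_two_clusters`, `spectral_embedding`), the two-cluster Gaussian data, and the independence of features are all assumptions made for illustration only.

```python
import numpy as np


def scaled_gram(X):
    """Return the N x N Gram matrix X X^T divided by the number of features P.

    Assumption: this scaling stands in for the (unspecified) transformation.
    Each entry is then an average over P features, so the law of large
    numbers applies as P grows.
    """
    _, p = X.shape
    return X @ X.T / p


def simulate_two_clusters(n_per_cluster=10, p=5000, seed=0):
    """Hypothetical N << P data: two clusters, independent unit-variance noise."""
    rng = np.random.default_rng(seed)
    cluster_means = rng.normal(0.0, 1.0, size=(2, p))  # one mean vector per cluster
    labels = np.repeat([0, 1], n_per_cluster)
    X = cluster_means[labels] + rng.normal(0.0, 1.0, size=(labels.size, p))
    return X, labels


def spectral_embedding(G, dim=2):
    """Embed the N samples in `dim` dimensions using the top eigenpairs of G."""
    eigvals, eigvecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    top_vals = eigvals[-dim:]
    top_vecs = eigvecs[:, -dim:]
    return top_vecs * np.sqrt(top_vals)


X, labels = simulate_two_clusters()
G = scaled_gram(X)

# Compare off-diagonal entries for same-cluster and different-cluster pairs.
same_cluster = labels[:, None] == labels[None, :]
off_diagonal = ~np.eye(labels.size, dtype=bool)
within = G[same_cluster & off_diagonal]
between = G[~same_cluster]
print("within-cluster entries : mean %.3f, std %.3f" % (within.mean(), within.std()))
print("between-cluster entries: mean %.3f, std %.3f" % (between.mean(), between.std()))

# Low-dimensional view: samples from the same cluster land close together.
Z = spectral_embedding(G, dim=2)
centroids = np.stack([Z[labels == k].mean(axis=0) for k in (0, 1)])
spread = max(np.linalg.norm(Z[labels == k] - centroids[k], axis=1).max() for k in (0, 1))
separation = np.linalg.norm(centroids[0] - centroids[1])
print("max within-cluster spread: %.3f, centroid separation: %.3f" % (spread, separation))
```

In this toy setting, the same-cluster entries of the scaled Gram matrix cluster around one value and the different-cluster entries around another, so the block structure is visible without specifying the number of clusters in advance. How the published method reads the cluster configuration off the transformed matrix is not described in the abstract and is not reproduced here.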
