Fast Probabilistic Whitening Transformation for Ultra-High Dimensional Genetic Data

针对超高维遗传数据的快速概率白化变换

阅读:1

Abstract

Statistical methods often make assumptions about independence between the samples or features of a dataset. Yet correlation structure is ubiquitous in real data, so these assumptions are often not met in practice. Whitening transformations are widely applied to remove this correlation structure. Existing approaches to whitening are based on standard linear algebra, rather than a probabilistic model, and application to high dimensional datasets with n samples and p features is problematic as p approaches or exceeds n . Moreover, the computational time becomes prohibitive since the naive transform is cubic in p . Here we propose a probabilistic model for data whitening and examine its properties based on first principles as p increases. We demonstrate the statistical properties of the probabilistic model and derive a remarkably efficient algorithm that is linear instead of cubic time in the number of features. We examine the out-of-sample performance of the probabilistic whitening model on simulated data, and real genotype data. In an application to impute z-statistics from unobserved genetic variants from a genome-wide association study of schizophrenia, the probabilistic whitening transformation, had the lowest mean square error while being up to an order of magnitude faster than other methods. Using this approach, we also identify tandem repeats that explain genetic regulatory signals for disease-relevant genes. Analyses are implemented in our novel open source R packages decorrelate and imputez.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。