The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been showed to greatly improve the prediction performance for a variety of phenotypes. In the high-dimensional settings, the lasso, since its first proposal in statistics, has been proved to be an effective method for simultaneous variable selection and estimation. However, the large-scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports â1-penalized linear model, logistic regression, Cox model, and also extends to the elastic net with â1/â2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.
阅读:4
作者:Qian Junyang, Tanigawa Yosuke, Du Wenfei, Aguirre Matthew, Chang Chris, Tibshirani Robert, Rivas Manuel A, Hastie Trevor
| 期刊: | PLoS Genetics | 影响因子: | 3.700 |
| 时间: | 2020 | 起止号: | 2020 Oct 23; 16(10):e1009141 |
| doi: | 10.1371/journal.pgen.1009141 | ||
特别声明
1、本文转载旨在传播信息,不代表本网站观点,亦不对其内容的真实性承担责任。
2、其他媒体、网站或个人若从本网站转载使用,必须保留本网站注明的“来源”,并自行承担包括版权在内的相关法律责任。
3、如作者不希望本文被转载,或需洽谈转载稿费等事宜,请及时与本网站联系。
4、此外,如需投稿,也可通过邮箱info@biocloudy.com与我们取得联系。
