Estimating population structure using epigenome-wide methylation data



Abstract

INTRODUCTION: In epigenome-wide association studies (EWAS), unaddressed population stratification often leads to inflated test statistics. We aimed to compute methylation population scores (MPSs) that predict genetic principal components (GPCs) using a feature selection and regression approach.

METHODS: We used multi-ethnic methylation data (Illumina 450K/EPIC array) from unrelated MESA (n=929), CARDIA (n=1123), JHS (n=1365), ARIC (n=2338), and HCHS/SOL (n=1475) individuals, randomly assigning 85% of participants from each cohort to a training dataset and the remaining 15% to a test dataset. First, within each cohort, we estimated the association of each GPC with each available CpG methylation site using linear regression, adjusting for age, sex, smoking status, race/ethnic background (as a proxy for background information associated with lifestyle and other environmental exposures that may impact methylation), alcohol use status, body mass index, and cell type proportions. We meta-analyzed the associations across cohorts and selected CpG sites with an FDR-adjusted q-value < 0.05. We next aggregated individual-level data across the cohort-specific training datasets and applied two-stage weighted least squares Lasso regression, with the GPCs as the outcomes and the selected CpG sites as penalized predictors, adjusting for the aforementioned covariates. Each MPS is the weighted sum of the CpG sites selected by the Lasso. To evaluate the MPSs, we constructed them in the test dataset and compared them with GPCs and with MPSs constructed based on a previously published paper. Comparison was based on correlation analysis and data visualization. We also demonstrate the use of the MPSs in EWAS.

RESULTS: In the test dataset, the MPSs were highly correlated with GPCs, with correlation decreasing, though not monotonically, for later components. Specifically, MPS1 and GPC1 had R² = 0.99, while MPS7 and GPC7 had R² = 0.27 (the lowest observed correlation). In data visualization, MPSs showed patterns similar to those of GPCs in differentiating self-reported White, Black, and Hispanic/Latino groups, and outperformed MPCs constructed using alternative published methods. MPSs performed comparably to GPCs in reducing some of the inflation in EWAS.

CONCLUSIONS: Methylation-based population scores provide a reliable estimate of population structure in the data and can complement GPCs when genetic data are absent. Unlike previous methods based on unsupervised methylation PCA, MPSs use supervised learning with covariate adjustment to capture genetic structure across diverse populations. The weights for each GPC derived in our study can be applied to generate MPSs in other studies.
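As an illustration only, three of the steps described in the abstract (selecting CpG sites at FDR q < 0.05, forming a score as a weighted sum of CpG methylation values, and comparing the score with a GPC by squared correlation) can be sketched in Python with NumPy. This is a minimal sketch, not the authors' pipeline: it assumes the Benjamini-Hochberg procedure for the FDR adjustment (the abstract does not name the procedure), the function names `bh_qvalues`, `methylation_population_score`, and `r_squared` are hypothetical, and all p-values, weights, and methylation values are synthetic. The cohort-wise regressions, meta-analysis, and two-stage weighted least squares Lasso are not reproduced here.

```python
import numpy as np


def bh_qvalues(pvals):
    """Benjamini-Hochberg FDR-adjusted q-values (assumed adjustment method)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q


def methylation_population_score(meth, weights, intercept=0.0):
    """Weighted sum of CpG methylation values.

    meth: array of shape (n_samples, n_cpgs); weights: array of shape (n_cpgs,).
    Returns one score per sample.
    """
    meth = np.asarray(meth, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return intercept + meth @ weights


def r_squared(x, y):
    """Squared Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples, n_cpgs = 200, 5

    # Synthetic methylation beta values in [0, 1]; not real data.
    meth = rng.uniform(0.0, 1.0, size=(n_samples, n_cpgs))

    # Hypothetical meta-analysis p-values, one per CpG site.
    pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.6])
    keep = bh_qvalues(pvals) < 0.05

    # Hypothetical weights for the retained CpG sites (in practice these
    # would come from a fitted penalized regression).
    weights = np.array([2.0, -1.5])
    mps1 = methylation_population_score(meth[:, keep], weights)

    # Synthetic stand-in for a genetic principal component.
    gpc1 = mps1 + rng.normal(0.0, 0.1, size=n_samples)

    print(f"CpG sites kept: {keep.sum()} of {n_cpgs}")
    print(f"R^2(MPS1, GPC1) = {r_squared(mps1, gpc1):.3f}")
```

In a real application, the weights would be the published per-GPC coefficients rather than made-up numbers, and the methylation matrix would need to be restricted to the same CpG sites in the same order as those weights.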
