Generating Correlated Data for Omics Simulation

生成用于组学模拟的相关数据

阅读:1

Abstract

Simulation of realistic omics data is a key input for benchmarking studies that help users obtain optimal computational pipelines. Omics data involves large numbers of measured features on each samples and these measures are generally correlated with each other. However, simulation too often ignores these correlations, perhaps due to the inconvenience and computational hurdles of doing so. To alleviate this, we describe in detail three approaches for quickly generating omics-scale data with correlated measures which mimic real data sets. These approaches all are based on a Gaussian copula approach with a covariance matrix that decomposes into a diagonal part and a low-rank part. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, improves in performance when given gene-gene dependencies in some circumstances. We provide an R package, dependentsimr, that has efficient implementations of these methods and can generate dependent data with arbitrary distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), or with an empirical distribution.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。