vcfsim: flexible simulation of all-sites VCFs with missing data


Abstract

BACKGROUND: VCFs are the most widely used data format for encoding genetic variation. By design, standard VCFs do not include data from sites where all individuals are homozygous for the reference allele ("invariant sites") and thus do not differentiate these from sites where data are completely missing. However, missing data are a key feature of biological datasets across all domains of genomics, and many recent studies have shown that missing data can introduce a variety of statistical biases in the estimation of key population genetic parameters. A solution to this limitation is to include invariant sites in a standard VCF, creating an "all-sites VCF" that exposes missing and invariant sites explicitly. One hurdle to the wider adoption of all-sites VCFs is the lack of a reliable parameterized simulation framework for generating biologically realistic all-sites VCFs.

RESULTS: Here, we introduce an open-source command line tool, vcfsim, that interfaces with the popular coalescent simulation platform msprime and provides convenience functions for simulating all-sites VCFs with variable levels of ploidy and missing data. We show that the post-processed VCFs generated using vcfsim align precisely with population genetic expectations (i.e. are statistically identical to raw msprime output), accurately introduce missing data, and permit the simulation of data with varying ploidy levels, including the simulation of intraindividual ploidy variation (e.g. heterogametic sex chromosomes) and population structure.

CONCLUSIONS: Our results show that vcfsim is a useful and easy-to-use tool for benchmarking new software tools, performing population genetic inference, training machine learning models, and exploring the effects of missing data in genomic data sets.
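To make the distinction between invariant and missing sites concrete, the following is a minimal sketch of the post-processing idea described above. It is not vcfsim's implementation or its command line interface. The function name `all_sites_records`, its parameters, and the placeholder REF/ALT bases are all assumptions made for illustration, and only VCF data lines (no header) are produced. Variant genotypes are passed in directly, standing in for simulator output.

```python
import random


def all_sites_records(seq_len, variants, ploidies, missing_rate=0.0, seed=None):
    """Return VCF-style data lines for every position 1..seq_len.

    Hypothetical helper for illustration only (not part of vcfsim).

    variants:     dict mapping a 1-based position to a list of per-sample
                  genotypes; each genotype is a tuple of allele indices
                  (0 = REF, 1 = ALT) whose length is that sample's ploidy.
    ploidies:     list with one ploidy per sample, so mixed ploidy
                  (e.g. a diploid and a haploid sample) can be expressed.
    missing_rate: probability that any single genotype call is masked.
    """
    rng = random.Random(seed)
    # Genotypes written at positions that carry no variant: all-REF calls.
    invariant = [(0,) * p for p in ploidies]
    records = []
    for pos in range(1, seq_len + 1):
        genotypes = variants.get(pos, invariant)
        calls = []
        for gt in genotypes:
            if rng.random() < missing_rate:
                # Missing call: one "." per chromosome copy, e.g. "./." or "."
                calls.append("/".join(["."] * len(gt)))
            else:
                calls.append("/".join(str(a) for a in gt))
        # "A" and "T" are placeholder bases; "." marks a site with no ALT.
        alt = "T" if pos in variants else "."
        fields = ["1", str(pos), ".", "A", alt, ".", "PASS", ".", "GT"]
        records.append("\t".join(fields + calls))
    return records


if __name__ == "__main__":
    # One diploid and one haploid sample, a single variant at position 3.
    demo = all_sites_records(
        seq_len=5,
        variants={3: [(0, 1), (1,)]},
        ploidies=[2, 1],
        missing_rate=0.2,
        seed=1,
    )
    print("\n".join(demo))
```

In this sketch an invariant site is written as explicit reference calls (`0/0`), while a masked call is written as `./.`, so downstream tools can tell "observed and monomorphic" apart from "not observed". Masking each genotype independently is only one possible missing-data model and is assumed here for simplicity.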
