Haplotype Matching with GBWT for Pangenome Graphs

利用GBWT进行泛基因组图的单倍型匹配

阅读:1

Abstract

Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome graph based haplotype sets are not trivially generalizable from haplotype matches in the linear reference genome based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph. The graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run length compressed BWT with compressed local alphabets. Although efficient in practice count and locate queries on the GBWT were provided by the original authors, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome graph-based haplotype sets by generalizing from haplotype matches in linear reference genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT in the same time complexity as the GBWT due to their similarity.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。