Linear time minimum segmentation enables scalable founder reconstruction.

阅读:5
作者:Norri Tuukka, Cazaux Bastien, Kosolobov Dmitry, Mäkinen Veli
BACKGROUND:  We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set R = {R1, …, Rm} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into set P of disjoint segments such that each segment [a, b] ∈ P has length at least L and the number d(a, b) = |{Ri[a, b]:1 ≤ i ≤ m}| of distinct substrings at segment [a, b] is minimized over [a, b] ∈ P . The distinct substrings in the segments represent founder blocks that can be concatenated to form max {d(a, b):[a, b] ∈ P} founder sequences representing the original R such that crossovers happen only at segment boundaries. RESULTS:  We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O(mn2) . CONCLUSIONS:  Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。