Learning-based parallel acceleration for HaplotypeCaller



Abstract

In genome analysis workflows, the Genome Analysis Toolkit (GATK) HaplotypeCaller is a widely used variant calling tool designed to accurately identify single nucleotide polymorphisms (SNPs) and insertions/deletions (Indels) in samples. However, when processing large-scale datasets, HaplotypeCaller often suffers from excessively long runtimes. Parallelizing GATK HaplotypeCaller through data segmentation is an effective solution, but existing methods struggle to accurately estimate the computational complexity of each data block, leading to severe computational skew. This paper introduces LPA (learning-based parallel acceleration), a learning-based framework that uses a model to accurately predict the computational complexity of the data. By employing adaptive data segmentation algorithms and Multi-Knapsack Problem (MKP) based task scheduling, LPA significantly alleviates computational skew. We evaluated LPA on multiple datasets, demonstrating that its execution speed is 30x–40x faster than HaplotypeCaller and 2x–5x faster than HaplotypeCallerSpark. LPA also achieves a speedup of 1.3x–2x over similar methods while maintaining accuracy above 99.9%, enhancing the efficiency and reliability of variant calling. The source code of LPA is publicly available at https://github.com/laixx9/LPA.
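To make the idea of skew-aware scheduling concrete, the following is a minimal sketch of how predicted per-block costs could be used to balance work across workers. It is not the authors' MKP formulation: it swaps in the simpler longest-processing-time-first (LPT) greedy heuristic, and the names `schedule_blocks`, `skew`, and the block ids and cost values are hypothetical, chosen only for illustration.

```python
import heapq


def schedule_blocks(predicted_costs, num_workers):
    """Assign blocks to workers with the LPT greedy heuristic.

    predicted_costs: dict mapping a block id to its predicted cost
                     (hypothetical units, e.g. estimated seconds).
    Returns (loads, assignment): the total predicted cost per worker and
    the list of block ids given to each worker.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    # Min-heap of (current load, worker index): the least-loaded worker pops first.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    # Place the most expensive blocks first; ties are broken by block id.
    ordered = sorted(predicted_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for block, cost in ordered:
        load, w = heapq.heappop(heap)
        assignment[w].append(block)
        heapq.heappush(heap, (load + cost, w))

    loads = [0.0] * num_workers
    for load, w in heap:
        loads[w] = load
    return loads, assignment


def skew(loads):
    """Ratio of the heaviest worker load to the mean load (1.0 is perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical predicted costs for five data blocks.
    costs = {"b0": 8, "b1": 7, "b2": 6, "b3": 5, "b4": 4}
    loads, assignment = schedule_blocks(costs, num_workers=2)
    print(loads, assignment, round(skew(loads), 3))
```

The quality of any such schedule depends on how accurate the predicted costs are, which is why the abstract pairs the scheduler with a learned complexity model: a poor estimate for one heavy block is enough to reintroduce skew.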
