Abstract
In genome analysis workflows, the Genome Analysis Toolkit (GATK) HaplotypeCaller is a widely used variant calling tool designed to accurately identify single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) in samples. However, when processing large-scale datasets, HaplotypeCaller often suffers from excessively long runtimes. Parallelizing GATK HaplotypeCaller through data segmentation is an effective solution, but existing methods struggle to accurately estimate the computational complexity of each data block, leading to severe computational skew. This paper introduces LPA (learning-based parallel acceleration), a learning-based framework that leverages a model to accurately predict the computational complexity of the data. By employing adaptive data segmentation algorithms and Multi-Knapsack Problem (MKP) based task scheduling, LPA significantly alleviates computational skew. We evaluated LPA on multiple datasets, demonstrating that it runs 30x–40x faster than HaplotypeCaller and 2x–5x faster than HaplotypeCallerSpark. Compared to similar methods, LPA achieves a speedup of 1.3x–2x. LPA also maintains an accuracy of over 99.9%, enhancing the efficiency and reliability of variant calling. The source code of LPA is publicly available at https://github.com/laixx9/LPA.
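
To illustrate how predicted per-block complexity can drive scheduling and reduce computational skew, the following minimal Python sketch assigns data blocks to workers. It is not the LPA implementation: it substitutes a simple longest-processing-time-first greedy heuristic for the MKP-based scheduler, and the names `schedule_blocks`, `skew`, `predicted_costs`, and `num_workers` are hypothetical and introduced only for this example.

```python
import heapq


def schedule_blocks(predicted_costs, num_workers):
    """Assign blocks to workers with a longest-processing-time-first greedy rule.

    Illustrative stand-in for MKP-based scheduling, not the LPA algorithm.
    predicted_costs[i] is the (assumed) predicted cost of block i.
    """
    # Min-heap of (current load, worker index): the least-loaded worker is on top.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    # Visit blocks from most to least expensive.
    order = sorted(range(len(predicted_costs)),
                   key=lambda i: predicted_costs[i], reverse=True)
    for block in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(block)
        heapq.heappush(heap, (load + predicted_costs[block], worker))

    loads = [sum(predicted_costs[b] for b in blocks) for blocks in assignment]
    return assignment, loads


def skew(loads):
    """Ratio of the heaviest worker load to the mean load (1.0 is perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0


if __name__ == "__main__":
    assignment, loads = schedule_blocks([8, 7, 6, 5, 4], num_workers=2)
    print(assignment, loads, round(skew(loads), 3))
```

In this sketch the total runtime is bounded by the heaviest worker, so the quality of the schedule depends directly on how accurate the predicted costs are; inaccurate estimates would reintroduce the skew the scheduler is meant to remove.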