SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework

SparkGA2:基于 Apache Spark 的生产级、内存高效基因组分析框架

阅读:1

Abstract

Due to the rapid decrease in the cost of NGS (Next Generation Sequencing), interest has increased in using data generated from NGS to diagnose genetic diseases. However, the data generated by NGS technology is usually in the order of hundreds of gigabytes per experiment, thus requiring efficient and scalable programs to perform data analysis quickly. This paper presents SparkGA2, a memory efficient, production quality framework for high performance DNA analysis in the cloud, which can scale according to the available computational resources by increasing the number of nodes. Our framework uses Apache Spark's ability to cache data in the memory to speed up processing, while also allowing the user to run the framework on systems with lower amounts of memory at the cost of slightly less performance. To manage the memory footprint, we implement an on-the-fly compression method of intermediate data and reduce memory requirements by up to 3x. Our framework also uses a streaming approach to gradually stream input data as processing is taking place. This makes our framework faster than other state of the art approaches while at the same time allowing users to adapt it to run on clusters with lower memory. As compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. Including the streaming solution, where data pre-processing is considered, SparkGA2 is 51% faster on a 6 node cluster. The source code of SparkGA2 is publicly available at https://github.com/HamidMushtaq/SparkGA2.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。