Benchmarking long-read assembly tools and preprocessing strategies for bacterial genomes: A case study on E. coli DH5α

Abstract

Genome assembly is a crucial step in microbial genomics, significantly impacting downstream applications such as functional annotation and comparative genomics. While long-read sequencing technologies have improved genome reconstruction, the choice of assembler and preprocessing methods substantially influences assembly quality. Here, we benchmarked eleven long-read assemblers, namely Canu, Flye, HINGE, Miniasm, NECAT, NextDenovo, Raven, Shasta, SmartDenovo, wtdbg2 (Redbean), and Unicycler, using standardized computational resources. Assemblies were evaluated on runtime, contiguity (N50, total length, contig count), GC content, and completeness using Benchmarking Universal Single-Copy Orthologs (BUSCO). Assemblers employing progressive error correction with consensus refinement, notably NextDenovo and NECAT, consistently generated near-complete, single-contig assemblies with few misassemblies and stable performance across preprocessing types. Flye offered a strong balance of accuracy and contiguity, although it was sensitive to corrected input. Canu achieved high accuracy but produced fragmented assemblies (3 to 5 contigs) and required the longest runtimes. Unicycler reliably produced circular assemblies but with slightly shorter contigs than Flye or NextDenovo. Ultrafast tools such as Miniasm and Shasta provided rapid draft assemblies, yet were highly dependent on preprocessing and required polishing to achieve completeness. HINGE and wtdbg2 underperformed due to structural instability and fragmentation.
Preprocessing had a marked effect: filtering improved genome fraction and BUSCO completeness, trimming reduced low-quality artifacts, and correction benefited overlap-layout-consensus (OLC) assemblers but occasionally increased misassemblies in graph-based tools. Overall, assembler choice and preprocessing jointly determine accuracy, contiguity, and computational efficiency. These results provide a reproducible framework for selecting assembly pipelines in prokaryotic genomics, underscoring that no single assembler is universally optimal.
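
For readers unfamiliar with the contiguity metrics named in the abstract, the following minimal Python sketch shows how contig count, total length, N50, and GC content can be computed from a set of contig sequences. It is an illustration only and not the pipeline used in the study; the function name `assembly_stats` and the toy contigs are hypothetical.

```python
def assembly_stats(contigs):
    """Compute basic contiguity metrics for a list of contig sequences (strings).

    Hypothetical helper for illustration; not taken from the benchmarked study.
    """
    # Contig lengths, longest first
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the cumulative sum of the sorted
    # lengths first reaches at least half of the total assembly length
    n50 = 0
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    # GC content: percentage of G and C bases over all bases
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    gc_percent = 100.0 * gc / total if total else 0.0

    return {
        "contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "gc_percent": gc_percent,
    }


# Toy input (made up): three contigs of 100, 40, and 10 bases
toy_contigs = ["ATGC" * 25, "GGCC" * 10, "AT" * 5]
stats = assembly_stats(toy_contigs)
print(stats)
```

With the toy input, the longest contig alone covers more than half of the 150 total bases, so N50 equals its length (100). A single-contig bacterial assembly would likewise report an N50 equal to its total length.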
