Run-length compressed metagenomic read classification with SMEM-finding and tagging



Abstract

Metagenomic read classification is a fundamental task in computational biology but remains challenging due to the scale and diversity of sequencing data. We present a run-length compressed BWT-based index using the move structure for efficient multi-class classification. Our method finds all super-maximal exact matches (SMEMs) of length ≥ L between a read and a reference and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs and their class identifiers into a single classification. We are the first to perform run-length compressed read classification using full rather than semi-SMEMs. We evaluated the method on long and short reads across two datasets: a large bacterial pan-genome with few classes and a smaller 16S rRNA gene database spanning thousands of genera. Our method outperforms SPUMONI 2 in accuracy and runtime while maintaining run-length compressed memory complexity, and it surpasses Cliffy in memory efficiency with comparable accuracy.
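To make the pipeline in the abstract concrete, the following is a minimal Python sketch of its three steps: find SMEMs of length ≥ L, attach one class identifier to each, and reduce them to a single label. It is not the authors' implementation. It uses naive substring search in place of the run-length compressed BWT and move structure, it takes the class of the first reference containing the match as a stand-in for the sampled tag array, and it assumes a length-weighted vote as the consensus rule, which the abstract does not specify. The names `find_smems` and `classify_read` and the toy references are hypothetical.

```python
def longest_match_end(read, start, refs):
    """Largest end such that read[start:end] occurs in some reference."""
    end = start
    while end < len(read) and any(read[start:end + 1] in seq for _, seq in refs):
        end += 1
    return end


def find_smems(read, refs, min_len):
    """Return (start, end, class_id) for every SMEM of length >= min_len.

    refs is a list of (class_id, sequence) pairs. A match starting at
    `start` is super-maximal exactly when it ends further right than the
    longest match starting one position earlier.
    """
    smems = []
    prev_end = 0
    for start in range(len(read)):
        end = longest_match_end(read, start, refs)
        if end > prev_end and end - start >= min_len:
            match = read[start:end]
            # Assumption: tag the SMEM with the first class that contains it.
            class_id = next(c for c, seq in refs if match in seq)
            smems.append((start, end, class_id))
        prev_end = end
    return smems


def classify_read(read, refs, min_len):
    """Assumed consensus rule: the class with the most matched bases wins."""
    totals = {}
    for start, end, class_id in find_smems(read, refs, min_len):
        totals[class_id] = totals.get(class_id, 0) + (end - start)
    return max(totals, key=totals.get) if totals else None


# Toy data, invented for illustration only.
toy_refs = [("A", "ACGTACGTTTGA"), ("B", "GGGCCCAATTGG")]
print(find_smems("ACGTTTGCCCAA", toy_refs, min_len=4))
print(classify_read("ACGTTTGCCCAA", toy_refs, min_len=4))
```

On the toy read, the sketch reports one SMEM per class and picks the class covering more bases. The naive search is quadratic or worse in the read length, so it only illustrates the definitions; the point of the compressed index described above is to answer the same queries in memory proportional to the number of BWT runs.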
