What We Talk About When We Talk About Microbial Species

Abstract

Genome annotation, alignment and phylogenetics are at the centre of most studies in evolutionary genomics. These techniques function best when rooted in prior work. Genes are mined from new genomes using evidence from old gene models. These genomes are aligned to well-worn references to create matrices for tree reconstruction. Trees are often populated with well-characterised genomes to add context to the newly sequenced. Genome inference traces a line back to model organisms, yoking the analysis of new genomes to layers of previous knowledge. Here, we present an alternative approach that uses unannotated and unaligned sequence to understand the information diversity of sequence ensembles. Any set of genomes can comprise our sequence ensemble. In a pandemic context, a sequence ensemble might be clinically isolated strains from one day. In a systematic context, a sequence ensemble could be the pangenome available for a clade. The normal bioinformatics playbook would have us align. But we instead compress. A sequence ensemble that compresses easily contains lower information diversity. For pandemics, we can use curves of information diversity to trace genomic novelty and monitor selective sweeps in new strains. For systematics, we can calculate compressibility quickly across all known bacterial taxa, levelling the criteria for species across clades. If we tolerate data loss, we can go one step further and capture structural evolution as we compress. Our approach sacrifices a lot. We skip many of the products of modern bioinformatics like variation anchored to known genes or genome alignment to prescribed references or pangenome graphs. But we gain speed, breadth and the ability to rapidly respond to novelty.
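The idea that an ensemble which compresses easily carries lower information diversity can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name a compressor, so Python's standard zlib stands in for it, the helper names (compression_ratio, random_genome) are hypothetical, and the "genomes" are short random strings rather than real data.

```python
import random
import zlib


def compression_ratio(sequences):
    """Compressed size divided by raw size for an ensemble of sequences.

    A lower ratio means the ensemble is more redundant, which we read here
    as lower information diversity. zlib is a stand-in compressor, not
    necessarily the one used by the authors.
    """
    raw = "\n".join(sequences).encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


# Toy data: seeded so the example is repeatable.
rng = random.Random(0)


def random_genome(length):
    """Hypothetical helper: a random string over the DNA alphabet."""
    return "".join(rng.choice("ACGT") for _ in range(length))


base = random_genome(5000)

# Eight identical copies: a stand-in for a clonal ensemble.
clonal = [base] * 8

# Eight unrelated sequences: a stand-in for a diverse ensemble.
diverse = [random_genome(5000) for _ in range(8)]

print(f"clonal ensemble ratio:  {compression_ratio(clonal):.3f}")
print(f"diverse ensemble ratio: {compression_ratio(diverse):.3f}")
```

The clonal ensemble should yield the smaller ratio, because every copy after the first can be encoded as back-references to earlier text. Note that zlib only looks back 32 KiB, so it cannot detect redundancy between real genomes that are millions of bases long; a compressor with a much larger window would be needed for actual data.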
