ProkBERT PhaStyle: accurate phage lifestyle prediction with pretrained genomic language models

Abstract

MOTIVATION: Phage lifestyle prediction, i.e. classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome or virome assemblies are often fragmented, and the diversity of environmental phages remains poorly characterized. Current computational approaches often rely on database comparisons, and keeping those databases up to date requires significant effort and expertise. We propose using genomic language models (LMs) for phage lifestyle classification, which allows efficient analysis directly from nucleotide sequences without sophisticated preprocessing pipelines or manually curated databases. We trained three genomic LMs (DNABERT-2, Nucleotide Transformer, and ProkBERT) on datasets of short, fragmented sequences and compared them with dedicated phage lifestyle prediction methods in terms of accuracy, prediction speed, and generalization capability.

RESULTS: ProkBERT PhaStyle achieves accuracy comparable to, and in many cases higher than, state-of-the-art models across various scenarios. In our benchmarks it generalizes to unseen data, accurately classifies phages from extreme environments, and offers high inference speed.

AVAILABILITY AND IMPLEMENTATION: Genomic LMs offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. ProkBERT PhaStyle's simplicity, speed, and performance suggest its utility in various ecological and clinical applications.
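To make the idea of classifying short, fragmented sequences concrete, the following is a minimal sketch and not the authors' pipeline. It cuts a contig into fixed-length windows, labels each window, and takes a majority vote. The function names, the window length and stride, the voting step, and the GC-based `toy_classifier` are all assumptions introduced here for illustration; the toy classifier has no biological meaning and merely stands in for a fine-tuned genomic LM.

```python
from collections import Counter
from typing import Callable, List


def fragment_sequence(seq: str, length: int, stride: int) -> List[str]:
    """Cut a nucleotide sequence into windows of `length` bases, one every `stride` bases."""
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    seq = seq.upper()
    if len(seq) <= length:
        # Sequences no longer than one window are returned whole (or not at all if empty).
        return [seq] if seq else []
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, stride)]


def classify_contig(
    seq: str,
    classify_fragment: Callable[[str], str],
    length: int = 512,   # arbitrary illustrative value, not taken from the paper
    stride: int = 256,   # arbitrary illustrative value, not taken from the paper
) -> str:
    """Label every fragment and return the most common label.

    Majority voting is an assumption of this sketch; ties resolve to the
    label that was seen first.
    """
    fragments = fragment_sequence(seq, length, stride)
    if not fragments:
        raise ValueError("empty sequence")
    votes = Counter(classify_fragment(fragment) for fragment in fragments)
    return votes.most_common(1)[0][0]


def toy_classifier(fragment: str) -> str:
    """Hypothetical placeholder for a real model: labels by GC content only."""
    gc = fragment.count("G") + fragment.count("C")
    return "temperate" if gc > len(fragment) / 2 else "virulent"


if __name__ == "__main__":
    print(fragment_sequence("ACGTACGTAC", length=4, stride=2))
    print(classify_contig("GGGGCCCCAT", toy_classifier, length=4, stride=2))
```

In practice `classify_fragment` would wrap a call to whichever fine-tuned model is used; passing it in as a callable keeps the fragmentation and voting logic independent of any particular model.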
