Abstract
MOTIVATION: Phage lifestyle prediction, i.e., classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome or virome assemblies are often fragmented, and the diversity of environmental phages remains poorly characterized. Current computational approaches often rely on database comparisons, and keeping those databases up to date requires significant effort and expertise. We propose using genomic language models (LMs) for phage lifestyle classification, which allows efficient analysis directly from nucleotide sequences without sophisticated preprocessing pipelines or manually curated databases. We trained three genomic LMs (DNABERT-2, Nucleotide Transformer, and ProkBERT) on datasets of short, fragmented sequences. We then compared these models with dedicated phage lifestyle prediction methods in terms of accuracy, prediction speed, and generalization capability.

RESULTS: ProkBERT PhaStyle achieves accuracy comparable to, and in many cases higher than, state-of-the-art models across various scenarios. It generalizes to unseen data in our benchmarks, accurately classifies phages from extreme environments, and offers high inference speed.

AVAILABILITY AND IMPLEMENTATION: Genomic LMs offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. ProkBERT PhaStyle's simplicity, speed, and performance suggest its utility in various ecological and clinical applications.