Abstract
Phages (bacteriophages) play a critical role in microbial communities, and accurately predicting their hosts is essential for understanding the dynamics of these viruses and their impact on bacterial populations. In phage host classification, feature extraction is a critical step that directly affects prediction accuracy. Among feature extraction techniques, k-mers are the most commonly employed. Although many k-mer-based methods have been proposed, they typically use only k-mer frequencies as features. However, when k-mers have identical frequencies, frequency information alone does little to distinguish them. To address this limitation, we propose PhageCGRNet, a novel method that uses both the frequency and the positional information of k-mers. We represent each genome sequence as a three-dimensional matrix containing k-mer frequency features and positional features, and then use a convolutional neural network (CNN) to predict the host category. Specifically, we combine the k-mer frequencies extracted directly from the sequences with the k-mer positional information obtained using Chaos Game Representation (CGR) to construct the feature matrix, which serves as the input to the CNN. We conducted experiments on two benchmark datasets and compared PhageCGRNet with existing advanced methods for phage host classification. The experimental results show that PhageCGRNet achieves higher accuracy than other state-of-the-art methods at both the species and genus levels on both datasets.
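To make the idea of pairing k-mer frequency with a CGR-derived position concrete, the following is a minimal sketch and not the PhageCGRNet implementation. The corner assignment, the function names (`cgr_point`, `kmer_features`, `feature_matrix`), and the choice of three channels (relative frequency, CGR x, CGR y) are all assumptions made for illustration; the actual construction of the feature matrix may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One common CGR corner convention (an assumption; other assignments exist).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_point(kmer: str) -> Tuple[float, float]:
    """Play the chaos game for one k-mer, starting from the centre of the unit square."""
    x, y = 0.5, 0.5
    for base in kmer:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway towards the base's corner
    return x, y


def kmer_features(seq: str, k: int) -> Dict[str, Tuple[int, float, float]]:
    """Map each observed k-mer to (count, CGR x, CGR y)."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if all(b in CORNERS for b in seq[i:i + k])  # skip k-mers with ambiguous bases
    )
    return {kmer: (c, *cgr_point(kmer)) for kmer, c in counts.items()}


def feature_matrix(seq: str, k: int) -> List[List[List[float]]]:
    """Stack relative frequency and CGR coordinates into a 3 x 2^k x 2^k nested list.

    Hypothetical channel layout: 0 = relative frequency, 1 = CGR x, 2 = CGR y.
    """
    n = 2 ** k
    feats = kmer_features(seq, k)
    total = sum(c for c, _, _ in feats.values()) or 1  # avoid division by zero
    grid = [[[0.0] * n for _ in range(n)] for _ in range(3)]
    for count, x, y in feats.values():
        row, col = int(y * n), int(x * n)  # each k-mer falls in exactly one grid cell
        grid[0][row][col] = count / total
        grid[1][row][col] = x
        grid[2][row][col] = y
    return grid
```

In this sketch, two k-mers with the same count still occupy different cells and carry different coordinates, so a convolutional model receiving the stacked grid can tell them apart. For example, `feature_matrix("ACGTACGT", 2)` returns three 4 x 4 channels in which only the cells for AC, CG, GT, and TA are non-zero.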