The effect of taxonomic, host-dependent features and sample bias on virus host prediction using machine learning and short sequence k-mers

Abstract

Metaviromic studies of potential emerging infection reservoirs have led to the discovery of many novel viruses. Since metaviromes contain viruses from the target host, its food, or other sources, fast and robust approaches are needed to predict the hosts of unknown viruses from their genome data. Four machine learning algorithms (random forest, two gradient boosting machines, and a support vector machine) were used here to predict the hosts of RNA viruses that infect mammals, insects, and plants. The prediction efficiency depended largely on the dataset composition. In the more challenging task of predicting hosts of unknown virus genera, a median weighted F1-score of 0.79 was achieved using a support vector machine and 4-mer frequencies, a notable improvement over baseline methods (median weighted F1-scores of 0.68 for the homology-based tBLASTx and 0.72 for ML trained on mono-, di-, and trinucleotide frequencies). More complicated features and feature combinations gave worse results. When predicting hosts from short virus sequence fragments, quality decreased, but training on same-length fragments instead of full genomes consistently improved prediction quality. Therefore, short k-mers carry sufficient information to predict the hosts of novel RNA virus genera. This algorithm can be useful for rapid analysis of metaviromic data to highlight potential biological threats.
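
To make the feature representation concrete, the following is a minimal sketch of how 4-mer frequency vectors and same-length training fragments might be computed. It is not the authors' code: the function names `kmer_frequencies` and `fragments`, the mapping of U to T, the skipping of windows that contain ambiguous bases, and the non-overlapping fragment step are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: RNA is handled by mapping U to T


def kmer_frequencies(sequence, k=4):
    """Return relative frequencies of all 4**k k-mers, in lexicographic order.

    Windows containing characters outside ACGT (e.g. N) are skipped;
    this handling of ambiguous bases is an assumption of this sketch.
    """
    seq = sequence.upper().replace("U", "T")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        # No valid k-mer found (sequence too short or fully ambiguous).
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def fragments(sequence, length, step=None):
    """Cut a genome into fixed-length fragments (non-overlapping by default)."""
    step = step or length
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]


if __name__ == "__main__":
    genome = "ACGUACGUNACGUACGGAUC"  # toy sequence, not real data
    features = [kmer_frequencies(f, k=4) for f in fragments(genome, 10)]
    print(len(features), "fragments,", len(features[0]), "features each")
```

Each resulting vector has 256 entries for k = 4 and could be passed to any standard classifier, such as a support vector machine. Computing the features on fragments of the same length as the query sequences mirrors the abstract's observation that same-length training fragments improved prediction quality.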
