Sequence based virus host prediction: a curated dataset and generalizable framework for training artificial intelligence to identify viruses of humans

Abstract

Understanding how viruses evolve to infect specific hosts is crucial for reliably predicting and preventing emerging virus diseases. While genomic signatures of host adaptation exist within virus sequences, the systematic approaches needed to identify these patterns across the virome are in their infancy and will require well-curated datasets as a foundation for rigorous analysis. Although repositories like GenBank contain extensive genomic data, the critical metadata needed for artificial intelligence (AI) model training often exists in non-standardized, heterogeneous formats, which complicates preprocessing and hinders large-scale analysis across diverse virus families. To overcome these challenges and enable systematic investigation of genomic patterns, we built a dataset of 58 046 virus genomes spanning 15 families that represent diverse genome architectures and host ranges. Each sequence was classified for human host compatibility based on isolation source, creating a foundation for studying sequence-level determinants of host compatibility. We demonstrate the utility of this resource by training neural networks on k-mer frequency patterns, and we provide multiple analytical frameworks that enable generation of testable hypotheses about the genomic determinants of human host compatibility. Finally, we apply the model to SARS-CoV-2 genomes to assess its ability to identify host-compatibility signals and to characterize lineage- and time-associated shifts in k-mer usage. Using these neural networks to analyze k-mer frequency patterns across viral taxa, we found that sequence-based features can accurately predict human host compatibility and can even generalize to some members of virus families not included in the training set. This variation in performance across virus families suggests different forms of host-specific signal in virus genomes, likely reflecting distinct evolutionary pressures across virus groups. This resource can accelerate exploration of pan-virome lexicographical patterns that define host compatibility and will provide new avenues for identifying the genetic determinants of virus host range evolution.
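
As an illustration of the kind of input feature the abstract refers to, the following is a minimal sketch of turning one genome sequence into a k-mer frequency vector. It is not the authors' pipeline: the function name `kmer_frequencies`, the choice of k = 3, the restriction to the DNA alphabet `ACGT`, the skipping of windows that contain ambiguous bases, and the normalization to relative frequencies are all assumptions made here for illustration. The abstract does not state the value of k or any preprocessing details.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return relative k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration only; k, the ACGT alphabet, and
    the handling of ambiguous bases are assumptions, not the paper's method.
    """
    seq = sequence.upper()
    # All 4**k possible k-mers over A, C, G, T, in a stable order so that
    # every genome maps to a vector with the same layout.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made entirely of A/C/G/T contribute to the total, so
    # windows containing ambiguous bases such as N are effectively skipped.
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTACGTNNACGT", k=2)
    print(len(vector), round(sum(vector), 6))
```

A vector like this has a fixed length of 4^k regardless of genome size, which is what makes it usable as input to a neural network across virus families with very different genome lengths.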
