Abstract
Understanding how viruses evolve to infect specific hosts is crucial for reliably predicting and preventing emerging viral diseases. Although genomic signatures of host adaptation exist within virus sequences, systematic approaches for identifying these patterns across the virome are in their infancy and require well-curated datasets as a foundation for rigorous analysis. Repositories such as GenBank contain extensive genomic data, but the metadata needed to train artificial intelligence (AI) models often exists in non-standardized, heterogeneous formats, which complicates preprocessing and hinders large-scale analysis across diverse virus families. To overcome these challenges and enable systematic investigation of genomic patterns, we built a dataset of 58 046 virus genomes spanning 15 families that represent diverse genome architectures and host ranges. Each sequence was classified for human host compatibility based on its isolation source, creating a foundation for studying sequence-level determinants of host compatibility. We demonstrate the utility of this resource by training neural networks on k-mer frequency patterns, and we provide multiple analytical frameworks for generating testable hypotheses about the genomic determinants of human host compatibility. Using these networks to analyze k-mer frequency patterns across viral taxa, we found that sequence-based features can accurately predict human host compatibility and can even generalize to some members of virus families not included in the training set. This variation in generalization performance suggests that host-specific signal takes different forms in virus genomes, likely reflecting distinct evolutionary pressures across virus groups. Finally, we apply the model to SARS-CoV-2 genomes to assess its ability to identify host-compatibility signals and to characterize lineage- and time-associated shifts in k-mer usage.
This resource can accelerate exploration of the pan-virome lexicographical patterns that define host compatibility and provide new avenues for identifying the genetic determinants of virus host range evolution.
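As an illustration of the kind of k-mer frequency features mentioned above, the following is a minimal sketch of how a genome sequence could be turned into a fixed-length frequency vector. The function name `kmer_frequencies`, the choice of k, the restriction to the A/C/G/T alphabet, and the skipping of windows that contain ambiguous bases are all assumptions made for this example; they are not taken from the pipeline described here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies in a fixed A/C/G/T order.

    Hypothetical helper: windows containing characters other than
    A, C, G, or T (for example N) are ignored, which is an assumption.
    """
    seq = sequence.upper()
    # Enumerate all 4**k possible k-mers so every genome maps to a
    # vector with the same length and the same feature order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalize by the number of valid (unambiguous) windows only.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTACGT", k=2)
    print(len(vector), round(sum(vector), 6))
```

Vectors of this form have the same length for every genome regardless of genome size, which is what makes them convenient as input to a neural network classifier.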