Abstract
Identifying viruses in tumor transcriptomes helps to unravel the potential role of viruses in oncogenesis and tumor progression. Most current tools for virus identification in RNA-seq data rely on sequence alignment, whose performance is constrained by the rapid mutation, high sequence divergence, and incompleteness of viral genomes. In this study, we develop ViTrace, which detects viral sequences in human transcriptomic data using a hybrid language representation learning model that integrates DNA context, positional relationships, and amino acid coding information. Although ViTrace is trained on only 13 species from 7 genera, it achieves 86.39% recall on 1179 virus strains absent from the training set, which represent 935 species from 167 genera across 10 phyla. Applied to single-cell RNA-seq data from esophageal and oropharyngeal squamous cell carcinomas, the model reveals tumor-, cell-, and patient-specific viral colonization patterns, uncovering both known and previously unreported viruses. Overall, ViTrace provides a scalable framework for guiding precision oncology and facilitating the discovery of previously uncharacterized tumor-associated viruses.