Abstract
Influenza A virus (IAV) poses a persistent threat to global public health due to its broad host adaptability, frequent antigenic variation, and potential for cross-species transmission. Accurate identification of IAV subtypes is essential for effective epidemic surveillance and precise disease control. Here, we present Influ-BERT, a domain-adaptive pretrained model based on the Transformer architecture. Optimized from DNABERT-2, Influ-BERT was developed using a dedicated corpus of ~900,000 influenza genome sequences. We constructed a custom Byte Pair Encoding tokenizer and employed a two-stage training strategy of domain-adaptive pretraining followed by task-specific fine-tuning. This approach significantly enhanced identification performance for IAV subtypes. Experimental results demonstrate that Influ-BERT outperforms both traditional machine learning approaches and general genomic language models, such as DNABERT-2, Nucleotide Transformer, and MegaDNA, in the task of IAV subtype identification. The model consistently achieved F1-scores above 97% across five subtype classification tasks and exhibited stable performance gains for subtypes that are underrepresented in sequencing data, including H5N8, H1N2, and H13N6. Beyond subtype identification, Influ-BERT was successfully applied to additional tasks, including respiratory virus identification, IAV pathogenicity prediction, and identification of IAV genomic fragments and functional genes, demonstrating robust performance throughout. Further interpretability analysis using sliding-window perturbation confirmed that the model focuses on biologically significant genomic regions, providing insight into its improved predictive capability.