Influ-BERT: a domain-adaptive genomic language model for advancing influenza A virus research

Abstract

Influenza A virus (IAV) poses a persistent threat to global public health because of its broad host adaptability, frequent antigenic variation, and potential for cross-species transmission. Accurate identification of IAV subtypes is essential for effective epidemic surveillance and precise disease control. Here, we present Influ-BERT, a domain-adaptive pretrained model based on the Transformer architecture. Adapted from DNABERT-2, Influ-BERT was developed on a dedicated corpus of approximately 900,000 influenza genome sequences. We constructed a custom Byte Pair Encoding (BPE) tokenizer and employed a two-stage training strategy: domain-adaptive pretraining followed by task-specific fine-tuning. This approach significantly improved IAV subtype identification. Experimental results show that Influ-BERT outperforms both traditional machine learning approaches and general genomic language models, such as DNABERT-2, Nucleotide Transformer, and MegaDNA, on IAV subtype identification. The model consistently achieved F1-scores above 97% across five subtype classification tasks and delivered stable performance gains for subtypes that are underrepresented in sequencing data, including H5N8, H1N2, and H13N6. Beyond subtype identification, Influ-BERT was applied to additional tasks, including respiratory virus identification, IAV pathogenicity prediction, and identification of IAV genomic fragments and functional genes, and performed robustly on all of them. Interpretability analysis using sliding-window perturbation further confirmed that the model focuses on biologically significant genomic regions, which helps explain its improved predictive capability.
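The abstract does not describe how the sliding-window perturbation was carried out, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that each window is masked with the ambiguity code "N" and that importance is measured as the drop in a model score. The names `sliding_window_importance`, `score_fn`, and `toy_score` are hypothetical; `toy_score` stands in for a real classifier's predicted probability.

```python
from typing import Callable, List, Tuple


def sliding_window_importance(
    seq: str,
    score_fn: Callable[[str], float],
    window: int = 6,
    stride: int = 3,
    mask_char: str = "N",
) -> List[Tuple[int, int, float]]:
    """Mask one window at a time and record how much the score drops.

    Returns a list of (start, end, drop) tuples, where
    drop = score(original sequence) - score(sequence with window masked).
    """
    baseline = score_fn(seq)
    results = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, stride):
        end = min(start + window, len(seq))
        # Replace the window with the mask character, leaving the rest intact.
        perturbed = seq[:start] + mask_char * (end - start) + seq[end:]
        results.append((start, end, baseline - score_fn(perturbed)))
    return results


def toy_score(seq: str) -> float:
    # Hypothetical stand-in for a model: 1.0 if a fixed motif is present, else 0.0.
    return 1.0 if "ATGGAG" in seq else 0.0


if __name__ == "__main__":
    sequence = "CCCCCCATGGAGCCCCCC"
    for start, end, drop in sliding_window_importance(sequence, toy_score):
        print(f"window [{start}, {end}): score drop = {drop}")
```

Windows with a large score drop mark regions the model depends on. In practice, `score_fn` would wrap the fine-tuned classifier and return the probability of the predicted subtype.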
