PharaCon: a new framework for identifying bacteriophages via conditional representation learning

PharaCon:一种通过条件表征学习识别噬菌体的新框架

阅读:1

Abstract

MOTIVATION: Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples. RESULTS: To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model's input. Additionally, we introduced a new fine-tuning scheme that enables the conditional BERT to be effectively utilized for classification tasks. This framework allows the BERT model to acquire label-specific contextual representations from mixed sequence data during pre-training and applies the conditional BERT as a classifier during fine-tuning, and we named the fine-tuned model as PharaCon. We evaluated PharaCon against several existing methods on both simulated sequence datasets and real metagenomic contig datasets. The results demonstrate PharaCon's effectiveness and efficiency in phage identification, highlighting the advantages of incorporating label information during both pre-training and fine-tuning. AVAILABILITY AND IMPLEMENTATION: The source code and associated data can be accessed at https://github.com/Celestial-Bai/PharaCon.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。