PepBERT: Lightweight language models for bioactive peptide representation

PepBERT:用于生物活性肽表示的轻量级语言模型

阅读:1

Abstract

Protein language models (pLMs) have been widely adopted for various protein and peptide-related downstream tasks and demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Cluster (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model-PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters)-were pretrained from scratch using four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior to or comparable to the benchmark model, ESM-2 with 7.5 million parameters, on 8 out of 9 datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications. By enabling more accurate representation and prediction of bioactive peptides, PepBERT can accelerate the discovery of food-derived bioactive peptides with health-promoting properties, supporting the development of sustainable functional foods and value-added utilization of food processing by-products. The datasets, source codes, pretrained models, and tutorials for the usage of PepBERT are available at https://github.com/dzjxzyd/PepBERT.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。