SARITA: a large language model for generating the S1 subunit of the SARS-CoV-2 spike protein

Abstract

BACKGROUND: The COVID-19 pandemic caused over 776 million infections and 7 million deaths globally between December 2019 and November 2024. Since the emergence of the original Wuhan strain, SARS-CoV-2 has evolved into multiple variants, including Alpha, Delta, and Omicron, primarily through mutations in the Spike glycoprotein. The S1 subunit, which binds the human angiotensin-converting enzyme 2 (ACE2) receptor, mutates frequently and plays a key role in infectivity and immune escape, while the more conserved S2 subunit mediates membrane fusion. Anticipating future mutations is essential for guiding vaccine design and therapeutic strategies. Generative Large Language Models (LLMs) have shown promise in protein sequence modeling because they can produce realistic and functional synthetic sequences. Here, we introduce SARITA, a GPT-3-based LLM with up to 1.2 billion parameters, built by continual learning on top of the pretrained protein model RITA using 107 017 high-quality SARS-CoV-2 Spike sequences (collected up to March 1st 2021), to generate high-quality synthetic SARS-CoV-2 Spike S1 subunits.

RESULTS: SARITA generates realistic, full-length synthetic S1 subunits starting from a 14-amino-acid prompt. When evaluated on unseen sequences collected between March 2021 and November 2023, including major Variants of Concern (VOCs) such as Delta and Omicron and Variants of Interest such as Iota, SARITA outperforms baseline and state-of-the-art LLMs in terms of sequence quality, biological plausibility, and similarity to real-world viral evolution. SARITA generates high-quality sequences in over 97% of cases, with a markedly lower False Mutation Rate and higher similarity to real sequences (measured by PAM30 score and Levenshtein distance) than alternative approaches. It also accurately reproduces key mutations characteristic of future variants, such as L212I, R158L, T95P, and E406K, which were not present in the training data but emerged later in VOCs like Omicron and Delta. Structure-based analysis confirms the functional plausibility of these substitutions, with ΔΔG values within experimentally supported thresholds for ACE2 and antibody binding. Furthermore, SARITA anticipates immune-evasive mutations and accurately captures the positional and statistical distribution of mutations found in variants that emerged after March 1st 2021, highlighting its potential as a predictive tool for viral evolution.

CONCLUSION: These results indicate that SARITA has the potential to predict future SARS-CoV-2 S1 evolution, which may aid the development of adaptable vaccines and treatments.
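
To make two of the quantities named in the abstract concrete, the following is a minimal Python sketch of the Levenshtein distance between two amino-acid sequences and of the substitution notation used for mutations such as L212I (reference residue, 1-based position, new residue). It is an illustration only and not the authors' evaluation pipeline: the function names `levenshtein` and `substitutions` are hypothetical, the toy sequences are invented fragments rather than real Spike sequence, and the substitution helper assumes the two sequences have already been aligned to equal length.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences, counting single-residue
    insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def substitutions(reference: str, candidate: str) -> List[str]:
    """Label point substitutions as <ref residue><1-based position><new residue>.

    Assumption: the sequences are pre-aligned and of equal length, so
    insertions and deletions are not handled here.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [f"{r}{pos}{c}"
            for pos, (r, c) in enumerate(zip(reference, candidate), start=1)
            if r != c]


if __name__ == "__main__":
    # Toy fragments for illustration; not real Spike S1 sequence.
    reference = "MKTLLVA"
    generated = "MKTILVA"
    print(levenshtein(reference, generated))    # 1
    print(substitutions(reference, generated))  # ['L4I']
```

A lower Levenshtein distance to a real sequence corresponds to higher similarity. Scoring with a substitution matrix such as PAM30 additionally weights each residue exchange by how likely it is, which this sketch does not attempt.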
