From single-sequences to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2



Abstract

Protein language models (PLMs) capture features of protein three-dimensional structure from amino acid sequences alone, without requiring multiple sequence alignments (MSAs). Concepts of grammar and semantics borrowed from natural language have been suggested as a way to capture functional properties of proteins. Here, we investigate how these representations enable assessment of variation due to mutation. Applied to the SARS-CoV-2 spike protein via in silico deep mutational scanning (DMS), the PLM ESM-2 captures evolutionary constraints directly from sequence context, recapitulating what normally requires MSA data. Whereas other state-of-the-art methods require protein structures or multiple sequences for training, we show what can be accomplished using an unmodified pretrained PLM. We demonstrate that, applied to SARS-CoV-2 variants across the pandemic, ESM-2 representations encode the evolutionary history between variants, as well as the distinct nature of variants of concern upon their emergence, associated with shifts in receptor binding and antigenicity. ESM-2 likelihoods can also identify epistatic interactions among sites in the protein. Our results affirm that PLMs like ESM-2 are broadly useful for variant-effect prediction, including for unobserved changes. They can be applied to understand novel viral pathogens and, potentially, to any protein sequence, pathogen or otherwise.
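
To make the idea of an in silico deep mutational scan concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a scoring rule that is common for PLM-based variant-effect prediction: each single substitution is scored as the log-probability of the mutant residue minus that of the wild-type residue at the same position. The abstract does not state which scoring rule the paper uses, so this choice is an assumption. The function `toy_log_probs` is a hypothetical stand-in for a real model such as ESM-2; it ignores sequence context entirely and exists only so the example runs without downloading a model. The names `in_silico_dms` and `LogProbFn`, and the short example peptide, are likewise illustrative.

```python
import math
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# A scorer takes a sequence and returns one dict per position,
# mapping each amino acid to its log-probability at that position.
LogProbFn = Callable[[str], List[Dict[str, float]]]


def toy_log_probs(seq: str) -> List[Dict[str, float]]:
    """Hypothetical stand-in for a PLM: favours the residue already present.

    A real PLM would condition on the whole sequence; this toy does not.
    """
    out = []
    for res in seq:
        weights = {aa: (5.0 if aa == res else 1.0) for aa in AMINO_ACIDS}
        total = sum(weights.values())
        out.append({aa: math.log(w / total) for aa, w in weights.items()})
    return out


def in_silico_dms(seq: str, log_probs: LogProbFn) -> Dict[str, float]:
    """Score every single substitution as log p(mutant) - log p(wild type)."""
    per_pos = log_probs(seq)
    scores = {}
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                # Mutation labels use 1-based positions, e.g. "M1A".
                scores[f"{wt}{i + 1}{aa}"] = per_pos[i][aa] - per_pos[i][wt]
    return scores


# Assumed example input: an arbitrary short peptide, not the full spike protein.
scores = in_silico_dms("MFVFLVLLPL", toy_log_probs)
```

Under this scoring rule, a negative score means the model considers the substitution less likely than the wild-type residue. Swapping `toy_log_probs` for a function that wraps a pretrained PLM would leave `in_silico_dms` unchanged, which is the reason for passing the scorer in as an argument.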
