Rapid discovery of new-to-nature protein domains by novelty-first forcing of language models



Abstract

Estimates of the existence and extent of physically permissible protein structures beyond those found in nature vary wildly. As predicted-structure databases swell thanks to abundant sequence data, and generative protein design models concurrently grow in their power to propose new aspects of protein structure, these questions, along with the question of which essential features (e.g., stability, function, robustness) distinguish natural domains from novel ones, have been cast in even sharper relief. We demonstrate that protein language models (PLMs) can simultaneously innovate in sequence and structure to suggest new-to-nature protein domains displaying supersecondary and tertiary elements outside of categorized CATH superfamilies. Developing and applying two orthogonal processes for obtaining compact and globular folds from PLMs without bias from other physicochemical or functional constraints, we discover putative novel domains that emerge alongside known natural ones at rates far exceeding those obtainable by bioinformatic mining of structure databases. Computational characterization of these domain candidates indicates that many exhibit reasonable folding thermodynamics and kinetics, suggesting that natural protein structure-space is far from biophysically complete. These results point away from stability as the definitive selective force behind the observed landscape of real protein folds, and imply that many unrealized folds may be equally consistent with the structural rules of protein-based life.
