KaMLs for Predicting Protein pK(a) Values and Ionization States: Are Trees All You Need?



Abstract

Despite its importance in understanding biology and in computer-aided drug discovery, the accurate prediction of protein ionization states remains a formidable challenge. Physics-based approaches struggle to capture the small, competing contributions in the complex protein environment, while machine learning (ML) is hampered by the scarcity of experimental data. Here, we report the development of pK(a) ML (KaML) models based on decision trees and graph attention networks (GAT), exploiting physicochemical understanding and a new experimental pK(a) database (PKAD-3) enriched with highly shifted pK(a)'s. KaML-CBtree significantly outperforms the current state of the art in predicting pK(a) values and ionization states across all six titratable amino acids, notably achieving accurate predictions for deprotonated cysteines and lysines, a blind spot in previous models. The superior performance of KaMLs is achieved in part through several innovations, including the separate treatment of acids and bases, data augmentation using AlphaFold structures, and model pretraining on a theoretical pK(a) database. We also introduce the classification of protonation states as a metric for evaluating pK(a) prediction models. A meta-feature analysis suggests a possible reason why the lightweight tree model outperforms the more complex deep learning GAT. We release an end-to-end pK(a) predictor based on KaML-CBtree and the new PKAD-3 database, which facilitates a variety of applications and provides the foundation for further advances in protein electrostatics research.
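
To make the idea of scoring a pK(a) model by protonation-state classification concrete, here is a minimal Python sketch. It is not the authors' evaluation protocol: the reference pH of 7.0, the rule "protonated if pK(a) > pH", the function names, and the numeric values in the usage lines are all assumptions chosen for illustration. The only established relation it relies on is the Henderson-Hasselbalch equation, which gives the protonated fraction of a site as 1 / (1 + 10^(pH - pK(a))).

```python
def protonated_fraction(pka, ph=7.0):
    """Henderson-Hasselbalch: fraction of a titratable site that is protonated."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def is_protonated(pka, ph=7.0):
    """Assumed labeling rule: a site counts as protonated when pKa > pH."""
    return pka > ph


def classification_scores(exp_pkas, pred_pkas, ph=7.0):
    """Compare predicted and experimental protonation states at a given pH.

    Returns confusion counts plus precision and recall, with "protonated"
    treated as the positive class (an arbitrary choice for this sketch).
    """
    tp = fp = fn = tn = 0
    for exp, pred in zip(exp_pkas, pred_pkas):
        truth, guess = is_protonated(exp, ph), is_protonated(pred, ph)
        if truth and guess:
            tp += 1
        elif guess:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Made-up pKa values, for illustration only (not taken from PKAD-3).
experimental = [10.4, 6.5, 8.0, 5.9]
predicted = [9.8, 7.4, 8.3, 6.2]
scores = classification_scores(experimental, predicted)
```

In this toy case the second site has a pK(a) error of under one unit, yet the predicted protonation state at pH 7 is wrong. That is the kind of failure a state-based metric exposes and an average pK(a) error can hide.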
