Deep learning models for unbiased sequence-based PPI prediction plateau at an accuracy of 0.65

用于无偏的基于序列的PPI预测的深度学习模型,其准确率在0.65左右达到平台期。

阅读:2

Abstract

MOTIVATION: As most proteins interact with other proteins to perform their respective functions, methods to computationally predict these interactions have been developed. However, flawed evaluation schemes and data leakage in test sets have obscured the fact that sequence-based protein-protein interaction (PPI) prediction is still an open problem. Recently, methods achieving better-than-random performance on leakage-reduced PPI data have been proposed. RESULTS: Here, we show that the use of ESM-2 protein embeddings explains this performance gain irrespective of model architecture. We compared the performance of models with varying complexity, per-protein, and per-token embeddings, as well as the influence of self- or cross-attention, where all models plateaued at an accuracy of 0.65. Moreover, we show that the tested sequence-based models cannot implicitly learn a contact map as an intermediate layer. These results imply that other input types, such as structure, might be necessary for producing reliable PPI predictions. AVAILABILITY AND IMPLEMENTATION: All code for models and execution of the models is available at https://github.com/daisybio/PPI_prediction_study. Python version 3.8.18 and PyTorch version 2.1.1 were used for this study. The environment containing the versions of all other packages used can be found in the GitHub repository. The used data are available at https://doi.org/10.6084/m9.figshare.21591618.v3.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。