Abstract
Artificial intelligence has emerged as a pivotal tool for predicting the structures of biomolecular assemblies. The AlphaFold (AF) framework, a landmark in this field, has spurred the development and refinement of a variety of folding methods. This revolution was further fueled by the release of the substantially updated architecture, AlphaFold3 (AF3), which is reported to be the state-of-the-art prediction model for protein-protein complexes. Here, we evaluate the quality of this deep-learning tool and its predecessors on an extensive dataset of protein-protein complexes and assess whether their predictions could replace experimentally determined structures in critical modelling and screening tasks, e.g., prediction of interfacial contacts, hot-spot identification and binding affinity calculation. Structurally, although the prediction accuracy of these deep-learning tools appears high according to quality metrics such as DockQ and RMSD, major deviations from experiment are observed in the compactness of the complex, the intermolecular directional polar interactions (more than 2 hydrogen bonds are incorrectly predicted) and the interfacial contacts (especially the apolar-apolar packing for AF3). This calls for caution when applying AF predictions to understand the key interactions stabilizing protein-protein complexes. Interestingly, while the latest AF3 is clearly more accurate than its predecessors in direct prediction-experiment comparisons, the quality of the structural ensembles sampled in molecular simulations drops severely after simulation relaxation. Such deterioration could be attributed to several factors, e.g., the instability of the predicted intermolecular packing or the inaccuracy of force fields. Consequently, the structural ensembles sampled from all predictions still differ noticeably from the experimental reference.
Based on the simulation trajectories, we present an example of using AF predictions in a practical virtual screening task. A physics-based hot-spot scan is conducted with the alanine scanning with generalized Born and interaction entropy method, which provides the mutation-induced affinity variations of protein-protein complexes. Direct comparisons between computed affinity variations and experimental measurements reveal that calculations starting from experimental structures outperform those starting from predicted structures, regardless of the AF version. For hot-spot identification, the prediction quality obtained with experimental structures is likewise better than that obtained with predicted structures. More interestingly, we compare the structural deviations of the predicted structures against the quality metrics of the affinity calculations and observe little correlation, which suggests that the quality of thermodynamic calculations cannot be directly inferred from the quality of structure prediction. These observations provide a unique simulation perspective on the AF family of structure prediction tools and, more importantly, practical guidance on their applicability in the molecular modelling of protein-protein assemblies (e.g., protein design).
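As an illustration of how computed and experimental affinity variations might be compared, the following minimal Python sketch evaluates a Pearson correlation, a root-mean-square error and a simple hot-spot classification. The function names, the 2.0 kcal/mol hot-spot threshold and the toy numbers are assumptions made for this sketch only; they are not taken from the present work and do not represent its actual analysis protocol.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(vx * vy)


def rmse(x, y):
    """Root-mean-square error between computed and reference values."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def hot_spot_metrics(computed, experimental, threshold=2.0):
    """Precision and recall of hot-spot identification.

    A residue is counted as a hot spot when its affinity variation upon
    alanine mutation is >= threshold (kcal/mol). The default of 2.0 is
    an assumed cutoff for this sketch, not a value from the source.
    """
    tp = fp = fn = 0
    for c, e in zip(computed, experimental):
        predicted, actual = c >= threshold, e >= threshold
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy affinity variations in kcal/mol, purely illustrative.
    computed = [2.5, 0.5, 3.0, 1.0]
    experimental = [2.2, 2.4, 0.3, 0.1]
    print("r =", pearson_r(computed, experimental))
    print("RMSE =", rmse(computed, experimental))
    print("precision, recall =", hot_spot_metrics(computed, experimental))
```

The same correlation function could be applied to pairs of per-complex structural deviations (e.g., RMSD) and affinity error metrics to examine whether structural accuracy tracks the accuracy of the thermodynamic calculation.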