Abstract
The PDBBind database has been widely used for the computational prediction of protein-protein binding affinities. While the accuracy of the PDBBind-curated equilibrium dissociation constants (KD) has been reported for the protein-ligand subset of the database, it has not been reported for the protein-protein subset. Here, we present a detailed manual analysis of the subset of PDBBind records whose primary publications are available in PubMed Central Open Access, and find that ~19% of these records have KD values that are not supported by their primary publications. We then evaluated the impact of these putative curation errors on the machine learning-based prediction of KD from experimental protein-protein 3D structures. Correcting the curation errors improved the Pearson correlation coefficient between measured and random forest-predicted log10(KD) values by ~8 percentage points. This finding underscores the importance of dataset accuracy for computational modelling and highlights the need for more stringent curation processes when extracting information from the scientific literature.
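As a rough illustration of the evaluation metric named above, the following Python sketch computes a Pearson correlation coefficient between measured and predicted log10(KD) values, with and without a single simulated curation error. All numbers are hypothetical and chosen only for illustration; the function names (`pearson_r`, `log10_kd`) are assumptions, the "predicted" values are hard-coded stand-ins rather than the output of a random forest, and the sketch is not the pipeline used in this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log10_kd(kd_molar):
    """Convert a dissociation constant in mol/L to log10(KD)."""
    return math.log10(kd_molar)


# Hypothetical model predictions of log10(KD) for five complexes.
predicted_log_kd = [-9.2, -8.1, -6.8, -6.1, -5.0]

# Hypothetical measured KD values (mol/L) as reported in primary publications.
correct_kd = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]

# Simulated curation error: the third record is off by three orders of
# magnitude, as would happen if a nanomolar value were entered as micromolar.
miscurated_kd = list(correct_kd)
miscurated_kd[2] = 1e-4

r_correct = pearson_r([log10_kd(k) for k in correct_kd], predicted_log_kd)
r_miscurated = pearson_r([log10_kd(k) for k in miscurated_kd], predicted_log_kd)

print(f"Pearson r against corrected labels:  {r_correct:.3f}")
print(f"Pearson r against miscurated labels: {r_miscurated:.3f}")
```

In this toy setting the same predictions correlate less well with the miscurated labels than with the corrected ones, because a unit error displaces log10(KD) by a whole number of log units.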