Abstract
Protein-carbohydrate interactions play an important role in many biological processes and functions, such as inflammation, signal transduction, and cell adhesion. In this paper, we build a deep-learning model to predict non-covalent protein-carbohydrate binding sites. Experimental approaches for identifying these sites are expensive, so computational tools are necessary. We explored several sequence-based and structural features, and we also leveraged protein language model embeddings. We analyzed different architectures and selected the most suitable one for our final prediction model, DeepCPBSite. DeepCPBSite is an ensemble of three models built on the ResNet+FNN architecture, each trained with a different class-imbalance strategy: random undersampling, weighted oversampling, or class-weighted loss. We constructed separate datasets from three sources: RCSB, UniProt, and CASP. We also compared the structural features extracted from structures predicted by AlphaFold and ESMFold in the context of our prediction tasks. We employed three different feature selection techniques and, after categorizing the proteins by organism, performed a SHAP (SHapley Additive exPlanations) analysis on the structural features. DeepCPBSite achieved 78.7% balanced accuracy and 59.6% sensitivity on the TS53 set, outperforming the second-best competitor, DeepGlycanSite, by 1.16% and 2.94%, respectively. Additionally, its F1, MCC, and AUPR scores exceeded those of other state-of-the-art methods, with improvements ranging from 3.77% to 47.6%, 3.84% to 32.7%, and 8.18% to 60.21%, respectively.
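
For readers less familiar with the threshold-based metrics reported above, the following minimal sketch shows how balanced accuracy, sensitivity, F1, and MCC are conventionally computed from per-residue binary labels. The function name `binary_site_metrics` and the toy labels are illustrative assumptions and are not taken from the DeepCPBSite code; AUPR is omitted because it requires predicted probabilities rather than hard labels.

```python
import math


def binary_site_metrics(y_true, y_pred):
    """Compute standard binary metrics from per-residue labels.

    Hypothetical helper for illustration: 1 = binding residue, 0 = non-binding.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Sensitivity (recall) and specificity; guard against empty classes.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # Balanced accuracy is the mean of sensitivity and specificity,
    # which makes it robust to the scarcity of binding residues.
    balanced_accuracy = (sensitivity + specificity) / 2

    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity)
        else 0.0
    )

    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "balanced_accuracy": balanced_accuracy,
        "sensitivity": sensitivity,
        "f1": f1,
        "mcc": mcc,
    }


# Toy example (made-up labels): 2 binding residues out of 10.
toy_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_metrics = binary_site_metrics(toy_true, toy_pred)
```

On imbalanced data such as binding-site labels, plain accuracy can look high even when few binding residues are recovered, which is why balanced accuracy and MCC are the more informative summaries here.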