Abstract
MOTIVATION: N-linked glycosylation is one of the most fundamental post-translational modifications (PTMs), in which an oligosaccharide is covalently attached to an asparagine (N) residue. These modifications occur in conserved sequence motifs of the form N-X-S or N-X-T, where X can be any residue except proline (P). Identifying N-linked glycosylation sites is important because these PTMs play a vital role in many biological processes. Experimental methods for detecting these sites, such as mass spectrometry, are very expensive. Computational prediction of N-linked glycosylation sites has therefore become an important research field. RESULTS: In this work, we propose StackGlyEmbed, a stacking ensemble machine learning model, to computationally predict N-linked glycosylation sites. We explored embeddings from several protein language models and built the stacking ensemble with Support Vector Machine (SVM), Extreme Gradient Boosting (XGB) and K-Nearest Neighbor (KNN) learners in the base layer and a second SVM model in the meta layer. StackGlyEmbed achieves 98.2% sensitivity, 92.5% balanced accuracy, 89.1% F1-score and 82.6% Matthews correlation coefficient in independent testing, outperforming the existing state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION: StackGlyEmbed is freely available at: https://github.com/nafcoder/StackGlyEmbed.
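As an illustration of the N-X-S / N-X-T motif described above, the following minimal sketch scans a protein sequence for asparagine residues that sit in such a motif. It is not taken from the StackGlyEmbed code base; the names `SEQUON` and `find_sequons` are hypothetical and chosen only for this example. The scan finds candidate motifs only and does not predict whether a given site is actually glycosylated.

```python
import re

# N, then any residue except proline (P), then serine (S) or threonine (T).
# The lookahead makes the match zero-width so overlapping motifs
# (for example "NNST") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of N residues in N-X-S/T motifs (X != P)."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # Toy sequence made up for this example, not a real protein.
    print(find_sequons("MNGSANPTNNST"))
```

In the toy sequence, the asparagine followed by proline (N-P-T) is skipped, while the two overlapping motifs at the end are both reported.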