Abstract
MOTIVATION: Glycans are highly diverse biological sequences, yet our functional understanding of them has lagged behind that of proteins and nucleic acids. Many glycans remain ambiguously annotated, limiting computational analyses. Existing computational approaches are primarily graph-based; they capture local structural features but struggle to model global patterns and incomplete sequences.

RESULTS: We present GlycanGT, a graph-transformer-based pretrained model for glycans. Glycans were represented as graphs of monosaccharides and glycosidic bonds, and the model was pretrained with a masked language modeling objective. GlycanGT outperformed existing methods across eight benchmark tasks (e.g., 0.844 AUPRC for immunogenicity classification), and its embeddings formed biologically relevant clusters that recovered known N- and O-glycan categories. Moreover, GlycanGT accurately proposed candidates for ambiguously annotated sequences, maintaining >80% top-5 accuracy for both monosaccharide and glycosidic bond predictions under high masking levels.

AVAILABILITY AND IMPLEMENTATION: The source code used in this study is available at https://github.com/matsui-lab/GlycanGT and archived on Zenodo (DOI: 10.5281/zenodo.18636040); pretrained model weights are provided via Hugging Face (https://huggingface.co/Akikitani295/GlycanGT).

CONTACT: matsui.yusuke.d4@f.mail.nagoya-u.ac.jp

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.