Abstract
Interest in peptides incorporating non-canonical amino acids has surged in recent years, driven by their enhanced stability and resistance to proteolytic degradation. These non-canonical peptides offer significant potential for modifying biological, pharmacological, and physicochemical properties in both native and synthetic contexts. Despite these advantages, there is still a lack of efficient pre-trained models that can capture feature representations from such complex peptide sequences. This study introduces PepLand, a novel pre-training framework for the representation and analysis of peptides containing both canonical and non-canonical amino acids. PepLand uses a general-purpose multi-view heterogeneous graph neural network to learn subtle structural representations of peptides. Our empirical evaluations demonstrate PepLand's effectiveness across a range of peptide property prediction tasks, including cell penetrability, solubility, and protein-peptide binding affinity. These assessments support PepLand's superior capability in capturing salient representations of peptides with both canonical and non-canonical amino acids, and provide a foundation for advances in peptide-focused pharmaceutical research. The source code and datasets are available at http://www.healthinformaticslab.org/supp/resources.php and https://github.com/zhangruochi/PepLand.