Abstract
Deep learning methods have had a significant impact on drug discovery, yet developing models that can identify active compounds with novel structural types remains an urgent challenge. In this paper, we introduce a self-supervised representation learning framework, Geometry-based Bidirectional Encoder Representations from Transformers (GEO-BERT). GEO-BERT takes atom and chemical-bond information from chemical structures as input and incorporates positional information from the molecule's three-dimensional conformation during training. Specifically, GEO-BERT strengthens its characterization of molecular structures by introducing three types of positional relationships: atom-atom, bond-bond, and atom-bond. In benchmarking studies, GEO-BERT achieved the best performance on multiple benchmarks. We also performed a prospective study to validate the model, using screening for DYRK1A inhibitors as a case study; two potent and structurally novel DYRK1A inhibitors (IC50 < 1 μM) were ultimately discovered. Taken together, we have developed an open-source GEO-BERT model for molecular property prediction (https://github.com/drug-designer/GEO-BERT) and demonstrated its practical utility in early-stage drug discovery.
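To make the geometric idea concrete, the sketch below shows one common way (an assumption for illustration, not necessarily the exact GEO-BERT implementation) that 3D conformational information can enter a Transformer: pairwise atom-atom distances are converted into an additive bias on the attention scores, so spatially close atoms attend to each other more strongly. The function name `pairwise_distance_bias` and the linear-distance form of the bias are hypothetical.

```python
import numpy as np

def pairwise_distance_bias(coords, scale=1.0):
    """Turn 3D atomic coordinates into an attention bias.

    coords: (n_atoms, 3) array of positions from a molecular conformer.
    Returns an (n_atoms, n_atoms) matrix where closer atom pairs get a
    larger (less negative) bias, so attention favors spatial neighbors.
    Note: this linear form is an illustrative assumption.
    """
    diff = coords[:, None, :] - coords[None, :, :]   # (n, n, 3) displacement vectors
    dist = np.sqrt((diff ** 2).sum(-1))              # (n, n) pairwise distances
    return -scale * dist                             # nearer pairs -> higher bias

# Toy conformer: three atoms on a line at x = 0, 1, 3 (angstroms)
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0]])
bias = pairwise_distance_bias(coords)

# Adding the bias to raw attention scores before the softmax injects the
# atom-atom geometric relationship into a Transformer attention layer.
scores = np.zeros((3, 3)) + bias                     # zero content scores for clarity
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
```

Analogous bias matrices could in principle encode the bond-bond and atom-bond relationships mentioned above, using distances between bond midpoints or between atoms and bond midpoints.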