Abstract
Generative deep learning models have demonstrated significant potential in designing drug-like molecules. However, medicinal chemistry typically requires generating analogues that combine structural similarity with scaffold hopping, that is, the replacement of a molecular scaffold while retaining biological relevance. To address this, we introduce ANNalog, a transformer-based sequence-to-sequence generative model trained on pairs of molecules extracted from the same bioactivity assay within the same paper, as recorded in ChEMBL33. The dataset was constructed on the premise that molecules tested within the same assay can be considered analogues in medicinal chemistry space. Paired molecules were encoded as Simplified Molecular Input Line Entry System (SMILES) strings, and Levenshtein distance-guided alignment was applied to maximise intrapair string similarity; this preprocessing step was found to markedly enhance model performance. ANNalog can produce structurally similar analogues involving minor modifications, such as substituent replacements, and can also perform scaffold hopping, generating structurally distinct yet chemically relevant analogues. Scaffold-hopping capability was validated using manually curated molecule pairs and further confirmed through a case study involving orexin-2 receptor antagonists from the patent literature. When generation was constrained using ANNalog's prefix control feature, approximately 25% of the known scaffolds from the patent set were recovered by the model, illustrating enhanced performance under user-guided conditions.

Scientific Contribution
This study introduces ANNalog, a generative model trained on pairs of molecules synthesised and tested together within the same medicinal chemistry project. Unlike previous models trained on pairs of molecules selected according to similarity measures, ANNalog generates not only structurally similar molecules but also diverse scaffold-hopping transformations that have precedent in the medicinal chemistry literature.
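
The Levenshtein distance-guided alignment mentioned in the abstract can be illustrated with a short sketch. This is not the authors' implementation: it assumes that alignment means choosing, among several equivalent SMILES spellings of the second molecule in a pair, the spelling with the smallest edit distance to the first molecule's SMILES. The function names `levenshtein` and `align_pair` and the hand-written candidate strings are hypothetical; in practice the candidates would presumably come from a SMILES enumeration tool.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def align_pair(source: str, target_variants: list[str]) -> tuple[str, int]:
    """Pick the target spelling closest to the source string.

    Assumption: every entry of target_variants encodes the same molecule.
    """
    best = min(target_variants, key=lambda s: levenshtein(source, s))
    return best, levenshtein(source, best)


# Hypothetical pair: phenylacetic acid (source) and 2-phenylacetamide (target).
source = "c1ccccc1CC(=O)O"
# Three hand-written SMILES spellings of 2-phenylacetamide.
target_variants = [
    "NC(=O)Cc1ccccc1",
    "c1ccccc1CC(=O)N",
    "O=C(N)Cc1ccccc1",
]

aligned, distance = align_pair(source, target_variants)
print(aligned, distance)  # c1ccccc1CC(=O)N 1
```

In this toy pair the aligned spelling differs from the source by a single character, so the chemical change (acid to amide) appears as one localised edit in the string rather than as a wholesale rearrangement of tokens.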