Abstract
In computer-aided drug discovery, structure-based drug design models are computationally intensive and depend on the availability of protein structures, which limits their scalability and generalization. Additionally, many existing models suffer from inflated false-positive rates because negative binding data are scarce in training sets. To overcome these challenges, we present ProMol_Func, a structure-free deep learning framework that integrates graph-based encodings of small molecules with protein function embeddings derived solely from amino acid sequences. By augmenting the training data set with both experimentally validated inactives and randomly selected decoys, ProMol_Func improves screening power and generalization. The model achieves state-of-the-art performance on the challenging LIT-PCBA (Library of Integrated Targeted-Panel of Cell-Based Assays) benchmark, with an enrichment factor at the top 1% of ranked compounds (EF1%) of 10.9, demonstrating robust screening power in realistic assay settings. Furthermore, in a zero-shot prospective application to E. coli DnaK, a protein chaperone with no actives in the training set, ProMol_Func identified compounds that inhibit its ATPase activity or alter the protein's thermal stability, supporting the potential of ProMol_Func for discovering binders for novel targets. These results position ProMol_Func as an efficient and scalable alternative to traditional structure-dependent approaches in early-stage hit discovery.
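
For readers unfamiliar with the EF1% metric quoted above, the following is a minimal sketch of how an enrichment factor is commonly computed: the hit rate among the top-ranked fraction of a screened library divided by the hit rate of the whole library. Under this definition, an EF1% of 10.9 means the top 1% of the ranking contains about 10.9 times as many actives as random selection would yield. The function name `enrichment_factor`, its parameters, and the toy data are illustrative assumptions, not taken from the ProMol_Func code, and the benchmark's exact handling of ties and rounding may differ.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor for the top `fraction` of compounds ranked by score.

    scores: predicted scores, higher meaning more likely active (assumption).
    labels: 1 for an experimentally active compound, 0 for an inactive one.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    total_actives = sum(labels)
    if n_total == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")

    # Size of the top slice; always keep at least one compound.
    n_top = max(1, int(round(n_total * fraction)))

    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])

    # (top_actives / n_top) / (total_actives / n_total), written as one ratio.
    return (top_actives * n_total) / (n_top * total_actives)


# Toy library: 1000 compounds, 10 actives, 5 of them ranked in the top 10.
toy_scores = [1.0 - i / 1000 for i in range(1000)]
toy_labels = [0] * 1000
for idx in (0, 2, 4, 6, 8, 100, 200, 300, 400, 500):
    toy_labels[idx] = 1

print(enrichment_factor(toy_scores, toy_labels, fraction=0.01))  # prints 50.0
```

In the toy example the top 1% (10 compounds) holds 5 of the 10 actives, so the hit rate there is 0.5 against a library-wide rate of 0.01, giving an enrichment of 50. A random ranking would give a value near 1.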