Abstract
The aqueous solubility of a compound plays a crucial role throughout the stages of drug discovery and development. Despite numerous efforts using a variety of machine learning models, accurately estimating aqueous solubility remains a challenge. One primary limitation is the absence of a large, single-source dataset of drug-like compounds for model training. Additionally, studies have highlighted the need for improvements in prediction algorithms and molecular representations. To address these challenges, the Johnson & Johnson (J&J) in-house solubility data were leveraged. Theoretical pH-solubility equations and in-house pKa prediction tools were used to calculate intrinsic solubility from the J&J data. A multi-task graph transformer model was developed and trained on the calculated intrinsic solubility data of 13,306 compounds along with seven relevant physicochemical properties, including solubility at pH 2 and pH 7, logP, and logD at three different pH values. When evaluated on high-quality test data, the developed model achieved a root mean square error (RMSE) of 0.61 and a coefficient of determination (R²) of 0.60, demonstrating state-of-the-art performance in estimating intrinsic solubility for drug-like compounds.
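
The abstract does not state which pH-solubility equations were applied. As a minimal sketch, assuming the standard Henderson-Hasselbalch relationship for a monoprotic compound with a known pKa, intrinsic solubility could be back-calculated from a solubility measured at a given pH as shown below. The function name and arguments are hypothetical illustrations, not the in-house tooling, and multiprotic or zwitterionic compounds would need additional terms.

```python
import math


def intrinsic_log_solubility(log_s_at_ph, ph, pka, kind):
    """Estimate log S0 from log S measured at a given pH.

    Assumes Henderson-Hasselbalch behaviour for a monoprotic compound
    (an illustrative assumption, not the method confirmed by the paper):
      acid:    S = S0 * (1 + 10**(pH - pKa))
      base:    S = S0 * (1 + 10**(pKa - pH))
      neutral: S = S0
    """
    if kind == "acid":
        ionized_to_neutral = 10.0 ** (ph - pka)
    elif kind == "base":
        ionized_to_neutral = 10.0 ** (pka - ph)
    elif kind == "neutral":
        ionized_to_neutral = 0.0
    else:
        raise ValueError(f"unknown compound kind: {kind!r}")
    # Remove the contribution of the ionized species on the log scale.
    return log_s_at_ph - math.log10(1.0 + ionized_to_neutral)


# Hypothetical weak acid (pKa 4) with log S = -2.0 measured at pH 7.
log_s0 = intrinsic_log_solubility(-2.0, ph=7.0, pka=4.0, kind="acid")
```

For an acid measured well above its pKa, the correction is roughly the difference between pH and pKa in log units, which is why errors in predicted pKa carry over almost directly into the calculated intrinsic solubility.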