In drug discovery, prioritizing compounds for experimental testing is a critical task that can be optimized through active learning by strategically selecting informative molecules. Active learning typically trains models on labeled examples alone, while unlabeled data is only used for acquisition. This fully supervised approach neglects valuable information present in unlabeled molecular data, impairing both predictive performance and the molecule selection process. We address this limitation by integrating a transformer-based BERT model, pretrained on 1.26 million compounds, into the active learning pipeline. This effectively disentangles representation learning and uncertainty estimation, leading to more reliable molecule selection. Experiments on Tox21 and ClinTox datasets demonstrate that our approach achieves equivalent toxic compound identification with 50% fewer iterations compared to conventional active learning. Analysis reveals that pretrained BERT representations generate a structured embedding space enabling reliable uncertainty estimation despite limited labeled data, confirmed through Expected Calibration Error measurements. This work establishes that combining pretrained molecular representations with active learning significantly improves both model performance and acquisition efficiency in drug discovery, providing a scalable framework for compound prioritization.

SCIENTIFIC CONTRIBUTION: We demonstrate that high-quality molecular representations fundamentally determine active learning success in drug discovery, outweighing acquisition strategy selection. We provide a framework that integrates pretrained transformer models with Bayesian active learning to separate representation learning from uncertainty estimation, a critical distinction in low-data scenarios. This approach establishes a foundation for more efficient screening workflows across diverse pharmaceutical applications.
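The abstract cites Expected Calibration Error (ECE) as the check on uncertainty quality but does not spell out how it is computed. The following is a minimal sketch of one common formulation for binary labels (equal-width confidence bins, weighted gap between accuracy and mean confidence). The function name, the 0.5 decision threshold, and the default of 10 bins are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins (illustrative; not the paper's code).

    probs  : predicted probability of the positive (e.g. toxic) class
    labels : ground-truth labels in {0, 1}
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Predicted class and the confidence assigned to that predicted class.
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Right-closed bins (lo, hi]; confidence is always >= 0.5 here.
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return float(ece)
```

A well-calibrated model gives a value near 0. For example, predicting 0.9 for ten molecules of which only five are actually positive yields a gap of 0.4 between confidence and accuracy.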
Molecular property prediction using pretrained-BERT and Bayesian active learning: a data-efficient approach to drug design.
Authors: Masood Muhammad Arslan, Kaski Samuel, Cui Tianyu
| Journal: | Journal of Cheminformatics | Impact factor: | 5.700 |
| Year: | 2025 | Date; volume(issue):article: | 2025 Apr 23; 17(1):58 |
| doi: | 10.1186/s13321-025-00986-6 | ||
