Bioactivity Comparison across Multiple Machine Learning Algorithms Using over 5000 Datasets for Drug Discovery

利用超过5000个数据集对多种机器学习算法的生物活性进行比较,以用于药物发现

阅读:1

Abstract

Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies, we and others have applied multiple machine learning algorithms and modeling metrics and, in some cases, compared molecular descriptors to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The largest of these types of studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from CHEMBL for use with the ECFP6 fingerprint and in comparison of our proprietary software Assay Central with random forest, k-nearest neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (three layers). Model performance was assessed using an array of fivefold cross-validation metrics including area-under-the-curve, F1 score, Cohen's kappa, and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets, all methods appeared comparable, while the distance from the top indicated that Assay Central and support vector classification were comparable. Unlike prior studies which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case. If anything, Assay Central may have been at a slight advantage as the activity cutoff for each of the over 5000 datasets representing over 570,000 unique compounds was based on Assay Central performance, although support vector classification seems to be a strong competitor. We also applied Assay Central to perform prospective predictions for the toxicity targets PXR and hERG to further validate these models. This work appears to be the largest scale comparison of these machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors, and machine learning algorithms and further refine the methods for evaluating and comparing such models.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。