A quantitative study of cytotoxic compounds using graph based descriptors and machine learning



Abstract

Cytotoxic drugs form a heterogeneous group of antineoplastic agents commonly used in the management of cancer and other disorders, but they are frequently associated with narrow therapeutic indices and severe side effects. Understanding their physicochemical properties is therefore important for predicting absorption, permeability and distribution. One such property is the topological polar surface area (Top_PSA), a key determinant of membrane transport and a widely used surrogate for passive diffusion and blood-brain barrier permeability. In this study, we explored whether graph-theoretical and molecular descriptors can consistently predict RDKit/Mordred-calculated Top_PSA values for a curated dataset of 156 structurally diverse cytotoxic agents. Fifty-eight descriptors were calculated and processed under five preprocessing schemes (direct fitting, principal component analysis (PCA), robust scaling, outlier identification and removal, and feature selection based on the variance inflation factor (VIF)), each combined with linear, LASSO and ridge regression models. K-fold cross-validation was applied rigorously to all models. Robust scaling combined with LASSO gave the best predictive performance, with the largest [Formula: see text], which demonstrates the effectiveness of robust preprocessing and sparsity-inducing regularization under heteroscedasticity and multicollinearity in descriptor-rich datasets. PCA achieved similar predictive accuracy but with lower interpretability, whereas VIF-based pruning never matched that accuracy. Analysis of the non-zero LASSO coefficients indicated that heteroatom content, hydrogen-bonding capacity, and a set of electronegativity-weighted indices contribute significantly to Top_PSA, consistent with its fragment-based chemical definition.
Overall, this work provides practical recommendations on preprocessing, feature selection and model selection in QSAR workflows, and highlights the importance of transparent computational pipelines for obtaining accurate descriptor-based Top_PSA predictions.
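
The best-performing combination named in the abstract, robust scaling followed by LASSO under K-fold cross-validation, can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything here is an assumption: the helper names (`fit_robust_scaler`, `fit_lasso`, `kfold_r2`), the penalty `lam=0.2`, the choice of 5 folds, and above all the data, which is a synthetic stand-in rather than the 156-compound descriptor set. The sketch uses only the Python standard library; in practice a library such as scikit-learn (`RobustScaler`, `Lasso`) would likely be more convenient.

```python
import random
import statistics


def fit_robust_scaler(rows):
    """Learn each feature's median and interquartile range (IQR)."""
    params = []
    for col in zip(*rows):
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        params.append((statistics.median(col), (q3 - q1) or 1.0))  # guard against zero IQR
    return params


def apply_scaler(rows, params):
    """Center by the median and divide by the IQR, feature by feature."""
    return [[(v - med) / iqr for v, (med, iqr) in zip(row, params)] for row in rows]


def soft_threshold(value, lam):
    """Shrink toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_lasso(rows, y, lam, n_iter=100):
    """Coordinate descent on (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1."""
    n, p = len(rows), len(rows[0])
    w, b0 = [0.0] * p, statistics.fmean(y)
    for _ in range(n_iter):
        for j in range(p):
            rho = z = 0.0
            for x, target in zip(rows, y):
                # Prediction with feature j left out, so rho is the partial residual
                partial = b0 + sum(w[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (target - partial)
                z += x[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z else 0.0
        b0 = statistics.fmean(t - sum(wk * xk for wk, xk in zip(w, x)) for x, t in zip(rows, y))
    return b0, w


def predict(rows, b0, w):
    return [b0 + sum(wk * xk for wk, xk in zip(w, x)) for x in rows]


def kfold_r2(rows, y, lam, k=5, seed=0):
    """Mean out-of-fold R^2; the scaler is refit inside each fold to avoid leakage."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        params = fit_robust_scaler([rows[i] for i in train])
        b0, w = fit_lasso(apply_scaler([rows[i] for i in train], params), [y[i] for i in train], lam)
        preds = predict(apply_scaler([rows[i] for i in test], params), b0, w)
        truth = [y[i] for i in test]
        mean_t = statistics.fmean(truth)
        ss_res = sum((t - p) ** 2 for t, p in zip(truth, preds))
        ss_tot = sum((t - mean_t) ** 2 for t in truth)
        scores.append(1.0 - ss_res / ss_tot)
    return statistics.fmean(scores)


# Synthetic stand-in data (NOT the paper's descriptors): 60 samples, 4 features,
# where only the first two features drive the target.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(60)]
y = [3.0 * x[0] - 2.0 * x[1] + rng.gauss(0, 0.1) for x in X]

mean_r2 = kfold_r2(X, y, lam=0.2)
b0, w = fit_lasso(apply_scaler(X, fit_robust_scaler(X)), y, lam=0.2)
print("mean cross-validated R^2:", round(mean_r2, 3))
print("LASSO coefficients:", [round(c, 3) for c in w])
```

Refitting the scaler inside each fold keeps the held-out samples from influencing the medians and IQRs, and the soft-thresholding step is what lets LASSO set uninformative coefficients to exactly zero, which is the property the abstract relies on when interpreting the non-zero coefficients.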
