Abstract
Cytotoxic drugs form a heterogeneous group of antineoplastic agents widely used in the management of cancer and other disorders, but they are often associated with narrow therapeutic indices and severe side effects. Understanding their physicochemical properties is essential for predicting absorption, permeability, and distribution. One such property is the topological polar surface area (Top_PSA), a key determinant of membrane transport and a widely used surrogate for passive diffusion and blood-brain barrier permeability. In this study, we examined whether graph-theoretical and molecular descriptors can reliably predict RDKit/Mordred-calculated Top_PSA values for a curated dataset of 156 structurally diverse cytotoxic agents. Fifty-eight descriptors were calculated and processed under five preprocessing schemes (direct fitting, principal component analysis (PCA), robust scaling, outlier detection and removal, and feature selection based on the variance inflation factor (VIF)), each combined with linear, LASSO, and ridge regression models. K-fold cross-validation was applied to all models. Robust scaling combined with LASSO gave the best predictive performance, with the largest [Formula: see text], demonstrating the effectiveness of robust preprocessing and sparsity-inducing regularization when descriptor-rich datasets exhibit heteroscedasticity and multicollinearity. PCA achieved similar predictive accuracy but with lower interpretability, whereas VIF-based pruning never matched this performance. Analysis of the non-zero LASSO coefficients indicated that heteroatom content, hydrogen-bonding capacity, and a set of electronegativity-weighted indices contribute significantly to Top_PSA, in line with its fragment-based chemical definition.
Overall, this work offers practical recommendations on preprocessing, feature selection, and model selection in QSAR workflows, and underscores the importance of transparent computational pipelines for accurate descriptor-based Top_PSA prediction.
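As an illustration of the best-performing configuration described above (robust scaling followed by LASSO, evaluated by K-fold cross-validation), a minimal sketch using scikit-learn might look as follows. It is not the study's actual pipeline: the descriptor matrix is replaced by synthetic data of the same shape (156 samples, 58 features), and the helper names, the regularization strength `alpha=0.5`, and the choice of five folds are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def make_synthetic_descriptors(n_samples=156, n_features=58, seed=0):
    """Synthetic stand-in for the descriptor matrix (NOT the study's data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    # In this toy setup only the first three columns drive the target.
    y = 20.0 * X[:, 0] + 12.0 * X[:, 1] + 9.0 * X[:, 2]
    y += rng.normal(scale=1.0, size=n_samples)
    return X, y


def evaluate_robust_lasso(X, y, alpha=0.5, n_splits=5, seed=0):
    """Cross-validate RobustScaler + Lasso; return fold R^2 scores and coefficients.

    alpha and n_splits are illustrative assumptions, not values from the study.
    """
    # The scaler sits inside the pipeline, so its medians and IQRs are
    # re-estimated on each training fold and never see the held-out fold.
    model = make_pipeline(RobustScaler(), Lasso(alpha=alpha, max_iter=10_000))
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

    # Refit on all samples to inspect which coefficients stay non-zero.
    model.fit(X, y)
    coef = model.named_steps["lasso"].coef_
    return scores, coef


if __name__ == "__main__":
    X, y = make_synthetic_descriptors()
    scores, coef = evaluate_robust_lasso(X, y)
    print("mean cross-validated R^2:", scores.mean())
    print("indices of non-zero coefficients:", np.flatnonzero(coef))
```

Placing the scaler inside the pipeline keeps the cross-validation estimate free of information leakage from the held-out fold, and the indices of the non-zero coefficients are what one would map back to descriptor names when interpreting the model.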