Abstract
Emerging pollutants are substances that have recently been discovered or brought into focus, that pose ecological or human-health risks, and that either have not yet been included in regulatory frameworks or are inadequately controlled by existing management measures. Synthetic chemicals play key roles in advancing human society and improving quality of life. However, these chemicals may leak into the environment through unintentional or organized emissions during the life cycles of chemical-containing products, thereby becoming potential emerging pollutants and posing ecological and human-health threats. Many new chemicals are used without sufficient toxicity assessment; consequently, their potential threats are difficult to predict. Effective toxicity assessments of existing and emerging chemicals are therefore required. However, testing the toxicity of every chemical would be very time-consuming and costly. In addition, discrepancies between experimental results from different laboratories lead to inconsistent toxicity-screening standards for emerging pollutants, which hinders both the prevention and control of these pollutants and the explanation of their toxicity mechanisms. Addressing these issues requires the development of standardized alternative toxicity-testing strategies that screen emerging pollutants in a high-throughput manner. In this study, machine-learning methods were used to predict the toxicities of compounds in the Tox21 database. The RDKit and Mordred libraries were used to process the structural data of the compounds (presented in SMILES format) and generate molecular descriptors of their physicochemical properties. A refined feature set was selected through information-gain calculations and variable selection, and the data were fitted using the Python libraries scikit-learn and XGBoost.
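The information-gain screening mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library rather than scikit-learn, it assumes that each numeric descriptor is scored by the best single-threshold split, and the function names, the `min_gain` cutoff, and the toy descriptor columns are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature, labels, threshold):
    """Entropy reduction obtained by splitting the labels at feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_information_gain(feature, labels):
    """Highest gain over thresholds placed midway between sorted unique values."""
    values = sorted(set(feature))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max((information_gain(feature, labels, t) for t in thresholds), default=0.0)


def select_features(columns, labels, min_gain):
    """Return descriptor names whose best gain reaches min_gain, highest gain first."""
    gains = {name: best_information_gain(col, labels) for name, col in columns.items()}
    return sorted((n for n, g in gains.items() if g >= min_gain), key=lambda n: -gains[n])


# Hypothetical toy data: three descriptor columns for six compounds (1 = active).
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "logP_like": [0.5, 1.0, 1.5, 3.0, 3.5, 4.0],  # separates the two classes
    "noise": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],      # nearly uninformative
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],   # no variation at all
}
selected = select_features(columns, labels, min_gain=0.5)
print(selected)
```

In this toy case only the descriptor that separates actives from inactives survives the cutoff. In practice the cutoff would have to be tuned for each endpoint, and a library routine such as a mutual-information estimator could replace the hand-written scoring.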
Prediction models were constructed from the selected features using seven machine-learning algorithms to evaluate 12 bioactivity endpoints, including endpoints related to endocrine disruption, DNA damage, and oxidative stress response. Model performance was evaluated by calculating the accuracy on the test set, and data availability was characterized in terms of the applicability domain. All training and test data were found to lie within the applicability domain. The model predicted all 12 endpoints with high accuracy. This study clarified the relationship between the physicochemical properties of chemicals and nuclear receptor activity, and developed corresponding software tools. The model for the 12 Tox21 datasets exhibited an average area under the curve (AUC) of 0.84 and delivered better prediction performance than other participating models. Further insight into toxicological mechanisms was obtained through feature-importance analysis using SHapley Additive exPlanations (SHAP). The octanol-water partition coefficient (log P), molecular topology, and the ZMIC and piPC descriptors were identified as key parameters for predicting toxicity; these descriptors elucidate the relationship between chemical structure and biological interaction, thereby providing mechanistic explanations for compound toxicities. For example, high log P values are associated with high cell membrane permeability, which facilitates interactions of compounds with intracellular targets and endocrine receptors. The study also developed user-friendly quantitative structure-activity relationship (QSAR) prediction software. Designed for accessibility, this software enables researchers and policymakers to input compound structures in SMILES format and predict their toxicities without specialized machine-learning expertise. The software automatically generates descriptors and predicts whether the input compounds are toxic.
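As an illustration of how an average AUC over several endpoints can be computed, the following sketch uses the rank-based definition of AUC: the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one, with ties counted as one half. It is a standard-library sketch, not the evaluation code used in the study; the endpoint names, labels, and scores are hypothetical.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(endpoints):
    """Return the per-endpoint AUCs and their unweighted mean."""
    per_endpoint = {name: roc_auc(y, s) for name, (y, s) in endpoints.items()}
    return per_endpoint, sum(per_endpoint.values()) / len(per_endpoint)


# Hypothetical test-set labels (1 = active) and predicted scores for two endpoints.
endpoints = {
    "endpoint_A": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "endpoint_B": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
}
per_endpoint, average = mean_auc(endpoints)
print(per_endpoint, average)
```

The pairwise loop is quadratic in the number of compounds, which is acceptable for a sketch; for datasets of Tox21 size a sorted-rank implementation or a library routine would be the usual choice.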
By integrating advanced machine-learning and interpretation methods, this study contributes to in silico methods that can replace animal testing in future toxicity studies. The predictive model and accompanying software enable the rapid screening of emerging pollutants and provide guidance for designing safer chemicals. These contributions are critical for advancing environmental safety and public health in the face of expanding chemical inventories.