Abstract
Nonalcoholic fatty liver disease (NAFLD) is a slowly progressing yet complex disease whose multiple pathophysiological mechanisms make it challenging to treat. In this study, we developed a machine learning (ML)-based stacking ensemble model that uses data from animal experiments to predict molecules that could inhibit NAFLD progression. We systematically collected 75 agents from preclinical experiments and classified them as inducers or inhibitors based on each study end point. We then computed 12 sets of molecular fingerprints and used them to train three baseline ML models. Next, the stacked model was trained on the predictive features from the baseline models and validated with 5-fold cross-validation (5-CV) and leave-one-out cross-validation (LOOCV). The stacked model outperformed its baseline models across various evaluation metrics, thereby improving the prediction of NAFLD inhibitory activity. Additionally, we tested the robustness and applicability domain of the stacked model to ensure that it delivers trustworthy predictions. Moreover, we highlighted key molecular features, such as carboxylic groups, alkenes, and aromatic rings, that influence the decision-making of the stacked model. In conclusion, we provide an effective method for improving molecular property prediction using a stacking ensemble learning approach. Furthermore, we host our software in an open-access GitHub repository to support reproducibility and use in drug discovery pipelines.
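To make the workflow described above concrete, the following is a minimal, self-contained sketch of a stacking ensemble evaluated with 5-CV and LOOCV. It is not the authors' implementation. Everything in it is an assumption made for illustration: the 75 "molecules" are synthetic fingerprints (sets of on-bits), the baseline learners are Tanimoto-based k-nearest-neighbour classifiers standing in for the three baseline ML models, the meta-learner is a plain logistic regression, and all function names (`toy_dataset`, `fit_stack`, `cross_validate`, and so on) are hypothetical.

```python
import math
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def base_features(train_fps, train_y, query, ks=(1, 3, 5)):
    """Assumed baseline learners: for each k, the fraction of inhibitors
    among the k training molecules most similar to the query."""
    scored = sorted(((tanimoto(fp, query), y) for fp, y in zip(train_fps, train_y)),
                    key=lambda pair: pair[0], reverse=True)
    ranked = [y for _, y in scored]
    return [sum(ranked[:k]) / k for k in ks]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=1.0, epochs=100):
    """Assumed meta-learner: logistic regression fitted by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)  # the last weight is the bias
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def split(fps, y, held_out):
    """Return the training fingerprints and labels with held_out indices removed."""
    held = set(held_out)
    keep = [i for i in range(len(fps)) if i not in held]
    return [fps[i] for i in keep], [y[i] for i in keep]


def fit_stack(fps, y, n_folds=5):
    """Train the meta-learner on out-of-fold baseline predictions."""
    meta_X = [None] * len(fps)
    for held_out in folds(len(fps), n_folds):
        tr_fps, tr_y = split(fps, y, held_out)
        for i in held_out:
            meta_X[i] = base_features(tr_fps, tr_y, fps[i])
    return fit_logistic(meta_X, y)


def predict_stack(w, train_fps, train_y, query):
    """Probability that the query is an inhibitor according to the stacked model."""
    feats = base_features(train_fps, train_y, query)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, feats)) + w[-1])


def cross_validate(fps, y, n_folds):
    """Accuracy of the stacked model; n_folds == len(fps) gives LOOCV."""
    correct = 0
    for held_out in folds(len(fps), n_folds, seed=42):
        tr_fps, tr_y = split(fps, y, held_out)
        w = fit_stack(tr_fps, tr_y)
        for i in held_out:
            predicted = int(predict_stack(w, tr_fps, tr_y, fps[i]) >= 0.5)
            correct += int(predicted == y[i])
    return correct / len(fps)


def toy_dataset(n=75, seed=1):
    """Synthetic stand-in data: label 1 = inhibitor, label 0 = inducer."""
    rng = random.Random(seed)
    fps, labels = [], []
    for i in range(n):
        label = i % 2
        core = range(0, 16) if label else range(16, 32)  # class-specific bits
        bits = {b for b in core if rng.random() < 0.6}
        bits |= {b for b in range(32, 64) if rng.random() < 0.3}  # shared noise bits
        fps.append(bits)
        labels.append(label)
    return fps, labels


fps, labels = toy_dataset()
acc_5cv = cross_validate(fps, labels, 5)
acc_loocv = cross_validate(fps, labels, len(fps))
print(f"5-CV accuracy: {acc_5cv:.2f}, LOOCV accuracy: {acc_loocv:.2f}")
```

The key design point is that the meta-learner only ever sees out-of-fold baseline predictions, so the stacked model is not fitted on outputs that the baseline learners produced for molecules they were trained on. The accuracies printed here describe the synthetic data only and say nothing about the NAFLD data set.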