Abstract
OBJECTIVE: To develop and validate stratified machine learning models, based on real-world clinical data, for early prediction of anti-tuberculosis drug-induced liver injury (ATB-DILI) risk in two populations: a general tuberculosis (TB) treatment population and a high-risk subgroup with chronic hepatitis B (CHB) co-infection.

METHODS: A single-center retrospective cohort study was conducted using data from 11,361 TB patients (3,787 ATB-DILI cases and 7,574 controls) and a CHB subgroup of 1,017 patients (339 cases and 678 controls) after propensity score matching. Ten machine learning algorithms were applied, including logistic regression, random forest (RF), and XGBoost. Models were trained and validated using a 1:1 training/validation split and 10-fold cross-validation. Performance was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, precision, and F1-score. Models were interpreted using SHapley Additive exPlanations (SHAP).

RESULTS: In the overall population, the ensemble methods RF and XGBoost achieved validation-set AUCs of 0.960 and 0.954, respectively. In the CHB subgroup, RF and XGBoost performed better still, with AUCs of 0.994. Key predictors in the general population included alanine aminotransferase (ALT), eosinophil count, aspartate aminotransferase (AST), and procalcitonin, whereas in the CHB subgroup total bile acid, procalcitonin, alkaline phosphatase (ALP), and prealbumin were most influential. SHAP analysis revealed non-linear relationships between features and ATB-DILI risk that were consistent with clinical knowledge.

CONCLUSION: Stratified machine learning models, particularly ensemble methods, predicted ATB-DILI risk with high discrimination, and the differing predictor profiles suggest distinct injury mechanisms in general and CHB co-infected TB patients. This approach offers a clinically interpretable and accurate early-warning tool for ATB-DILI that supports personalized risk assessment and management.
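For readers less familiar with the evaluation metrics named in the Methods, the following minimal Python sketch shows how AUC, accuracy, sensitivity, specificity, precision, and F1-score can be computed from binary labels and predicted risk scores. It is an illustration only, not the study's code: the function names, the 0.5 decision threshold, and the eight toy observations are assumptions made for this example and are not drawn from the study data.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Threshold-based metrics for binary labels (1 = case, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases flagged as cases
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls flagged as controls
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged as cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases that were missed

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall among cases
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall among controls
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 cases, 5 controls, with model risk scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
THRESHOLD = 0.5  # assumed cut-off for turning scores into class labels
y_pred = [1 if s >= THRESHOLD else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is computed from the risk scores directly and so does not depend on the threshold, whereas the other five metrics change with the chosen cut-off.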