Abstract
BACKGROUND: Hepatocellular carcinoma (HCC) has a high recurrence rate even after curative treatment. Microvascular invasion (MVI) is a key predictor of recurrence; however, current diagnostic methods are invasive and yield delayed results. We aimed to develop explainable machine learning models for non-invasive, preoperative prediction of MVI.
METHODS: We retrospectively analysed 308 patients with HCC (132 MVI-positive and 176 MVI-negative) who underwent curative hepatectomy between January 2020 and December 2023. Patients were randomly divided into training (n = 216) and validation (n = 92) cohorts at a 7:3 ratio. Independent risk factors were identified using univariate and multivariate logistic regression analyses. Least absolute shrinkage and selection operator (LASSO) regression was used to select predictive features. Ten machine learning models were constructed and evaluated using receiver operating characteristic (ROC), calibration, and decision curves. Model explainability was assessed using SHapley Additive exPlanations (SHAP).
RESULTS: Hepatitis viral load, alpha-fetoprotein level, gamma-glutamyl transferase level, tumour size, and radiogenomic venous invasion (RVI) were significant independent risk factors for MVI. LASSO regression identified 12 key features. The extreme gradient boosting (XGBoost) model performed best. In the training set, its area under the ROC curve (AUC) was 0.852, with accuracy, sensitivity, specificity, and F1 score of 0.792, 0.812, 0.775, and 0.776, respectively. In the validation set, the AUC was 0.815, with accuracy, sensitivity, specificity, and F1 score of 0.750, 0.677, 0.805, and 0.700, respectively. SHAP analysis identified hepatitis viral load, RVI, alpha-fetoprotein, tumour size, and pseudocapsule integrity as the most influential predictors.
CONCLUSION: The XGBoost model accurately predicted MVI preoperatively in patients with HCC, and its SHAP-based interpretability supports personalised surgical decision-making.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12885-026-15839-0.
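ILLUSTRATIVE SKETCH: The performance measures reported in the Results (AUC, accuracy, sensitivity, specificity, and F1 score) can be computed from predicted probabilities as outlined below. This is a minimal sketch, not the study's code: the labels and scores are invented toy values rather than patient data, the 0.5 decision threshold is an assumption, and the function names are hypothetical. The abstract does not state which software or threshold the authors used.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), calling a case positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and F1 score from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),      # true-positive rate (recall)
        "specificity": tn / (tn + fp),      # true-negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as half a win (the Mann-Whitney U formulation of the AUC).
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example (invented values): 1 = MVI-positive, 0 = MVI-negative.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

toy_counts = confusion_counts(toy_labels, toy_scores, threshold=0.5)
toy_metrics = classification_metrics(*toy_counts)
toy_auc = roc_auc(toy_labels, toy_scores)

print(toy_counts)
print({name: round(value, 3) for name, value in toy_metrics.items()})
print(round(toy_auc, 3))
```

The AUC does not depend on the threshold, whereas accuracy, sensitivity, specificity, and F1 score all change with it, which is why the two kinds of measure are reported side by side.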