Abstract
OBJECTIVES: This study aimed to develop an accurate prediction model for the risk of non-alcoholic fatty liver disease (NAFLD) using random survival forests (RSF) and to investigate how NAFLD risk is distributed over time. METHODS: This retrospective cohort study included subjects who had annual health checkups from 1 January 2021 to 31 December 2024. A hold-out strategy, in which all subjects were divided into a training set and a test set, was used to develop and evaluate the models. Important predictors were selected from the candidate variables using LASSO regression on the training set. Two prediction models were then constructed: a Cox model and an RSF model. Feature importance and its 95% confidence intervals (CIs) were calculated using variable importance (VIMP) with bootstrap resampling. Discrimination was evaluated using the integrated area under the curve (iAUC) and the time-dependent area under the curve (tAUC), and calibration was evaluated using the integrated Brier score (iBS) and the time-dependent prediction error (PE). RESULTS: A total of 18,250 subjects met the inclusion criteria, and 14 predictors were selected by LASSO regression for model development. The RSF model showed better discrimination (iAUC of 0.856) and calibration (iBS of 0.116) than the Cox model (iAUC of 0.759 and iBS of 0.148). Based on the RSF model predictions, subjects were stratified into high- and low-risk groups, whose mean NAFLD-free times differed significantly (20.86 and 36.76 months, respectively; P < .0001). CONCLUSIONS: The RSF prediction model for NAFLD risk developed in this study outperformed the traditional Cox model, clearly separated subjects by NAFLD risk, and provided new insight into how NAFLD risk is distributed over time.
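To make the relationship between the time-dependent prediction error and the integrated Brier score concrete, the following is a minimal sketch rather than the implementation used in this study. It assumes that every subject's event time is fully observed, so it omits the censoring weights that a real survival analysis would need, and all function names and toy values are hypothetical. It computes the Brier score at each time point on a grid and then averages it over the follow-up window with the trapezoidal rule.

```python
def brier_score(event_times, surv_probs, t):
    """Brier score at time t.

    Simplifying assumption: every event time is observed (no censoring),
    so no inverse-probability-of-censoring weights are applied.
    surv_probs[i] is the predicted probability that subject i is still
    event-free at time t.
    """
    total = 0.0
    for time, prob in zip(event_times, surv_probs):
        still_event_free = 1.0 if time > t else 0.0
        total += (still_event_free - prob) ** 2
    return total / len(event_times)


def integrated_brier_score(grid, scores):
    """Trapezoidal average of the scores over [grid[0], grid[-1]]."""
    area = 0.0
    for i in range(1, len(grid)):
        area += 0.5 * (scores[i] + scores[i - 1]) * (grid[i] - grid[i - 1])
    return area / (grid[-1] - grid[0])


# Hypothetical toy data: event times in months for four subjects.
event_times = [6.0, 18.0, 30.0, 40.0]
grid = [12.0, 24.0, 36.0]
# Hypothetical predicted event-free probabilities at each grid time.
predicted = {
    12.0: [0.5, 0.5, 1.0, 1.0],
    24.0: [0.0, 0.5, 0.5, 1.0],
    36.0: [0.0, 0.0, 0.5, 1.0],
}

scores = [brier_score(event_times, predicted[t], t) for t in grid]
ibs = integrated_brier_score(grid, scores)
print(scores, ibs)
```

Lower values indicate smaller prediction error, which is why an iBS of 0.116 for the RSF model compares favorably with 0.148 for the Cox model.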