Abstract
BACKGROUND: The multifactorial mechanisms driving childhood obesity, a global public health challenge, are yet to be fully elucidated. We aimed to develop and externally validate three widely applied machine learning models alongside logistic regression to predict obesity risk in children and adolescents aged 2-18 years in Beijing and Tangshan. We further aimed to interpret the optimised model and translate it into a web-based tool to inform clinical decision-making.
METHODS: We analysed data from 19 024 (training/testing) and 2410 (external validation) children and adolescents from Beijing and Tangshan, respectively. Using a set of demographic, familial, socioeconomic, lifestyle, and perinatal variables, we developed four models (light gradient boosting machine, random forest, eXtreme gradient boosting (XGBoost), and logistic regression) and compared their predictive performance. After validation, we selected an optimised model and interpreted it using SHapley Additive exPlanations (SHAP) analysis. We then developed an online calculator with interpretable visualisations to enable real-time risk assessment.
RESULTS: The XGBoost model exhibited superior performance, with an area under the receiver operating characteristic curve (AUROC) of 0.875 on the external validation set, significantly outperforming the logistic regression model (AUROC = 0.718). To identify the minimal feature subset that maintained model efficacy, we incrementally incorporated predictors in descending order of SHAP importance while assessing key performance metrics (accuracy, AUROC, and F-beta score). This SHAP-based analysis identified nine key predictors of childhood obesity: birth length, paternal body mass index (BMI), maternal BMI, sleep duration, physical activity, birth weight, maternal age at delivery, delivery mode, and gestational age. The deployed online tool provides individualised risk probabilities and SHAP-derived explanations.
CONCLUSIONS: In our study, the XGBoost model outperformed the other ensemble learning methods and logistic regression in predicting childhood obesity. The digital tool integrates this model and can help clinical practitioners estimate an individual child's risk of obesity.
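The incremental feature-selection step described in the RESULTS can be illustrated with a minimal sketch. It is not the study's implementation: the function names `auroc` and `minimal_feature_subset`, the `evaluate` callback, and the tolerance-based stopping rule are assumptions made for illustration, and the importance and score values in the example are invented toy numbers. In practice, `evaluate` would presumably retrain the model on the given predictors and return a held-out metric such as AUROC.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via the rank-sum (Mann-Whitney U) identity, with tied scores
    receiving their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC requires both positive and negative labels")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def minimal_feature_subset(
    importance: Dict[str, float],
    evaluate: Callable[[List[str]], float],
    tolerance: float = 0.01,
) -> Tuple[List[str], List[float]]:
    """Add predictors in descending order of importance and return the
    smallest prefix whose score is within `tolerance` of the score obtained
    with all predictors. The stopping rule is an assumption of this sketch."""
    ranked = sorted(importance, key=lambda name: importance[name], reverse=True)
    curve = [evaluate(ranked[:k]) for k in range(1, len(ranked) + 1)]
    full_score = curve[-1]
    for k, score in enumerate(curve, start=1):
        if score >= full_score - tolerance:
            return ranked[:k], curve
    return ranked, curve


# Toy usage: importances and scores below are made up, not study results.
toy_importance = {"a": 0.50, "b": 0.30, "c": 0.10, "d": 0.05}
toy_scores = {1: 0.60, 2: 0.70, 3: 0.795, 4: 0.80}
subset, curve = minimal_feature_subset(
    toy_importance, lambda feats: toy_scores[len(feats)]
)
print(subset)
```

Passing the metric in as a callback keeps the selection loop independent of any particular model or library, so the same loop could be run with accuracy or an F-beta score instead of AUROC.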