Abstract
INTRODUCTION: Heavy episodic drinking (HED) is a major public health concern but is often missing from surveys or measured unreliably. In such cases, predictive models offer a way to estimate the likelihood of HED at the individual level. While logistic regression is commonly used, other machine learning algorithms (MLAs) may offer greater accuracy and robustness. This study compares several MLAs to identify the best predictive model of HED. METHODS: Data from the 1997-2018 National Health Interview Survey were used. Six MLAs were trained and cross-validated: logistic regression, naïve Bayes, k-nearest neighbours, support vector machine, random forest and XGBoost. Model performance was compared, and the SHapley Additive exPlanations (SHAP) method was used to assess interpretability by ranking features according to their contribution to the model's predictions. RESULTS: The area under the receiver operating characteristic curve, that is, the probability of correctly ranking a randomly selected HED instance higher than a randomly selected non-HED instance, ranged from 0.85 to 0.97 across models (values closer to 1 indicate better performance). XGBoost outperformed the other MLAs (sensitivity 0.80, precision 0.83, accuracy 0.92). Amongst the 11 features included in the models, average daily alcohol use and age were the most influential, as determined by SHAP values. DISCUSSION AND CONCLUSIONS: The strong discriminative ability of our models shows that even a limited number of well-chosen features can yield robust predictions, highlighting the potential of MLAs for modelling health behaviours. Integrating our models into simulation frameworks can help model HED and test policy scenarios, which may inform more effective policies. Future studies should incorporate objective data sources for external validation and investigate systematic biases to improve predictive accuracy.
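For readers less familiar with the performance measures reported above, the following is a minimal sketch of how the ranking probability and the threshold-based metrics (sensitivity, precision, accuracy) can be computed from predicted probabilities. It is an illustration only, not the study's code: the function names `ranking_auc` and `threshold_metrics`, the 0.5 classification threshold and the toy labels and scores are all assumptions made for this example.

```python
from itertools import product


def ranking_auc(y_true, scores):
    """Probability that a random positive (HED) instance scores higher than
    a random negative (non-HED) instance; ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def threshold_metrics(y_true, scores, threshold=0.5):
    """Sensitivity, precision and accuracy at an assumed probability threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical toy data: 1 = HED, 0 = non-HED; scores are predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

auc = ranking_auc(y_true, scores)
metrics = threshold_metrics(y_true, scores)
print(f"ranking probability (AUC): {auc:.3f}")
print(metrics)
```

The pairwise comparison is quadratic in the number of instances, so it suits small illustrations; on survey-scale data a rank-based implementation from a standard library would be the more practical choice.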