Abstract
Prediabetes, defined by blood glucose levels that are elevated but below the diagnostic threshold for diabetes mellitus, is a major risk factor for developing diabetes. The condition is often clinically "silent," yet it can already harm various organ systems and frequently signals the impending onset of type 2 diabetes mellitus. This study compared a traditional statistical model, the Generalized Linear Mixed Model (GLMM), with two tree-based machine learning models, Random Forest (RF) and Generalized Mixed-Effects Random Forest (GMERF), for predicting prediabetes and identifying key risk indicators in longitudinal data. The study sample comprised 5361 individuals aged over 20 years, and 32 variables were considered. The target variable was the presence of prediabetes in a longitudinal setting. The three models differ in what they capture: RF is tree-based but does not account for repeated measurements; GLMM handles random effects but assumes linear relationships; and GMERF is a hybrid model that combines random effects with the nonlinearity of decision trees. Model performance was evaluated using standard predictive metrics. GMERF achieved the highest predictive performance: the area under the ROC curve (AUC) was 0.63 for RF, 0.70 for GLMM, and 0.74 for GMERF. In the GMERF model, the five most important predictors were waist-to-hip ratio (WHR), age, waist circumference, triglyceride level, and waist-to-height ratio (WHtR). WHR ranked as the most important feature in both the GMERF and RF models. All of these variables except WHtR were also significant in the GLMM. In longitudinal data, observations collected over time on the same individual are inherently dependent. Models that account for this dependence are better equipped to handle the complexities of longitudinal data and yield more reliable and accurate predictions.
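
The structural difference between the three models can be sketched in equation form. This is an illustration under assumed notation, not the specification used in the study: a logit link is assumed, $y_{ij}$ denotes the prediabetes status of individual $i$ at visit $j$, $x_{ij}$ the predictors, $z_{ij}$ the random-effects design vector, $b_i$ the individual-level random effects with covariance $\Psi$, and $f$ an unknown function estimated by a random forest.

```latex
% Assumed notation for illustration only; the link function and the
% random-effects structure are not stated in the abstract.
\begin{align*}
\text{RF:}    \quad & \Pr(y_{ij}=1 \mid x_{ij}) = f(x_{ij}) \\
\text{GLMM:}  \quad & \operatorname{logit}\Pr(y_{ij}=1 \mid b_i) = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i,
                      \qquad b_i \sim \mathcal{N}(0, \Psi) \\
\text{GMERF:} \quad & \operatorname{logit}\Pr(y_{ij}=1 \mid b_i) = f(x_{ij}) + z_{ij}^{\top} b_i,
                      \qquad b_i \sim \mathcal{N}(0, \Psi)
\end{align*}
```

Read this way, RF has no term tying together repeated measurements from the same individual, GLMM includes that term but restricts the fixed part to a linear predictor, and GMERF keeps the random-effects term while replacing the linear predictor with a forest-estimated function.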