Abstract
BACKGROUND/OBJECTIVES: Tools for ascertaining risk and prognosis in metabolic dysfunction-associated steatotic liver disease (MASLD) remain suboptimal. Using the UK Biobank, we investigated the role of multivariate supervised learning in predicting MASLD risk, and whether disease subgroups stratify the risk of comorbid outcomes.
METHODS: Two ground truth definitions for MASLD were derived using magnetic resonance imaging–proton density fat fraction (MRI-PDFF) and electronic health records (EHRs). Training and validation sets were created (60:40 split), with multiple imputation used for missing data (mean missingness: 2.05%). Variable importance analyses of partial least squares-discriminant analysis (PLS-DA) and random forest (RF) were used to select important clinical, genetic, haematological, biochemical, and metabolomic features of MASLD, which were subsequently used to train models to classify disease risk. The presence of MASLD subgroups was explored by K-means clustering, and their relationship with comorbid disease outcomes was investigated.
RESULTS: In total, 3799 and 2552 individuals were available for analysis under the two ground truth definitions. A total of 49 features were selected to train our final models. The area under the receiver operating characteristic curve (AUROC) for PLS-DA, RF, support vector machine (linear kernel), and logistic regression was ≥0.80 for all models under the MRI-PDFF definition and ≥0.85 under the MRI-PDFF and EHR definition for predicting MASLD. Unsupervised clustering of the features in MASLD identified two risk subgroups with a ∼3-fold difference in ischaemic stroke, hypertension, and extrahepatic malignancy, a ∼2-fold difference in hyperlipidaemia and type 2 diabetes mellitus (T2DM), and a >1.5-fold difference in ischaemic heart disease and all-cause mortality.
CONCLUSIONS: Using multivariate supervised learning models, we have identified features of MASLD in UK Biobank participants that can be used to predict the risk of comorbid diseases and outcomes. This provides insights into MASLD pathophysiology and highlights the potential of learning models for case identification and prognostication. Further external validation is required to assess the applicability of the predictive features and to optimise the models.
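The two summary statistics reported in RESULTS, the area under the receiver operating characteristic curve and the fold difference in outcome rates between subgroups, can be illustrated with a minimal sketch. This is not the study's analysis code: the function names `auroc` and `fold_difference` are hypothetical, and the numbers are synthetic toy values chosen only to show the arithmetic.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs in which the case
    receives the higher score; tied pairs count as half."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def fold_difference(events_a, n_a, events_b, n_b):
    """Ratio of outcome rates, subgroup A relative to subgroup B.
    Computed by cross-multiplication to limit floating-point error."""
    if n_a <= 0 or n_b <= 0 or events_b <= 0:
        raise ValueError("subgroup sizes and reference events must be positive")
    return (events_a * n_b) / (events_b * n_a)


# Synthetic toy data (not from the UK Biobank).
labels = [1, 1, 0, 0]          # 1 = MASLD case, 0 = control
scores = [0.9, 0.4, 0.5, 0.1]  # hypothetical model risk scores

print(auroc(labels, scores))              # 0.75 (3 of 4 pairs ranked correctly)
print(fold_difference(30, 100, 10, 100))  # 3.0  (30% vs 10% outcome rate)
```

An AUROC of 0.80 therefore means that, for a randomly chosen case and control, the model assigns the case the higher score about 80% of the time. A 3-fold difference means the outcome rate in one subgroup is three times that in the other.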