Abstract
Early identification of individuals at high risk for chronic diseases is crucial for prevention and intervention, yet current risk assessment tools are disease-specific, require extensive clinical data collection, and cannot provide multidisease risk profiles from a single measurement. Several protein large language models have been developed for tasks such as protein structure prediction, function prediction, and sequence design. However, none of these models can be directly applied in clinical settings to predict an individual's future disease risk. Here, we present a multimodal proteomics Transformer (Proformer) model that integrates protein expression, sequence, and function information for multidisease risk assessment. We trained Proformer on real proteomics data from 47 124 UK Biobank participants and evaluated its ability to discriminate the risk of 20 common chronic diseases. Compared with five common machine learning and deep learning models, Proformer achieved state-of-the-art (SOTA) performance for all 20 diseases. Compared with three common clinical predictors, Proformer's 10-year discriminative performance exceeded that of the Age + Sex model for 19 diseases, the ASCVD risk score for 16 diseases, and a panel of 35 clinical variables for 11 diseases. These results were replicated in the Scotland and Wales cohort of the UK Biobank. In conclusion, Proformer enables users to obtain a 10-year risk report for common chronic diseases directly from their individual proteomics data.