Abstract
Population-based studies are essential for evaluating the impact of the genetic and environmental determinants that influence the regulation of the human immune response. In a unique and highly selected cohort of healthy subjects, we applied a thoroughly benchmarked machine learning (ML) framework to identify the key predictive drivers of cytokine responses following Toll-like receptor (TLR) and T-cell receptor (TCR) stimulation. Patterns of cytokine response, or immunotypes, were observed across healthy individuals, with both low and high cytokine producers. Feature importance analysis revealed that TCR-induced predictions were mainly driven by genetic factors, while TLR-induced predictions were predominantly influenced by environmental and biological factors. The best-performing model achieved an average correlation of 0.53 for TCR-induced cytokines and 0.27 for TLR-induced responses. Adding biological and environmental data to genetic data improved prediction performance by 0.2 on average. However, we showed that ML models using genetic data may overestimate predictive accuracy. These findings were replicated in an independent cohort, the "Milieu Interieur" cohort. Notably, we also showed that polygenic scores for immune-mediated diseases failed to improve model performance, suggesting that the genetics underlying disease susceptibility do not fully capture the spectrum of functional immune response variability. Our findings define distinct genetic and environmental determinants of cytokine responses and demonstrate both the value and the limitations of ML models for modeling these responses.