Abstract
Microbial communities are integral to human health, biotechnology, and environmental systems, yet their analysis is hindered by data heterogeneity and batch effects across studies. Traditional supervised methods often fail to capture universal patterns, limiting their utility in diverse contexts. Here, we present the Microbial General Model (MGM), the first large-scale foundation model for microbiome analysis, pretrained on 260,000 samples using transformer-based language modeling. MGM employs self-attention mechanisms and autoregressive pretraining to learn contextualized representations of microbial compositions, enabling robust transfer learning for downstream tasks. Benchmark evaluations demonstrate MGM's superior performance over conventional methods (average ROC-AUC = 0.99 vs. 0.68–0.97) in microbial community classification, with enhanced generalization across geographic regions. MGM also captures spatial and temporal microbial dynamics, as evidenced by its application to a longitudinal infant cohort, where it delineated delivery-mode-specific microbiome trajectories and identified keystone genera such as Bacteroides and Bifidobacterium in vaginally delivered infants and Haemophilus in cesarean-delivered infants. Furthermore, through prompt-guided generation, MGM produced realistic microbial profiles conditioned on disease labels. By integrating self-supervised learning with domain-specific fine-tuning, MGM advances the scalability and precision of microbiome analyses, offering a unified framework for diagnostics, ecological studies, and therapeutic discovery.