Abstract
Metabolite alterations are linked to disease, yet large-scale untargeted metabolomics remains constrained by challenges in signal detection and in integrating diverse datasets, which limits the development of pre-trained generative models. Here, we introduce mzLearn, a data-driven MS¹ signal-detection and alignment method that runs directly from mzML files without user-set parameters. Across 15 public datasets, mzLearn detects an average of 11,442 signals, compared with 7,100 for XCMS and 4,655 for ASARI, with a higher proportion of true positives (TP; 89.0% vs. 77.4% vs. 49.6%) and a lower proportion of false positives (FP; 12.5% vs. 17.3% vs. 18.8%). It also corrects instrument drift across large cohorts without requiring experimental quality-control (QC) samples. mzLearn detected 2,736 robust metabolite signals from 22 public studies (20,548 blood samples), enabling the development of a pre-trained variational autoencoder for untargeted metabolomics. The learned metabolite representations reflected demographic data, and fine-tuning the model on unseen renal cell carcinoma data improved risk stratification and overall survival prediction, while feature-importance analysis (SHAP) highlighted biologically plausible lipid and carnitine signals. By producing a consistent, high-quality MS¹ feature matrix at scale, mzLearn paves the way for pre-trained foundation models for untargeted metabolomics.