Abstract
Somatic mutation profiling is central to cancer diagnosis and treatment selection. However, most studies focus on individual actionable mutations, overlooking the broader mutational context that shapes tumor evolution and treatment response. Here, we introduce OncoBERT, a language model that learns contextual representations of somatic mutations from large-scale clinical sequencing data spanning >210,000 patients, 113 cancer types, and 20 institutions. OncoBERT uncovers robust patient-specific mutational subtypes across diverse cohorts and targeted sequencing panels, revealing clinically meaningful mutation patterns that are associated with differential response to chemotherapy, targeted therapies, and immunotherapy. Importantly, integrating OncoBERT's contextual representations with clinically approved biomarkers of immunotherapy response, such as tumor mutational burden (TMB) and microsatellite instability (MSI), significantly improved prediction of clinical benefit. By further incorporating matched tumor transcriptomic profiles, we linked OncoBERT-defined mutational subtypes to distinct cancer hallmark programs and tumor microenvironment states. Together, these findings show that OncoBERT provides a scalable framework for deciphering somatic mutational landscapes, enabling improved patient stratification and advancing precision oncology.