Abstract
Somatic mutational signatures imprint the history of exogenous exposures and endogenous processes on the genome, offering critical insights into disease etiology and risk. However, accurate signature decomposition at the single-sample level remains challenging when mutation burden is low, sampling noise is high, and candidate catalogs are large and redundant. Here, we present SigFormer, a set-conditioned transformer framework designed to enable robust somatic mutation analysis without reliance on large cohorts. By leveraging a cross-attention mechanism between a customized reference input and the sample mutation profile, SigFormer improves exposure recovery and detection accuracy compared with likelihood-driven refitting (MuSiCal), with the largest performance gains in high-noise and overcomplete settings. On PCAWG genomes, SigFormer preserves major tissue-level structure while sensitively and accurately capturing the co-occurrence of low-abundance signatures, without the need for tumor-type-specific gating. In low-burden normal-tissue datasets spanning clonal expansion and microdissection studies, SigFormer maintains high accuracy and recovers stable tissue-dependent patterns of SBS1/SBS5/SBS40a, pointing to underlying tissue-specific mutagenic heterogeneity in normal tissues. Finally, SigFormer quantifies an explicit unattributable residual component when the catalog is incomplete, preventing forced allocation into flexible flat signatures and providing a useful signal for downstream analyses.