Abstract
BACKGROUND: Atlas-level single-cell investigations elucidate disease pathogenesis and progression. Accurate interpretation of phenotype-related single-cell data necessitates pre-defining cell subtypes and identifying their abundance variations. However, batch correction and clustering resolution biases can impact this interpretation. To overcome these challenges, an end-to-end integrative approach that combines both cell- and gene-level information is needed to more accurately connect single-cell characteristics to clinical phenotypes.
METHODS: We developed scPhase, a deep learning framework using attention-based multiple instance learning (AMIL). It treats each patient sample as a bag of single cells, learning a comprehensive representation from their gene expression profiles. By incorporating a Mixture-of-Experts (MoE) aggregation layer, it predicts clinical phenotypes that generalize across patient cohorts. Furthermore, it includes an interpretability framework that uses cellular attention and gene attribution scores to pinpoint the key cell profiles that drive its predictions.
RESULTS: We evaluated scPhase across diverse single-cell disease atlases, covering COVID-19 infection, aging, neurodegeneration, and oncology, using single-cell data from peripheral blood mononuclear cells (PBMCs), brain, and tumor tissues. The model consistently outperforms baselines in classifying diverse clinical phenotypes, achieving area under the curve (AUC) scores of 0.895 for COVID-19, 0.840 for Alzheimer’s disease, and 0.951 and 0.962 for lung and colorectal cancers. It shows robust performance in age regression with a Pearson correlation coefficient (PCC) of 0.87. The model’s interpretability framework effectively pinpointed clinically relevant cell populations, enhancing its utility in identifying disease-specific cellular signatures.
CONCLUSIONS: scPhase offers an interpretable supervised learning framework for single-cell data, accurately predicting sample-level clinical phenotypes while uncovering key biological mechanisms. Furthermore, it can be readily adapted for broader atlas-level clinical phenotype analyses.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13073-026-01598-x.
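ILLUSTRATIVE SKETCH: The bag-of-cells idea described under METHODS can be made concrete with a minimal example of attention-based pooling, in which each cell in a sample receives a learned weight and the sample representation is the weighted sum of its cells. This is a sketch of the generic attention pooling commonly used in multiple instance learning, not the scPhase implementation: the function name `attention_pool`, the parameters `V` and `w`, the dimensions, and the random inputs are all assumptions for illustration, and the Mixture-of-Experts aggregation layer and gene attribution scores are not modeled here.

```python
import math
import random


def attention_pool(cells, V, w):
    """Pool a bag of cell vectors into one sample vector.

    cells: list of K cell vectors, each of length d
    V:     hidden x d projection matrix (hypothetical parameter)
    w:     scoring vector of length hidden (hypothetical parameter)
    Returns (sample_vector, attention_weights).
    """
    # Score each cell k as s_k = w . tanh(V h_k)
    scores = []
    for h in cells:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))

    # Softmax over the cells in the bag (shifted by the max for stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Sample representation: attention-weighted sum of the cell vectors
    d = len(cells[0])
    pooled = [sum(a * h[j] for a, h in zip(attn, cells)) for j in range(d)]
    return pooled, attn


# Toy bag: 5 cells with 4 features each, random stand-ins for expression data
rng = random.Random(0)
d, hidden, n_cells = 4, 3, 5
cells = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_cells)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(hidden)]
w = [rng.gauss(0, 1) for _ in range(hidden)]

sample_vec, attn = attention_pool(cells, V, w)
```

In a trained model, `V` and `w` would be learned jointly with the phenotype predictor, and the per-cell weights in `attn` are the kind of quantity an attention-based interpretability analysis would inspect to rank cells within a sample.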