Abstract
To advance understanding of cellular heterogeneity in disease from single-cell sequencing data, we introduce residual principal-component analysis (ResidPCA), a robust method for identifying cell states that explicitly models cell-type heterogeneity. In simulations, ResidPCA achieved more than 4-fold higher accuracy than conventional PCA and over 3-fold higher accuracy than non-negative matrix factorization (NMF)-based methods in detecting states expressed across multiple cell types. Applied to single-cell RNA sequencing of light-stimulated mouse visual cortex cells, ResidPCA captured stimulus-driven variability with an accuracy more than 5-fold higher than NMF-based approaches. In single-nucleus datasets from an Alzheimer disease cohort, ResidPCA identified 44 chromatin accessibility-based states from single-nucleus ATAC-seq (snATAC-seq) and 42 transcriptional states from single-nucleus RNA-seq. Thirty snATAC-seq states were significantly enriched for Alzheimer disease heritability, often more so than established cell types such as microglia. The snATAC-seq state most significantly enriched for heritability further elucidates a recently implicated neuron-oligodendrocyte-microglial mechanistic axis, linking early amyloid production in neurons and oligodendrocytes with later microglial activation and immune response. These results highlight the ability of ResidPCA to uncover previously hidden biological variation in single-cell data and reveal disease-relevant cell states.
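To make the idea of explicitly modeling cell-type heterogeneity concrete, the following minimal sketch illustrates one way a residual PCA could be set up. It assumes the approach amounts to regressing cell-type indicators out of a normalized cells-by-genes matrix and then running ordinary PCA on the residuals. The abstract does not specify the procedure, so the function name `residual_pca`, its parameters, and the one-hot regression step are illustrative assumptions rather than the published implementation.

```python
import numpy as np

def residual_pca(X, cell_types, n_components=2):
    """Hypothetical sketch: PCA on expression residuals after removing cell-type means.

    X          : (cells x genes) normalized expression matrix (assumed input format)
    cell_types : length-cells array of cell-type labels
    """
    X = np.asarray(X, dtype=float)
    labels, codes = np.unique(np.asarray(cell_types), return_inverse=True)

    # One-hot design matrix with one indicator column per cell type.
    design = np.eye(len(labels))[codes]

    # Least-squares fit of every gene on the cell-type indicators.
    # With a pure one-hot design this recovers the per-type mean of each gene.
    beta, *_ = np.linalg.lstsq(design, X, rcond=None)

    # Residuals: variation not explained by cell-type identity.
    resid = X - design @ beta

    # PCA via SVD of the residual matrix (columns are already mean-centered,
    # because each cell type has been centered separately).
    U, S, Vt = np.linalg.svd(resid, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # per-cell state scores
    loadings = Vt[:n_components].T                    # per-gene loadings
    return scores, loadings, resid
```

Under these assumptions, the leading components of the residual matrix reflect variation shared within and across cell types, rather than differences between cell-type means, which is the kind of signal conventional PCA tends to spend its first components on.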