Abstract
Single-cell RNA sequencing (scRNA-seq) has transformed our understanding of cellular heterogeneity in complex biological systems. However, analyzing and integrating scRNA-seq data poses distinct computational challenges because the data are sparse, highly variable, and subject to technical batch effects. Here, we propose scDecorr, a framework for robust representation learning and data integration in scRNA-seq analysis. Our approach uses feature decorrelation-based self-supervised learning (SSL) to obtain efficient low-dimensional representations of individual cells without relying on cell-type annotations. By maximizing the similarity between embeddings of distorted views of each cell while decorrelating the embedding components, scDecorr captures the biological signal while suppressing technical noise. Furthermore, scDecorr incorporates unsupervised domain adaptation to bridge the gap between batches with different distributions, enabling effective integration of scRNA-seq data from diverse sources. The framework achieves domain-invariant representations by learning cell embeddings independently across domains and employing domain-specific batch normalization. We evaluate scDecorr on a variety of single-cell datasets and demonstrate that it integrates batches without losing the inherent biological variance, thereby facilitating downstream clustering. The representations generated by scDecorr are also robust in label transfer tasks, allowing cell-type labels to be transferred effectively from reference to query datasets. Overall, scDecorr offers a powerful tool for efficient analysis and integration of large and complex scRNA-seq datasets, advancing our understanding of cellular processes and disease mechanisms. The code is available at https://github.com/hayatlab/scdecorr.
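
To make the feature-decorrelation idea concrete, the following is a minimal sketch of one common form of such an objective, a Barlow Twins-style cross-correlation loss. The abstract does not specify the exact loss used by scDecorr, so this form is an assumption, as are the function name `decorrelation_loss`, the parameter `off_diag_weight`, and the use of NumPy in place of a deep learning framework. The two inputs stand for embeddings of two distorted views of the same batch of cells.

```python
import numpy as np


def decorrelation_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-8):
    """Barlow Twins-style loss for two (n_cells, n_dims) embedding views.

    Illustrative only: the name, signature, and default weight are assumptions,
    not taken from the scDecorr code base.
    """
    n = z_a.shape[0]
    # Standardize each embedding dimension across the batch of cells.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (n_dims, n_dims).
    c = z_a.T @ z_b / n
    # Invariance term: diagonal entries are pushed toward 1 (views agree).
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)
    # Decorrelation term: off-diagonal entries are pushed toward 0.
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    return float(on_diag + off_diag_weight * off_diag)


# Toy check on synthetic embeddings (random data, not scRNA-seq measurements).
rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
aligned = decorrelation_loss(z, z + 0.01 * rng.normal(size=z.shape))
unrelated = decorrelation_loss(z, rng.normal(size=(256, 8)))
```

In this toy setup, two nearly identical views give a much smaller loss than two unrelated ones, which is the behavior the objective rewards: agreement between views on the diagonal and low redundancy between embedding dimensions off the diagonal.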