Abstract
The exponential growth of single-cell transcriptomics data has made it essential to integrate heterogeneous datasets, both to construct large-scale single-cell reference atlases and to map query datasets onto those references. This integration is significantly hampered by batch effects, which introduce systematic biases and mask true biological signals. Moreover, most existing integration methods are largely confined to the latent space of highly variable genes, which restricts their capacity to correct the full transcriptomic landscape and may overlook crucial biological information encoded in genes with lower variability. We introduce scGES, a deep learning framework that corrects batch effects across the entire gene expression space by leveraging information from both highly and lowly variable genes. scGES consists of two models: scGESI for data integration and scGESM for query mapping. Comprehensive analyses of real data demonstrate that scGES outperforms state-of-the-art methods in both batch effect correction and conservation of biological variation, thereby enhancing downstream analyses and offering broader biological insights by using information from all genes.