Abstract
Spatial transcriptomics (ST) technologies have significantly advanced our ability to resolve gene expression patterns within intact tissue, enabling new insights into cellular heterogeneity and tissue architecture. However, accurately estimating cell-type proportions within spatially aggregated transcriptomic spots remains challenging because of inherent granularity discrepancies, batch effects, and spatial heterogeneity. To address these challenges, we introduce S$^{2}$potAE, a novel spatial spot autoencoder framework that integrates gene expression data, spatial coordinates, and morphological features from histology images for precise spot-level deconvolution. S$^{2}$potAE employs a multilevel feature aggregation strategy, systematically extracting and fusing spatially aware features from a graph-based spatial encoder and perceptual image embeddings of histological patches. In addition, an auxiliary pathological classification task enhances biological relevance and model interpretability. Comprehensive benchmarking across multiple simulated and real datasets (including human breast cancer, mouse anterior brain, and human dorsolateral prefrontal cortex) demonstrates that S$^{2}$potAE consistently surpasses state-of-the-art methods in accuracy, robustness, and biological interpretability. Our approach resolves complex cellular compositions, accurately identifies tumor boundaries, and captures nuanced cell-type distributions, enhancing the utility of ST in biological research and clinical applications.