Abstract
Advances in spatial transcriptomics have enabled high-resolution mapping of tissue architecture at the molecular level, yet integrating the resulting multi-modal data remains challenging. Here, we present stGCL, a framework for accurate and robust integration of gene expression, spatial coordinates, and histological features. stGCL employs a histology-based Vision Transformer to extract morphological features and a multi-modal graph autoencoder with contrastive learning for cross-modal fusion. In addition, we introduce a spatial coordinate correction and registration strategy to support multi-slice integration. We demonstrate that stGCL reliably identifies spatial domains, integrates vertical and horizontal tissue slices, and generalizes across platforms and resolutions.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-025-03896-w.