Abstract
Sequencing an entire spatial transcriptomics slide can cost thousands of dollars per assay, making routine use impractical. Focusing on smaller regions of interest (ROIs) selected from adjacent H&E slides offers a practical alternative, but (i) there is no reliable way to identify the most informative areas from standard H&E images alone, and (ii) clinicians have few tools for prioritizing the microenvironments of interest to them. Here we introduce SpatialFinder, a framework that combines a biomedical vision-language model (VLM) with a human-in-the-loop optimization pipeline to predict gene-expression heterogeneity and rank high-value ROIs across routine H&E tissue slides. Evaluated on four Visium HD tissue types, SpatialFinder consistently outperforms VLM-only baselines on both diversity- and tumor-targeted ROI ranking, achieving Spearman's ρ up to 0.89 and Overlap@10% up to 78.8%, an absolute 24.9-percentage-point gain over the strongest VLM. These results demonstrate the potential of human-AI collaboration to make spatial transcriptomics more cost-effective and clinically actionable.