Abstract
HER2 overexpression is a crucial biomarker in breast cancer diagnosis and treatment decision-making. In-situ hybridization (ISH) is widely used to determine HER2 gene amplification. We introduce the HER2-SISH40x dataset, based on VENTANA HER2 Dual probe ISH (DISH) staining and consisting of image patches curated from 50 whole slide images (WSIs) acquired at 40× magnification using a 3DHistech Pannoramic DESK scanner. Expert pathologists annotated 237 regions of interest (ROIs), each categorized as breast cancer with HER2 amplification (HER2/CEP17 ratio >2 and HER2 signals per cancer cell >4) or breast cancer without HER2 amplification (HER2/CEP17 ratio <2 and HER2 signals per cancer cell <4). An additional 300 Normal ROIs were extracted from both Amplified and Non-Amplified WSIs. The dataset is suitable for developing and evaluating computational pathology methods, particularly deep learning models for automated HER2 scoring, segmentation, and classification. It has already been applied in several research studies [1-3] addressing color normalization, cancer-region classification, and automated HER2 signal quantification. Ethical clearance for the research was obtained from the University Malaya Medical Center. The HER2-SISH40x dataset offers a valuable resource for advancing digital pathology workflows and personalized breast cancer diagnosis.
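As an illustration of the two labelling criteria stated above, the following minimal Python sketch applies them to signal counts from a single ROI. It is not the annotation procedure used by the pathologists or code shipped with the dataset. The names `RoiCounts` and `classify_roi` are hypothetical, and the `"Indeterminate"` label for counts that satisfy neither criterion is an assumption, since the abstract defines only the two categories.

```python
from dataclasses import dataclass


@dataclass
class RoiCounts:
    """Hypothetical container for signal counts within one cancer ROI."""
    her2_signals: int   # total HER2 signals counted in the ROI
    cep17_signals: int  # total CEP17 signals counted in the ROI
    cancer_cells: int   # number of cancer cell nuclei evaluated


def classify_roi(counts: RoiCounts) -> str:
    """Apply the two criteria from the abstract to one ROI."""
    if counts.cep17_signals <= 0 or counts.cancer_cells <= 0:
        raise ValueError("CEP17 signal count and cell count must be positive")

    ratio = counts.her2_signals / counts.cep17_signals
    her2_per_cell = counts.her2_signals / counts.cancer_cells

    # Amplified: HER2/CEP17 > 2 and HER2 signals per cancer cell > 4
    if ratio > 2 and her2_per_cell > 4:
        return "Amplified"
    # Non-Amplified: HER2/CEP17 < 2 and HER2 signals per cancer cell < 4
    if ratio < 2 and her2_per_cell < 4:
        return "Non-Amplified"
    # Assumption: the abstract does not say how mixed cases are labelled
    return "Indeterminate"


if __name__ == "__main__":
    example = RoiCounts(her2_signals=120, cep17_signals=40, cancer_cells=20)
    print(classify_roi(example))
```

In the example call, the ratio is 120/40 = 3 and the mean HER2 count is 120/20 = 6 signals per cell, so both amplification conditions hold.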