Abstract
MOTIVATION: Crohn's disease (CD) exhibits substantial variability in response to biological therapies such as ustekinumab (UST), a monoclonal antibody targeting interleukin-12/23. Predicting individual treatment response remains difficult because reliable histopathological biomarkers are lacking and tissue morphology is complex. Recent deep learning methods have leveraged whole-slide images (WSIs), but most lack effective mechanisms for selecting relevant regions and for integrating patch-level evidence into robust patient-level predictions. A framework that captures both local histological cues and global tissue context is therefore needed to improve prediction performance.
RESULTS: We propose a novel clustering-enhanced weakly supervised learning framework to predict UST treatment response from pre-treatment WSIs of CD patients. First, WSI patches were encoded with a pre-trained vision foundation model, and k-means clustering was applied to identify representative morphological patterns. Patches discriminative of treatment outcome were then selected with a DenseNet-based classifier, with Grad-CAM used to enhance interpretability. Patch-level predictions were aggregated with a multi-instance learning approach, in which whole-slide features were extracted as patch likelihood histograms and bag-of-words representations. These features were then used to train a classifier for the final response prediction. On an independent test set, the WSI-level model achieved an AUC of 0.938 (95% CI: 0.879-0.996), a sensitivity of 0.951, and a specificity of 0.825, outperforming baseline patch-level models. These findings suggest that the method enables accurate, interpretable, and scalable prediction of biological therapy response in CD, potentially supporting personalized treatment strategies in clinical settings.
AVAILABILITY AND IMPLEMENTATION: https://github.com/caicai2526/USTAIM.
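ILLUSTRATIVE SKETCH: The aggregation step described under RESULTS, in which patch-level outputs are turned into a fixed-length slide-level feature vector, can be sketched as follows. This is a minimal illustration and not the code from the linked repository. The function names, the bin count, and the choice to treat k-means cluster labels as the "visual words" of the bag-of-words representation are all assumptions, since the abstract does not specify them.

```python
from typing import List, Sequence


def patch_likelihood_histogram(probs: Sequence[float], n_bins: int = 10) -> List[float]:
    """Normalised histogram of patch-level response probabilities in [0, 1].

    Hypothetical helper: the bin count is an assumption, not a value from the paper.
    """
    if not probs:
        raise ValueError("a slide needs at least one patch")
    counts = [0] * n_bins
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # p == 1.0 would index one past the end, so clamp it into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(probs) for c in counts]


def bag_of_words(cluster_ids: Sequence[int], n_clusters: int) -> List[float]:
    """Normalised frequency of each visual word within one slide.

    Assumption: a patch's visual word is its k-means cluster label.
    """
    if not cluster_ids:
        raise ValueError("a slide needs at least one patch")
    counts = [0] * n_clusters
    for c in cluster_ids:
        if not 0 <= c < n_clusters:
            raise ValueError(f"cluster id out of range: {c}")
        counts[c] += 1
    return [c / len(cluster_ids) for c in counts]


def slide_features(
    probs: Sequence[float],
    cluster_ids: Sequence[int],
    n_bins: int = 10,
    n_clusters: int = 8,
) -> List[float]:
    """Concatenate both representations into one slide-level feature vector."""
    if len(probs) != len(cluster_ids):
        raise ValueError("expected one probability and one cluster id per patch")
    return patch_likelihood_histogram(probs, n_bins) + bag_of_words(cluster_ids, n_clusters)
```

Under these assumptions, every slide yields a vector of length `n_bins + n_clusters` regardless of how many patches it contains, so any standard classifier can be trained on the result for the final response prediction.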