Abstract
BACKGROUND AND OBJECTIVE: Accurate classification of intestinal polyps is crucial for preventing colorectal cancer but is hindered by visual similarity among subtypes and by variability in endoscopic imaging. Although deep learning aids diagnosis, single-modal models face an efficiency-accuracy trade-off and ignore pathological semantics. We propose a multimodal framework that integrates endoscopic images with structured pathological descriptions to bridge this gap.

METHODS: The proposed LPA-Tuning CLIP incorporates three key innovations: (i) replacing CLIP's instance-level contrastive loss with a cross-modal projection matching (CMPM) loss combined with an ID loss, which explicitly optimizes intraclass compactness and interclass separation through label-aware image-text similarity matrices; (ii) structured clinical semantic templates that encode WHO diagnostic criteria into hierarchical text prompts for consistent pathology annotations; and (iii) medical-aware augmentation that preserves lesion features while reducing domain shift.

RESULTS: On the internal test set, the proposed method achieves an accuracy of 85.8% and an F1 score of 0.862, establishing a new state of the art for intestinal polyp classification.

CONCLUSIONS: This study introduces a multimodal polyp classification paradigm that, through joint representation learning over endoscopic images and pathology text, achieves 85.8% accuracy on three-subtype classification, outperforming unimodal baselines by 8.7% and a multimodal baseline by 4.3%.
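To make the matching objective concrete, the following is a minimal sketch of the standard CMPM formulation from the image-text matching literature, combined with an ID (cross-entropy) loss; the symmetric text-to-image term and the weighting $\lambda$ are our assumptions, not details stated in the abstract.

\begin{align}
p_{i,j} &= \frac{\exp\left(x_i^{\top}\bar{z}_j\right)}{\sum_{k=1}^{n}\exp\left(x_i^{\top}\bar{z}_k\right)}, \qquad \bar{z}_j = \frac{z_j}{\lVert z_j \rVert_2}, \\
q_{i,j} &= \frac{y_{i,j}}{\sum_{k=1}^{n} y_{i,k}}, \qquad
\mathcal{L}_{\mathrm{CMPM}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} p_{i,j}\,\log\frac{p_{i,j}}{q_{i,j}+\varepsilon},
\end{align}

where $x_i$ and $z_j$ are image and text embeddings in a batch of size $n$, and $y_{i,j}=1$ when image $i$ and text $j$ share a subtype label (the label-aware image-text similarity matrix referenced above). The full CMPM loss adds the analogous text-to-image term, and the ID loss classifies each modality's embedding into the polyp subtypes, e.g. $\mathcal{L}_{\mathrm{ID}} = \mathrm{CE}(Wx_i, y_i) + \mathrm{CE}(Wz_i, y_i)$ with a shared classifier $W$, giving a total objective of the form $\mathcal{L} = \mathcal{L}_{\mathrm{CMPM}} + \lambda\,\mathcal{L}_{\mathrm{ID}}$.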