Abstract
Esophageal whitish plaques are a common finding in large-scale esophageal cancer screening and require accurate preliminary differentiation to guide clinical management. This study presents a computer-aided diagnosis (CAD) system, built on the pre-trained large-scale vision-language (VL) model BLIP, for the automated diagnosis and description of esophageal whitish plaques. A dataset of 13,922 endoscopic images was used for model training, and comparative experiments were conducted against multiple benchmark models, including Poolformer, Swin-Transformer, TransMSF, and ViT. The results show that our approach outperforms these benchmark models in precision, recall, F1 score, and accuracy. Compared with LLaVA-Med, our model significantly improves keyword accuracy (K-ACC) in medical text descriptions. A human-machine competition further showed that our model outperformed both senior and junior endoscopists, particularly in the recall of early esophageal cancer cases. These findings suggest that integrating pre-trained VL models into CAD systems can improve the accuracy and efficiency of esophageal whitish plaque diagnosis, reducing misdiagnoses and supporting clinical decision-making.