A Vision-Language-Guided Multimodal Fusion Network for Glottic Carcinoma Early Diagnosis: Model Development and Validation Study


Abstract

BACKGROUND: Early diagnosis and intervention in glottic carcinoma (GC) can significantly improve long-term prognosis. However, accurate diagnosis of early GC is challenging due to its morphological similarity to vocal cord dysplasia, a difficulty further exacerbated in medically underserved areas. OBJECTIVE: This study aims to address the limitations of existing technologies by designing a vision-language multimodal model, providing a more efficient and accurate early diagnostic method for GC. METHODS: The data used in this study were sourced from the information system of the First Affiliated Hospital of Sun Yat-sen University, comprising laryngoscopy reports and 5796 laryngoscopic images from 404 patients with glottic lesions. We propose a vision-language-guided multimodal fusion network (VLMF-Net) based on a large vision-language model for the early automated diagnosis of GC. The text processing module of this model uses the pretrained Large Language Model Meta AI (LLaMA) to generate text vector representations, while the image processing module uses a pretrained vision transformer to extract features from laryngoscopic images, with cross-modal alignment achieved through the Q-Former module. A feature fusion module then deeply integrates the text and image features, enabling classification diagnosis. To validate the model's performance, we selected contrastive language-image pretraining (CLIP), bootstrapping language-image pretraining with frozen image encoders and large language models (BLIP-2), large-scale image and noisy-text embedding (ALIGN), and the vision-and-language transformer (VILT) as baseline methods for experimental evaluation on the same dataset, with comprehensive performance assessment conducted using accuracy, recall, precision, F1-score, and area under the curve.
RESULTS: We found that on the internal test set, the VLMF-Net model significantly outperformed existing methods with an accuracy of 77.6% (CLIP: 70.5%; BLIP-2: 71.5%; ALIGN: 67.3%; and VILT: 64.3%), achieving a 6.1-percentage point improvement over the best baseline model (BLIP-2). On the external test set, our method also demonstrated robust performance, achieving an accuracy of 73.9%, which is 4.6 percentage points higher than the second-best model (BLIP-2: 69.3%). This indicates that our model surpasses these methods in the early diagnosis of GC and exhibits strong generalization ability and robustness. CONCLUSIONS: The proposed VLMF-Net model can be effectively used for the early diagnosis of GC, helping to address the challenges in its early detection.
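The pipeline described in the abstract (text encoder → image encoder → Q-Former-style cross-modal alignment → feature fusion → classifier) can be sketched in miniature. This is not the authors' implementation: all dimensions, the single-head attention bridge, and the mean-pool-plus-concatenate fusion are illustrative assumptions standing in for the pretrained LLaMA/ViT encoders and the learned fusion module.

```python
import numpy as np

# Hypothetical dimensions -- the paper does not state these; illustrative only.
D = 64          # shared embedding dimension
N_PATCH = 16    # number of image patch features from the ViT
N_QUERY = 4     # number of learnable queries in the Q-Former-style bridge
N_CLASS = 2     # e.g. early glottic carcinoma vs. vocal cord dysplasia

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_bridge(queries, patches):
    """Single cross-attention step: learnable queries attend to the ViT
    patch features, producing a fixed-size visual summary (N_QUERY x D)."""
    attn = softmax(queries @ patches.T / np.sqrt(D))   # (N_QUERY, N_PATCH)
    return attn @ patches                              # (N_QUERY, D)

def fuse_and_classify(text_vec, visual_tokens, w_cls):
    """Mean-pool the visual summary, concatenate it with the text vector,
    and apply a linear classifier head with softmax."""
    visual_vec = visual_tokens.mean(axis=0)            # (D,)
    fused = np.concatenate([text_vec, visual_vec])     # (2D,)
    return softmax(fused @ w_cls)                      # (N_CLASS,)

# Random stand-ins for real encoder outputs (LLaMA text vector, ViT patches).
text_vec = rng.normal(size=D)
patches = rng.normal(size=(N_PATCH, D))
queries = rng.normal(size=(N_QUERY, D))
w_cls = rng.normal(size=(2 * D, N_CLASS)) * 0.1

visual_tokens = qformer_bridge(queries, patches)
probs = fuse_and_classify(text_vec, visual_tokens, w_cls)
print(probs.shape, float(probs.sum()))
```

In the actual model the queries, attention weights, and classifier head are trained end to end; the sketch only shows how a variable-length patch grid is compressed to a fixed-size visual token set before fusion with the report text.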
