Enhancing Document Classification Through Multimodal Image-Text Classification: Insights from Fine-Tuned CLIP and Multimodal Deep Fusion

Abstract

Foundation models excel on general benchmarks but often underperform in clinical settings due to domain shift between internet-scale pretraining data and medical data. Multimodal deep learning, which jointly leverages medical images and clinical text, is promising for diagnosis, yet it remains unclear whether domain adaptation is better achieved by fine-tuning large vision-language models or by training lighter, task-specific architectures. We address this question by introducing PairDx, a balanced dataset of 22,665 image-caption pairs spanning six medical document classes, curated to reduce class imbalance and support fair, reproducible comparisons. Using PairDx, we develop and evaluate two approaches: (i) PairDxCLIP, a fine-tuned CLIP (ViT-B/32), and (ii) PairDxFusion, a custom hybrid model that combines ResNet-18 visual features and GloVe text embeddings with attention-based fusion. Both adapted models substantially outperform a zero-shot CLIP baseline (61.18% accuracy) and a specialized model, BiomedCLIP, which serves as an additional baseline and achieves 66.3% accuracy. Our fine-tuned CLIP (PairDxCLIP) attains 93% accuracy and our custom fusion model (PairDxFusion) reaches 94% accuracy on a held-out test set. Notably, PairDxFusion achieves this high accuracy with 17 min, 55 s of training time, nearly four times faster than PairDxCLIP (65 min, 52 s), highlighting a practical efficiency-performance trade-off for clinical deployment. Its inference time likewise outperforms that of the specialized BiomedCLIP model (0.387 s/image). Our results demonstrate that carefully constructed domain-specific datasets and lightweight multimodal fusion can close the domain gap while reducing computational cost in healthcare decision support.
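The abstract names the ingredients of PairDxFusion (ResNet-18 visual features, GloVe text embeddings, attention-based fusion, six classes) but not the exact architecture. The following is a minimal PyTorch sketch of one plausible attention-based fusion head under stated assumptions: 512-d pooled ResNet-18 features, 300-d GloVe embeddings, a scalar attention weight per modality, and a hidden size of 256. The class name `AttentionFusion` and all dimensions beyond the six output classes are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical sketch of an attention-based fusion head: project image
    and text features into a shared space, weight each modality with a
    learned attention score, and classify into six document classes."""

    def __init__(self, img_dim=512, txt_dim=300, hidden=256, num_classes=6):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # ResNet-18 pooled features are 512-d
        self.txt_proj = nn.Linear(txt_dim, hidden)   # GloVe vectors are commonly 300-d
        self.attn = nn.Linear(hidden, 1)             # scalar attention score per modality
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, img_feat, txt_feat):
        # Stack the two projected modality vectors: shape (batch, 2, hidden)
        feats = torch.stack(
            [torch.relu(self.img_proj(img_feat)),
             torch.relu(self.txt_proj(txt_feat))], dim=1)
        # Softmax over the modality axis yields per-modality weights (batch, 2, 1)
        weights = torch.softmax(self.attn(feats), dim=1)
        fused = (weights * feats).sum(dim=1)          # attention-weighted sum
        return self.classifier(fused)

model = AttentionFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 6])
```

In this sketch the attention mechanism lets the network learn, per example, how much to trust the image versus the caption, which is one common reading of "attention-based fusion"; the actual PairDxFusion model may differ in depth, dimensions, and attention form.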
