Cross-Modal Data Fusion via Vision-Language Model for Crop Disease Recognition


Abstract

Crop diseases pose a significant threat to agricultural productivity and global food security. Timely and accurate disease identification is crucial for improving crop yield and quality. While most existing deep learning-based methods focus primarily on image datasets for disease recognition, they often overlook the complementary role of textual features in enhancing visual understanding. To address this problem, we propose cross-modal data fusion via a vision-language model for crop disease recognition. Our approach leverages the Zhipu.ai multimodal model to generate comprehensive textual descriptions of crop leaf diseases, covering a global description, a local lesion description, and a color-texture description. These descriptions are encoded into feature vectors, while an image encoder extracts image features. A cross-attention mechanism then iteratively fuses the multimodal features across multiple layers, and a classification prediction module produces class probabilities. Extensive experiments on the Soybean Disease, AI Challenge 2018, and PlantVillage datasets demonstrate that our method outperforms state-of-the-art image-only approaches with higher accuracy and fewer parameters. Specifically, with only 1.14M parameters, our model achieves recognition accuracies of 98.74%, 87.64%, and 99.08% on the three datasets, respectively. These results highlight the effectiveness of cross-modal learning in leveraging both visual and textual cues, offering a precise, efficient, and scalable solution for crop disease recognition.
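The fusion pipeline described above can be made concrete with a short sketch. The following minimal PyTorch example is not the paper's implementation: the module names, feature dimension, number of attention heads and fusion layers, and the mean-pooling readout are all assumptions made for illustration. Only the overall flow follows the abstract: image features act as queries, the encoded text descriptions act as keys and values, cross-attention is applied iteratively across layers, and a classification head produces class probabilities.

```python
# A minimal sketch of the cross-modal fusion pipeline, assuming image and
# text encoders that each emit token sequences of a shared dimension.
# All hyperparameters here are illustrative, not the paper's.
import torch
import torch.nn as nn

class CrossModalFusionClassifier(nn.Module):
    """Image tokens attend to encoded text descriptions (global, local
    lesion, color-texture) over several cross-attention layers; a linear
    head then predicts the disease class."""

    def __init__(self, dim=256, num_heads=4, num_layers=3, num_classes=10):
        super().__init__()
        # One cross-attention block and layer norm per fusion layer.
        self.fusion_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, img_tokens, text_tokens):
        # img_tokens:  (B, N_img, dim) from an image encoder
        # text_tokens: (B, N_txt, dim) from a text encoder over the
        #              VLM-generated disease descriptions
        x = img_tokens
        for attn, norm in zip(self.fusion_layers, self.norms):
            # Image features query the textual features (keys/values).
            fused, _ = attn(query=x, key=text_tokens, value=text_tokens)
            x = norm(x + fused)  # residual connection + layer norm
        # Pool the fused image tokens and predict class probabilities.
        return self.classifier(x.mean(dim=1)).softmax(dim=-1)

# Toy usage with random tensors standing in for encoder outputs:
model = CrossModalFusionClassifier()
probs = model(torch.randn(2, 49, 256), torch.randn(2, 3, 256))
print(probs.shape)  # torch.Size([2, 10])
```

In this reading, keeping the image features as queries lets each spatial region pull in whichever textual cue (overall appearance, lesion shape, or color-texture) best explains it, which matches the abstract's claim that text complements visual understanding.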
