Abstract
Plant diseases remain a major constraint on crop productivity, requiring timely and accurate diagnostic approaches to secure agricultural yields. Existing automated diagnosis methods rely primarily on image data and achieve notable results, but their performance often declines in complex field environments with noise and interference. Multimodal learning offers a promising solution by integrating complementary cues from multiple data sources. However, the heterogeneity between plant phenotypes and other modalities, such as textual descriptions, poses a significant challenge for effective fusion. To address this issue, we propose PlantIF, a multimodal feature interactive fusion model for plant disease diagnosis based on graph learning. PlantIF comprises three key components: image and text feature extractors, semantic space encoders, and a multimodal feature fusion module. Specifically, we employ pre-trained image and text extractors to obtain visual and textual features enriched with prior knowledge of plant diseases. Semantic space encoders then map these features into both shared and modality-specific spaces, capturing cross-modal and modality-unique semantic information. To enhance contextual understanding, we design a multimodal feature fusion module that processes and fuses the semantic information from each modality and then captures the spatial dependencies between plant phenotypes and textual semantics via a self-attention graph convolutional network. We evaluate PlantIF on a multimodal plant disease dataset comprising 205,007 images and 410,014 texts, where it achieves 96.95% accuracy, 1.49% higher than existing models. These results demonstrate the potential of multimodal learning for plant disease diagnosis and highlight PlantIF's value in precision agriculture. Code is available at https://github.com/GZU-SAMLab/PlantIF.
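
For concreteness, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: per-modality projections into shared and modality-specific spaces, followed by fusion with a graph convolution whose adjacency is induced by self-attention. All module names, feature dimensions, the class count, and the attention-as-adjacency formulation are illustrative assumptions, not the authors' released implementation; see the repository link above for the actual code.

```python
# A minimal, illustrative sketch of the PlantIF pipeline described above.
# All names, dimensions, and design details here are assumptions for
# illustration only, NOT taken from the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticSpaceEncoder(nn.Module):
    """Projects one modality's features into shared and modality-specific spaces."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)    # cross-modal semantics
        self.specific = nn.Linear(in_dim, out_dim)  # modality-unique semantics

    def forward(self, x: torch.Tensor):
        return self.shared(x), self.specific(x)


class SelfAttentionGraphConv(nn.Module):
    """Graph convolution whose adjacency matrix is produced by self-attention,
    modeling dependencies between visual and textual semantic nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.gc = nn.Linear(dim, dim)  # graph-convolution weight

    def forward(self, nodes: torch.Tensor):  # nodes: (batch, n_nodes, dim)
        # Scaled dot-product attention scores serve as a learned adjacency.
        adj = torch.softmax(
            self.q(nodes) @ self.k(nodes).transpose(1, 2) / nodes.size(-1) ** 0.5,
            dim=-1,
        )
        return F.relu(self.gc(adj @ nodes))  # propagate along learned edges


class PlantIFSketch(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, dim=256, n_classes=38):
        super().__init__()
        # Inputs are assumed to come from frozen pre-trained extractors
        # (e.g. a vision backbone and a text backbone), so only the
        # downstream encoders and fusion module are defined here.
        self.img_enc = SemanticSpaceEncoder(img_dim, dim)
        self.txt_enc = SemanticSpaceEncoder(txt_dim, dim)
        self.fusion = SelfAttentionGraphConv(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, img_feat, txt_feat):
        i_sh, i_sp = self.img_enc(img_feat)
        t_sh, t_sp = self.txt_enc(txt_feat)
        # Treat the four semantic embeddings as graph nodes and fuse them.
        nodes = torch.stack([i_sh, i_sp, t_sh, t_sp], dim=1)
        fused = self.fusion(nodes).mean(dim=1)  # pool node representations
        return self.head(fused)


# Usage with dummy pre-extracted features:
model = PlantIFSketch()
logits = model(torch.randn(4, 768), torch.randn(4, 768))  # -> (4, 38)
```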