Abstract
With the rapid growth of multimodal data on the web, cross-modal retrieval has become increasingly important for applications such as multimedia search and content recommendation. It aims to align visual and textual features so that semantically relevant content can be retrieved across modalities, yet this remains challenging due to the inherent heterogeneity and semantic gap between image and text representations. This paper proposes a mutual contextual relation-guided dynamic graph network that integrates ViT, BERT, and a GCNN to construct a unified and interpretable multimodal representation space for image-text matching. The ViT and BERT features are organized into a dynamic cross-modal feature graph (DCMFG), in which nodes represent image and text features and edges are dynamically updated based on mutual contextual relations, i.e., neighborhood relations extracted with k-nearest neighbors (KNN). An attention-guided mechanism refines the graph connections, ensuring adaptive and context-aware alignment between modalities. The mutual contextual relations identify relevant neighborhood structures among image and text nodes, enabling the graph to capture both local and global associations, while the attention mechanism dynamically weights edges to strengthen the propagation of important cross-modal interactions. Emphasizing meaningful edges between nodes of different modalities also improves interpretability by revealing how image regions and text features interact. This approach overcomes the limitations of existing models that rely on static feature alignment and insufficient modeling of contextual relationships. Experiments on the benchmark datasets MirFlickr-25K and NUS-WIDE demonstrate significant improvements over state-of-the-art methods in precision and recall, validating the effectiveness of the proposed approach for cross-modal retrieval.
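To make the graph-construction idea concrete, the sketch below illustrates one possible way to build a dynamic cross-modal feature graph: projected image (ViT) and text (BERT) features share one node set, edges come from mutual KNN neighborhoods, attention re-weights those edges, and a GCN-style step propagates cross-modal context. The feature dimensions, the mutual-KNN rule, and the helper names (knn_adjacency, attention_weighted_edges, gcn_layer) are illustrative assumptions, not the paper's exact DCMFG implementation.

```python
# Minimal sketch of KNN-based dynamic graph construction with attention-weighted
# edges. Dimensions, the mutual-KNN rule, and all function names are assumptions
# for illustration; they do not reproduce the paper's DCMFG details.
import torch
import torch.nn.functional as F


def knn_adjacency(feats: torch.Tensor, k: int) -> torch.Tensor:
    """Edge (i, j) exists when j is among the k nearest neighbours of i AND
    i is among the k nearest neighbours of j (a mutual contextual relation
    in cosine-similarity space)."""
    sim = F.normalize(feats, dim=-1) @ F.normalize(feats, dim=-1).T  # (N, N)
    sim.fill_diagonal_(-float("inf"))                                # no self loops from KNN
    topk = sim.topk(k, dim=-1).indices                               # (N, k)
    nn_mask = torch.zeros_like(sim, dtype=torch.bool)
    nn_mask.scatter_(1, topk, True)
    return (nn_mask & nn_mask.T).float()                             # keep mutual neighbours only


def attention_weighted_edges(feats: torch.Tensor, adj: torch.Tensor,
                             w_q: torch.Tensor, w_k: torch.Tensor) -> torch.Tensor:
    """Re-weight existing edges with scaled dot-product attention so that more
    informative cross-modal connections propagate more strongly."""
    q, kmat = feats @ w_q, feats @ w_k
    scores = (q @ kmat.T) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(adj == 0, -float("inf"))
    attn = torch.softmax(scores, dim=-1)
    return torch.nan_to_num(attn)                                    # rows with no edges -> 0


def gcn_layer(feats: torch.Tensor, edge_w: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """One graph-convolution step: aggregate neighbours with the attention
    weights, then apply a shared linear transform and ReLU."""
    agg = edge_w @ feats + feats                                     # neighbour messages + self
    return F.relu(agg @ w)


if __name__ == "__main__":
    d = 256                                    # shared embedding dimension (assumed)
    img_feats = torch.randn(32, d)             # e.g. projected ViT region features
    txt_feats = torch.randn(20, d)             # e.g. projected BERT token features
    nodes = torch.cat([img_feats, txt_feats])  # image and text nodes share one graph

    adj = knn_adjacency(nodes, k=8)            # dynamic edges from mutual contextual relations
    w_q, w_k, w = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
    edge_w = attention_weighted_edges(nodes, adj, w_q, w_k)
    refined = gcn_layer(nodes, edge_w, w)      # context-aware node embeddings
    print(refined.shape)                       # torch.Size([52, 256])
```

In this sketch the adjacency is recomputed from the current node features, so the graph changes as the features change, which is one simple way to realize the "dynamic" edge updates described above.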