Abstract
Cross-modal retrieval, particularly image-text matching, is crucial to multimedia analysis and artificial intelligence, with applications in intelligent search and human-computer interaction. Existing methods often overlook the rich semantic relationships among labels, which limits discriminability. We introduce a Two-Layer Graph Convolutional Network (L2-GCN) to model label correlations, together with a hybrid loss function, Circle-Soft, that enhances cross-modal alignment and discriminability. Extensive experiments on the NUS-WIDE, MIRFlickr, and MS-COCO datasets demonstrate the effectiveness of our approach: the proposed method consistently outperforms current baselines, with accuracy improvements of 0.5%, 0.5%, and 1.0%, respectively. The source code is available at https://github.com/buzzcut619/L2-GCN-CIRCLE-SOFT.