Abstract
This paper introduces a Multi-Hop Reasoning Framework for Composed Fashion Image Retrieval (CFIR), designed to overcome the limitations of existing single-step and hierarchical retrieval methods when handling complex multimodal queries. Traditional CFIR approaches often struggle to accurately interpret the interplay between textual modification descriptions and visual content in fashion datasets. Our method applies multi-hop reasoning to iteratively refine the retrieval process, enabling a deeper and more nuanced integration of visual and textual information. This structured approach not only strengthens the model's interpretative capabilities but also improves its ability to discern subtle relationships between reference and target images across diverse modification descriptions. By decomposing retrieval into multiple reasoning steps, the framework handles the compositionality inherent in fashion-related queries, yielding higher retrieval accuracy. We validate our approach through extensive experiments on three fashion image datasets: Fashion-IQ, Shoes, and Fashion200k. The results show consistent improvements over state-of-the-art methods, highlighting the potential of our multi-hop reasoning framework to set a new benchmark for composed image retrieval.