InstructSee: Instruction-Aware and Feedback-Driven Multimodal Retrieval with Dynamic Query Generation.

In recent years, cross-modal retrieval has garnered significant attention due to its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to accurately capture user intent when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs). The framework dynamically constructs and iteratively refines query representations conditioned on natural language instructions and guided by user feedback, thereby enabling the system to effectively infer and adapt to implicit retrieval intent. Extensive experiments on standard multimodal retrieval benchmarks demonstrate that our method significantly improves retrieval accuracy and adaptability, outperforming fixed-query baselines and showing enhanced cross-modal alignment and generalization across diverse retrieval tasks.

Journal:	Sensors	Impact factor:	3.500
Year:	2025	Citation:	2025 Aug 21; 25(16):5195
doi:	10.3390/s25165195
