Abstract
The proliferation of intelligent sensor networks in urban surveillance and remote sensing has driven explosive growth in unstructured visual sensor data. Accurately retrieving targets from these massive streams based on complex cross-modal user intents remains a critical bottleneck for efficient intelligent perception. Composed Image Retrieval (CIR) addresses this by enabling retrieval via a multi-modal query that combines a reference image with semantic control signals. However, existing methods often struggle with abstract instructions in real-world scenarios. As a result, they suffer from feature distribution shifts due to focus ambiguity, as well as semantic erosion caused by highly entangled visual and textual features. To address these challenges, we propose a geometry-based Selective Orthogonal Projection Network (SOP). First, the Selective Focus Recovery module quantifies instruction uncertainty via information entropy and calibrates shifted query features to the true target distribution using structural consistency regularization. Second, to ensure data fidelity, we introduce Orthogonal Subspace Projection and Geometric Composition Fidelity. These mechanisms employ Gram-Schmidt orthogonalization to decouple features into a constant visual base and an orthogonal modification increment, restricting semantic modifications to the null space. Extensive experiments on the FashionIQ, Shoes, and CIRR datasets demonstrate that SOP significantly outperforms state-of-the-art methods, offering a novel solution for efficient large-scale sensor data retrieval and analysis.
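The two geometric ideas named above, entropy as an uncertainty measure and a Gram-Schmidt step that separates a visual base from an orthogonal increment, can be illustrated with a minimal NumPy sketch. This is not the SOP implementation: the function names `orthogonal_decompose` and `instruction_entropy`, the 8-dimensional random vectors, and the use of a softmax over raw scores are all assumptions made for illustration only.

```python
import numpy as np

def orthogonal_decompose(visual, modification, eps=1e-12):
    """One Gram-Schmidt step (hypothetical helper, not from the paper).

    Splits `modification` into a part parallel to `visual` and a part
    orthogonal to it.
    """
    base = visual / (np.linalg.norm(visual) + eps)    # (near-)unit visual direction
    parallel = np.dot(modification, base) * base      # component along the visual base
    increment = modification - parallel               # orthogonal modification increment
    return parallel, increment

def instruction_entropy(scores):
    """Shannon entropy (in nats) of a softmax over `scores` (illustrative only)."""
    shifted = scores - np.max(scores)                 # subtract max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return float(-np.sum(probs * np.log(probs + 1e-12)))

# Toy feature vectors; a real system would take these from image/text encoders.
rng = np.random.default_rng(0)
visual = rng.normal(size=8)
modification = rng.normal(size=8)

parallel, increment = orthogonal_decompose(visual, modification)

# Composing with only the orthogonal increment leaves the projection onto the
# visual direction unchanged, so the base component stays constant.
composed = visual + increment

uniform_entropy = instruction_entropy(np.zeros(4))                   # maximally ambiguous scores
peaked_entropy = instruction_entropy(np.array([10.0, 0.0, 0.0, 0.0]))  # one dominant score
```

In this sketch, a higher entropy stands in for a more ambiguous instruction, and adding only `increment` to `visual` changes the query without altering its component along the visual direction. How SOP actually computes its uncertainty scores, regularization, and projection subspace is defined in the body of the paper, not here.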