Abstract
Text-based person retrieval (TBPR) aims to retrieve images of a target person from large-scale image or video databases given a textual description. Benchmark quality is critical to accurately evaluating the cross-modal matching ability of TBPR models. However, we find that existing TBPR benchmarks share a common annotation problem: multiple images of persons with different identities often carry very similar or even identical textual descriptions. As a consequence, even when a TBPR model correctly retrieves the images corresponding to a given description, these matches may be erroneously counted as mismatches due to this annotation ambiguity. We argue that the root cause is that each person image is annotated in isolation, without reference to other similar images, which makes it difficult to write a distinctive description for each image. To address this problem, we propose an effective and efficient annotation refinement framework that improves the annotation quality of TBPR benchmarks and thereby mitigates annotation-induced mismatches. First, sets of images prone to mismatches are automatically identified using TBPR models. Then, a multimodal large language model (MLLM) processes these images jointly and generates a distinctive description for each image. Finally, the original descriptions are replaced with the generated ones. Extensive experiments on three popular TBPR benchmarks (CUHK-PEDES, RSTPReid, and ICFG-PEDES) validate the effectiveness of our method in improving annotation quality and demonstrate that the resulting more discriminative captions benefit mainstream TBPR models. The improved annotations for these benchmarks will be released publicly.