Abstract
Text-to-image person re-identification (T2I-ReID) aims to retrieve pedestrian images or videos that match a given textual description. However, most methods implicitly assume that training image-text pairs are correctly aligned, whereas in practice under-correlated and falsely correlated image-text pairs arise from coarse-grained text annotations and erroneous textual descriptions. To address this problem, we propose a T2I-ReID method based on noise identification and pseudo-text generation. Our method first extracts image and text features with the Contrastive Language-Image Pre-Training (CLIP) model, then employs a token fusion module that selects and fuses informative local token features into a token fusion embedding (TFE) for fine-grained representation. To identify noisy image-text pairs, we fit a two-component Gaussian mixture model (GMM) to the per-sample loss distributions computed from the predictions of the basic feature embedding (BFE) and the TFE. Finally, once noise identification stabilizes, we employ a multimodal large language model (MLLM) to generate pseudo-texts that replace the noisy texts, facilitating more reliable visual-semantic association and cross-modal alignment under noisy conditions. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate the effectiveness of our proposed model and its good compatibility with other baselines.
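The two-component GMM step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-sample losses have already been computed, uses scikit-learn's `GaussianMixture`, and the function name `split_clean_noisy` and the 0.5 posterior threshold are illustrative choices.

```python
# Hypothetical sketch: flag noisy image-text pairs by fitting a
# two-component Gaussian mixture to per-sample training losses.
# Clean pairs tend to concentrate in the low-loss component.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(losses, clean_prob_threshold=0.5):
    """Fit a 2-component GMM to per-sample losses and treat samples
    with high posterior probability under the low-mean component as
    correctly aligned (clean) pairs."""
    losses = np.asarray(losses, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    # The component with the smaller mean loss is the "clean" one.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    clean_prob = gmm.predict_proba(losses)[:, clean_component]
    return clean_prob > clean_prob_threshold, clean_prob

# Toy example: 80 low-loss (clean) and 20 high-loss (noisy) samples.
rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 80),
                         rng.normal(1.5, 0.20, 20)])
is_clean, _ = split_clean_noisy(losses)
```

Samples flagged as noisy by this split would then be candidates for pseudo-text replacement once the identification stabilizes across epochs.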