Abstract
Unsupervised visible-infrared person re-identification (USL-VI-ReID) plays a pivotal role in cross-modal computer vision applications for intelligent surveillance and public safety. However, the task remains hampered by large modality gaps and limited granularity in feature representations. In particular, channel augmentation (CA) is typically used only for data augmentation, and its potential as an independent input modality remains unexplored. To address these shortcomings, we present a multi-module collaborative USL-VI-ReID framework that explicitly treats CA as a separate input modality. The framework combines four complementary modules. The Person-ReID Adaptive Convolutional Block Attention Module (PA-CBAM) extracts discriminative features through a two-level attention mechanism that refines salient spatial and channel cues. The Varied Regional Alignment (VRA) module performs cross-modal regional alignment and leverages Multimodal Assisted Adversarial Learning (MAAL) to reinforce region-level correspondence. The Varied Regional Neighbor Learning (VRNL) module performs reliable neighborhood learning via multi-region association to stabilize pseudo-labels and capture local structure. Finally, the Uniform Merging (UM) module merges split clusters through alternating contrastive learning to improve cluster consistency. We evaluate the proposed method on SYSU-MM01 and RegDB. In the visible-to-infrared setting on RegDB, the approach achieves Rank-1 = 93.34%, mean Average Precision (mAP) = 87.55%, and mean Inverse Negative Penalty (mINP) = 76.08%. These results indicate that our method effectively reduces modality discrepancies and increases feature discriminability. It outperforms most existing unsupervised baselines and several supervised approaches, thereby advancing the practical applicability of USL-VI-ReID.
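To make "CA as a separate input modality" concrete, the following is a minimal PyTorch sketch of one common channel-augmentation strategy in VI-ReID: replicating a randomly chosen RGB channel across all three channels, and feeding the result as a third input stream alongside the visible and infrared images. The abstract does not specify which CA variant the paper uses, so the function name, the image size, and the augmentation itself are illustrative assumptions.

```python
# Sketch of channel augmentation (CA) used as its own input stream.
# ASSUMPTION: the random-channel-replication variant shown here is one
# common CA scheme in VI-ReID; the paper's exact CA is not given in the
# abstract.
import random
import torch

def channel_augment(img: torch.Tensor) -> torch.Tensor:
    """img: (3, H, W) visible image -> 3-channel view built from one channel."""
    c = random.randint(0, 2)                 # pick one of the R/G/B channels
    return img[c:c + 1].expand(3, -1, -1).clone()

# Treat the CA view as a third modality next to visible and infrared:
visible = torch.rand(3, 288, 144)            # hypothetical ReID input size
ca_view = channel_augment(visible)           # CA "modality" of the same image
batch = torch.stack([visible, ca_view])      # both streams go to the encoder
print(batch.shape)                           # torch.Size([2, 3, 288, 144])
```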
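Similarly, the "two-level attention mechanism" of PA-CBAM can be illustrated with a minimal sketch of a standard CBAM block (channel attention followed by spatial attention, per Woo et al., 2018). The PA-CBAM-specific adaptations are not described in the abstract, so all class names and hyperparameters below are assumptions based on the original CBAM design.

```python
# Minimal CBAM-style two-level attention block (channel, then spatial).
# ASSUMPTION: standard CBAM formulation; PA-CBAM's adaptations are not
# specified in the abstract.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Average- and max-pool over space, share one MLP, then gate channels.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool along the channel axis and learn a 2D spatial attention map.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAMBlock(nn.Module):
    """Channel attention followed by spatial attention, as in CBAM."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        return self.spatial_att(self.channel_att(x))

if __name__ == "__main__":
    feat = torch.randn(2, 256, 24, 8)   # e.g., a mid-level ReID feature map
    out = CBAMBlock(256)(feat)
    print(out.shape)                    # torch.Size([2, 256, 24, 8])
```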