Abstract
Gender recognition from pedestrian imagery is widely regarded as a quasi-solved problem, yet most existing approaches evaluate performance in a within-domain setting, i.e., when the test and training data, though disjoint, closely resemble each other. This work provides the first exhaustive cross-domain assessment of six state-of-the-art architectures (ALM, VAC, Rethinking, LML, YinYang-Net, and MAMBA) across three widely used benchmarks: PA-100K, PETA, and RAP. All train/test combinations between datasets were evaluated, yielding 54 comparable experiments. The results revealed a clear performance split: median in-domain F1 approached 90% for most models, while the average drop under domain shift reached up to 16.4 percentage points, with the most recent approaches degrading the most. The adaptive-masking ALM achieved an F1 above 80% in most transfer scenarios, particularly those involving high-resolution or pose-stable domains, highlighting the importance of strong inductive biases over architectural novelty alone. Furthermore, to characterize robustness quantitatively, we introduce the Unified Robustness Metric (URM), which aggregates the average cross-domain performance degradation into a single score. A qualitative saliency analysis corroborated the numerical findings by exposing over-confidence and contextual bias in misclassifications. Overall, this study suggests that the challenges of gender recognition are far more evident in cross-domain settings than under the commonly reported within-domain context. Finally, we formalize an open evaluation protocol that can serve as a baseline for future work of this kind.
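The abstract describes URM as aggregating average cross-domain performance degradation into a single score. As a minimal illustration of that idea (not the paper's exact URM formula, which is defined in the main text), the sketch below computes the mean F1 drop of each cross-domain transfer relative to the corresponding in-domain result; all F1 values here are hypothetical placeholders.

```python
# Hypothetical F1 scores (%) for one model: f1[train][test].
# Diagonal entries are in-domain; off-diagonal entries are cross-domain.
f1 = {
    "PA-100K": {"PA-100K": 91.0, "PETA": 78.5, "RAP": 74.2},
    "PETA":    {"PA-100K": 80.1, "PETA": 89.4, "RAP": 76.8},
    "RAP":     {"PA-100K": 75.9, "PETA": 77.3, "RAP": 90.2},
}

def avg_cross_domain_drop(f1):
    """Mean degradation (percentage points) of cross-domain F1
    relative to the in-domain F1 of the same training set."""
    drops = [
        f1[train][train] - f1[train][test]
        for train in f1
        for test in f1[train]
        if train != test
    ]
    return sum(drops) / len(drops)

print(round(avg_cross_domain_drop(f1), 2))  # mean drop over the 6 transfers
```

A lower score under such an aggregation indicates a more domain-robust model; the paper's actual URM may weight or normalize the per-pair degradations differently.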