Abstract
Human physiological signals collected through wearable devices enable a range of applications, including biometric authentication. Prior studies have demonstrated the potential of using physiological signals to uniquely identify individuals, but their validity in real-world scenarios remains limited. Most existing work relies on controlled experimental settings, small datasets, short-term evaluations, and the absence of unseen-user testing, factors that tend to produce overly optimistic performance estimates. Although recent research highlights the need for broader benchmarking and reproducible protocols, systematic evaluations remain scarce. In this study, we assess the reliability of photoplethysmography (PPG)-based biometric methods. We replicate two published approaches and introduce a feature-based method as a baseline, evaluating all three under multiple conditions. Our results show that while these methods perform well on laboratory datasets, their effectiveness declines substantially in real-world environments, where signal variability, larger user populations, and temporal separation between training and testing challenge current systems. To address these issues, we propose guidelines for the robust evaluation of PPG-based biometrics, emphasizing real-world and longitudinal datasets, temporal splits, unseen-user assessments, and transparent reporting. Although developed for PPG, these recommendations generalize to other physiological biometrics and aim to improve the reliability and reproducibility of future research.