Abstract
Blood oxygen saturation (SpO2) is an essential physiological parameter for evaluating a person's health. While conventional SpO2 measurement devices such as pulse oximeters require skin contact, computer vision can enable remote SpO2 monitoring with a regular camera and no skin contact. In this paper, we propose novel deep learning models that measure SpO2 remotely from facial videos and evaluate them on a public benchmark database, VIPL-HR. We use a spatial-temporal representation to encode the SpO2 information recorded by conventional RGB cameras and pass it directly into selected convolutional neural networks to predict SpO2. The best deep learning model achieves a mean absolute error of 1.274% and a root mean squared error of 1.71%, both within the 4% error limit that the international standard sets for an approved pulse oximeter. Our results significantly outperform the conventional analytical Ratio of Ratios model for contactless SpO2 measurement. We also report sensitivity analyses of how the color space of the spatial-temporal representation, subject scenarios, acquisition devices, and SpO2 ranges influence model performance, together with explainability analyses, to provide more insight into this emerging research field.
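
For readers unfamiliar with the two error metrics quoted above, the following minimal sketch shows how mean absolute error and root mean squared error are commonly computed from paired predicted and reference saturation readings, both in percentage points. The function names, the sample values, and the threshold constant are illustrative assumptions and are not taken from the paper's code or data.

```python
import math

# Assumed constant for illustration: the 4% error limit cited in the abstract.
ERROR_LIMIT_PERCENT = 4.0


def mean_absolute_error(predicted, reference):
    """Average of the absolute differences between paired readings."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def root_mean_squared_error(predicted, reference):
    """Square root of the average squared difference between paired readings."""
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


# Hypothetical readings in percent saturation (not real measurements).
reference_spo2 = [97.0, 98.0, 95.0, 99.0]
predicted_spo2 = [96.0, 98.5, 96.5, 98.0]

mae = mean_absolute_error(predicted_spo2, reference_spo2)
rmse = root_mean_squared_error(predicted_spo2, reference_spo2)

print(f"MAE:  {mae:.3f}%")
print(f"RMSE: {rmse:.3f}%")
print("Within limit:", rmse <= ERROR_LIMIT_PERCENT)
```

Because squaring weights large deviations more heavily, the root mean squared error is never smaller than the mean absolute error on the same data, which is consistent with the 1.71% and 1.274% figures reported above.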