Abstract
Face masking, face swapping, and face animation are downstream tasks that can benefit from face parsing, in which a face image is segmented into semantic regions. Due to the widespread use of cameras, obtaining facial images has become increasingly straightforward. However, pixel-level manual labeling is time-consuming and labor-intensive, which motivates approaches based on unlabeled data. In this paper, we propose a novel hybrid transfer learning-based approach for face parsing. First, patches are randomly masked in the central region of the face images. The method then proceeds in two stages: a pre-training stage and a fine-tuning stage. In the pre-training stage, the model learns to represent basic facial features from unlabeled data. In the fine-tuning stage, the model is then adapted to the face parsing task on a small labeled dataset. Experimental results on the test sets show that the model can significantly reduce labeling costs. Furthermore, the proposed method outperforms the baseline by 2.9%, 2.16%, and 1.18% in mIoU with 0.5%, 1%, and 10% labeled data, respectively, on the LaPa dataset. Moreover, experimental results on the CelebAMask-HQ test dataset reveal that the masked transfer learning-based approach significantly outperforms the baseline across varying amounts of labeled training data.
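The random masking of patches in the central face region described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation; the patch size, masking ratio, and central-crop fraction are assumed parameters.

```python
import numpy as np

def mask_central_patches(img, patch=16, ratio=0.5, central=0.6, seed=0):
    """Zero out `ratio` of the non-overlapping square patches that lie
    inside a central crop of the image (assumed parameters, for
    illustration only)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    # Size and offset of the central region.
    ch, cw = int(h * central), int(w * central)
    top, left = (h - ch) // 2, (w - cw) // 2
    out = img.copy()
    # Enumerate top-left corners of the patch grid inside the central region.
    coords = [(top + y, left + x)
              for y in range(0, ch - patch + 1, patch)
              for x in range(0, cw - patch + 1, patch)]
    # Randomly choose a subset of patches to mask.
    n_mask = int(len(coords) * ratio)
    for i in rng.permutation(len(coords))[:n_mask]:
        y, x = coords[i]
        out[y:y + patch, x:x + patch] = 0
    return out
```

In a masked pre-training setup, the model would be trained to reconstruct the original pixels in the zeroed patches, encouraging it to learn facial structure before fine-tuning on labeled parsing data.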