Abstract
In the field of talking face generation, two-stage audio-driven methods have attracted significant research interest. However, these methods still struggle to achieve lip-audio synchronization during face generation and suffer from discontinuities between the generated regions and the original face in rendered videos. To overcome these challenges, this paper proposes a two-stage talking face generation method. The first stage is landmark generation: a dynamic convolutional transformer generator is designed to capture complex facial movements, and a dual-pipeline parallel processing mechanism is adopted to strengthen the temporal correlation of input features and the modeling of fine details at the spatial scale. In the second stage, a dynamic Gaussian renderer is designed to join the upper and lower boundary regions seamlessly and naturally through a Gaussian blur masking technique. We conducted quantitative analyses on the LRS2, HDTF, and MEAD (neutral expression) datasets. Experimental results demonstrate that, compared with existing methods, our approach significantly improves the realism and lip-audio synchronization of talking face videos. In particular, on the LRS2 dataset, lip-audio synchronization improves by 18.16% and peak signal-to-noise ratio by 12.11% over state-of-the-art methods.
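In general terms, Gaussian blur masking of the kind mentioned above is a soft-mask compositing step: a binary mask over the generated region is blurred so that generated pixels fade smoothly into the original frame at the boundaries. The following is a minimal illustrative sketch of that general idea, not the paper's renderer; the function name, region coordinates, and kernel parameters are assumptions chosen for demonstration.

```python
import cv2
import numpy as np

def blend_with_gaussian_mask(original, generated, region, ksize=31, sigma=10.0):
    """Composite `generated` onto `original` via a Gaussian-blurred binary mask.

    original, generated: HxWx3 uint8 frames of the same size.
    region: (y0, y1, x0, x1) bounding box of the generated facial area (assumed).
    """
    h, w = original.shape[:2]
    mask = np.zeros((h, w), dtype=np.float32)
    y0, y1, x0, x1 = region
    mask[y0:y1, x0:x1] = 1.0
    # Blurring the hard mask softens its edges, so the generated region
    # fades into the original face instead of ending at a visible seam.
    soft = cv2.GaussianBlur(mask, (ksize, ksize), sigma)[..., None]
    out = soft * generated.astype(np.float32) + (1.0 - soft) * original.astype(np.float32)
    return out.astype(np.uint8)

# Example: blend a generated lower-face region into a 256x256 frame.
orig = np.zeros((256, 256, 3), dtype=np.uint8)
gen = np.full((256, 256, 3), 128, dtype=np.uint8)
frame = blend_with_gaussian_mask(orig, gen, region=(128, 240, 48, 208))
```

The key design choice is that the blur radius controls the width of the transition band: a larger kernel yields a gentler fade at the upper and lower boundaries at the cost of some detail near the seam.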