Abstract
Verifying video authenticity has become increasingly challenging with the rapid advance of video synthesis technologies. Existing detection approaches predominantly rely on intra-frame spatial artifacts or temporal inconsistencies in isolation, limiting their ability to fully exploit the spatio-temporal characteristics of manipulated videos. To address this limitation, we propose the Spatial and Temporal Feature Aggregation Network (STFANet), which uses a two-path structure to extract spatial and temporal features independently and then integrates them into high-fidelity spatio-temporal representations. We further incorporate a Vision Transformer module to capture global dependencies within the feature maps, strengthening the overall representation. Extensive experiments validate the efficacy of the proposed approach for detecting facial forgery in videos: on the FaceForensics++ and Celeb-DF benchmarks, our method achieves AUC scores of 0.9933 and 0.9829, respectively. We also analyze how feature aggregation at different network stages affects the resulting feature maps, showing marked improvements in the quality of the spatio-temporal representations.
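The two-path aggregation described above can be illustrated with a minimal toy sketch. This is not the STFANet implementation: the per-frame global average stands in for a spatial CNN branch, absolute frame differences stand in for a temporal branch, and fusion is a simple concatenation; all function names and shapes here are illustrative assumptions.

```python
import numpy as np

def spatial_branch(video):
    # Toy spatial descriptor: global average over each frame,
    # standing in for a 2D-CNN feature extractor.
    # video: (T, H, W, C) -> (T, C)
    return video.mean(axis=(1, 2))

def temporal_branch(video):
    # Toy temporal descriptor built from inter-frame differences,
    # standing in for a motion/temporal feature extractor.
    # video: (T, H, W, C) -> (T-1, C)
    diffs = np.abs(np.diff(video, axis=0))
    return diffs.mean(axis=(1, 2))

def aggregate(video):
    # Fuse the two streams into one spatio-temporal representation:
    # average each branch over time, then concatenate channel-wise.
    s = spatial_branch(video).mean(axis=0)   # (C,)
    t = temporal_branch(video).mean(axis=0)  # (C,)
    return np.concatenate([s, t])            # (2C,)

rng = np.random.default_rng(0)
clip = rng.random((8, 32, 32, 3))  # 8 frames of 32x32 RGB
features = aggregate(clip)
print(features.shape)  # (6,)
```

In the actual network the branches would be learned feature extractors and the fusion would produce feature maps rather than a single vector, but the structure, independent spatial and temporal paths followed by aggregation, is the same.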