Abstract
Deepfake detection faces growing challenges as the rapid progress of generative models produces an increasingly large and diverse range of Deepfake techniques. Recent detectors rely on heuristic features introduced from the spatial or frequency domain rather than modeling general forgery features within the backbone. To address this issue, we revisit backbone design guided by two intuitive priors drawn from spatial and frequency detectors, i.e., a backbone should learn robust spatial attributes and frequency distributions that discriminate between real and fake samples. To this end, we propose an efficient network for face forgery detection named MkfaNet, which consists of two core modules. For spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects organ features extracted by multiple convolutions to model subtle facial differences between real and fake faces. For frequency components, we propose a Multi-Frequency Aggregator that processes different frequency bands by adaptively reweighting high-frequency and low-frequency features. Comprehensive experiments on seven popular Deepfake detection benchmarks demonstrate that MkfaNet achieves an AUC of 0.9591 in within-domain evaluations and 0.7963 in cross-domain evaluations, outperforming several state-of-the-art methods while maintaining high computational efficiency. These results confirm that MkfaNet detects forgeries effectively and efficiently and offers enhanced robustness against diverse Deepfake manipulations. Our code is available at https://github.com/GGshawn/MkfaNet.
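To make the two aggregation ideas concrete, the following is a minimal, framework-free sketch on a 1-D signal. It is not the MkfaNet implementation: the function names, the box filters standing in for learned convolutions, the mean/residual split standing in for a frequency decomposition, and the fixed gate scores and gains (which a real network would presumably learn or predict from its input) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary gate scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def box_filter(signal, width):
    """Moving average of odd `width`; edge samples are repeated (clamped).

    Stand-in (assumption) for one convolution branch with a given kernel size.
    """
    half = width // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        window = [signal[min(max(j, 0), last)]
                  for j in range(i - half, i + half + 1)]
        out.append(sum(window) / width)
    return out


def multi_kernel_aggregate(signal, widths, gate_scores):
    """Mix branches with different kernel widths using softmax gate weights."""
    weights = softmax(gate_scores)
    branches = [box_filter(signal, w) for w in widths]
    return [sum(w * branch[i] for w, branch in zip(weights, branches))
            for i in range(len(signal))]


def multi_frequency_aggregate(signal, low_gain, high_gain):
    """Split into a low-frequency part (the mean) and a high-frequency
    residual, rescale each part, and recombine."""
    low = sum(signal) / len(signal)
    return [low_gain * low + high_gain * (x - low) for x in signal]


# Hypothetical usage: emphasise fine detail, then mix three kernel widths.
features = [1.0, 2.0, 4.0, 3.0, 5.0]
sharpened = multi_frequency_aggregate(features, low_gain=1.0, high_gain=1.5)
mixed = multi_kernel_aggregate(sharpened, widths=[1, 3, 5],
                               gate_scores=[0.2, 0.5, -0.1])
```

In this sketch, setting both gains to 1 leaves the signal unchanged, and a single width-1 branch reduces the multi-kernel mix to the identity, which gives a quick sanity check that each aggregator only reweights information already present in its input.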