Preprint
Article

This version is not peer-reviewed.

Frame Selection Strategies for Video Deepfake Detection: Benchmarking Accuracy and Runtime Trade-Offs

Submitted: 20 April 2026

Posted: 21 April 2026


Abstract
Deepfake detection from images and videos has evolved from artifact-specific convolutional baselines toward more generalizable, cross-dataset, and foundation-model-based approaches. The present work instead focuses on the efficiency and informativeness of frame selection itself, keeping the downstream detectors fixed. The study compares twelve frame-selection heuristics, ranging from simple baselines to landmark-aware strategies, across four pre-trained detectors: Self-Blended Images (SBI), Frequency-Enhanced Self-Blended Images (FSBI), Generative Convolutional Vision Transformer (GenConViT), and GenD. GenD achieved the strongest average detector-level performance, with a mean frame-mean AUC of 0.9464; its best single validated configuration reached an AUC of 0.9607 and a balanced accuracy of 0.9133. FSBI and SBI reached mean AUC values of 0.8953 and 0.8935, respectively. For SBI, the best validation configuration is landmark clustering with 32 selected frames, while GenD achieves the best AUC at the level of the selection strategy. Overall, the results demonstrate that inference-time frame selection is an important component of video-only deepfake detection under constrained inference budgets.
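To make the evaluated pipeline concrete, the sketch below shows the simplest heuristic in the family the abstract describes (uniform temporal sampling under a fixed frame budget) together with the mean aggregation implied by the "frame-mean AUC" metric, where per-frame detector scores are averaged into one video-level score. Function names and the mean-aggregation choice are illustrative assumptions, not taken from the paper.

```python
def uniform_frame_indices(n_frames: int, budget: int) -> list[int]:
    """Pick `budget` frame indices spread evenly across a video of `n_frames`.

    This is an assumed baseline strategy; the paper benchmarks twelve
    heuristics, including landmark-aware ones not sketched here.
    """
    if budget >= n_frames:          # budget covers the whole video
        return list(range(n_frames))
    if budget == 1:                 # degenerate case: take the middle frame
        return [n_frames // 2]
    step = (n_frames - 1) / (budget - 1)
    return [round(i * step) for i in range(budget)]


def video_score(frame_scores: list[float]) -> float:
    """Aggregate per-frame fake probabilities into one video-level score
    by taking the mean (the aggregation implied by "frame-mean AUC")."""
    return sum(frame_scores) / len(frame_scores)
```

For example, a 32-frame budget on a 300-frame clip selects indices roughly every 9.6 frames; the selected frames are scored by the fixed detector and averaged before computing AUC over videos.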
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

