Submitted: 05 December 2025
Posted: 08 December 2025
Abstract
Keywords:
1. Introduction
1.1. Contribution
- A lightweight temporal tampering detector based on frame-difference magnitude signals and a compact 1D-CNN architecture.
- A unified approach capable of detecting frame insertion, deletion, and duplication, and of performing multiclass classification.
- A large-scale custom tampered dataset derived from D2-City with realistic manipulation conditions.
- A detailed ablation study analyzing kernel size, model depth, and dropout effects across all tampering types.
- A comprehensive computational efficiency analysis, demonstrating near real-time inference on CPU.
- Cross-dataset generalization experiments highlighting domain-shift challenges and directions for improvement.
2. Materials and Methods

2.1. Steps Taken to Calculate Frame Difference
- Grayscale conversion reduces computational complexity by discarding the color channels and keeping only intensity. Here $I_t$ is the intensity (grayscale) image of frame $t$, computed from $R_t$, $G_t$, and $B_t$, the red, green, and blue channels of the original frame.
- The absolute difference between two consecutive frames $I_t$ and $I_{t-1}$ is calculated as:

  $$D_t(x, y) = \left| I_t(x, y) - I_{t-1}(x, y) \right|$$

  where $D_t$ is the pixel-wise difference between the current frame $I_t$ and the previous frame $I_{t-1}$.
- A global threshold $\tau = 30$ was applied to suppress low-level noise in the frame-difference maps. The value was selected heuristically and was found to remove insignificant fluctuations while preserving major temporal transitions:

  $$T(D_t)(x, y) = \begin{cases} 255, & D_t(x, y) > \tau \\ 0, & \text{otherwise} \end{cases}$$

  The result is a binary image with values of 0 (insignificant change) or 255 (significant change).
- The binary image $T(D_t)$ is then summed to obtain a scalar value representing the total change between the two frames:

  $$S_t = \sum_{x, y} T(D_t)(x, y)$$

  where $S_t$ is the sum of all pixel values in the thresholded difference image. Since each pixel value is either 0 or 255, $S_t$ indicates the amount of motion or scene change between the frames.
- Each scalar value $S_t$ is appended to the list of frame differences, which is used as input to the classification model.

The entire process results in a sequence of frame differences $\{S_1, S_2, \ldots, S_T\}$, where $T$ is the number of consecutive-frame pairs, one fewer than the number of frames in the video. This sequence is a time series representing the changes in the video over time. The following figures visualize the frame difference magnitudes of tampered and non-tampered videos.
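The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the grayscale weights are assumed to be the BT.601 luma weights (the text does not state which weights were used), and the threshold comparison is assumed to be strict ($>\tau$). NumPy is used in place of OpenCV only to keep the example self-contained.

```python
import numpy as np

# Assumption: BT.601 luma weights; the text does not specify the grayscale weights.
GRAY_WEIGHTS = np.array([0.299, 0.587, 0.114])
TAU = 30  # global threshold stated in the text


def to_gray(frame_rgb):
    """Convert an H x W x 3 RGB frame to an H x W intensity map I_t."""
    return frame_rgb.astype(np.float64) @ GRAY_WEIGHTS


def frame_difference_signal(frames_rgb, tau=TAU):
    """Return one scalar S_t per pair of consecutive frames."""
    signal = []
    prev = to_gray(frames_rgb[0])
    for frame in frames_rgb[1:]:
        cur = to_gray(frame)
        diff = np.abs(cur - prev)               # D_t: pixel-wise absolute difference
        binary = np.where(diff > tau, 255, 0)   # T(D_t): values in {0, 255}
        signal.append(int(binary.sum()))        # S_t: total thresholded change
        prev = cur
    return signal


# Tiny synthetic clip: three 4x4 frames; the third brightens a 2x2 patch.
frames = [np.zeros((4, 4, 3), dtype=np.uint8) for _ in range(3)]
frames[2][:2, :2, :] = 200
signal = frame_difference_signal(frames)
print(signal)
```

In this toy clip the first pair of frames is identical, so its value is zero, while the second pair changes four pixels and contributes 4 × 255. A clip of three frames yields two values, matching the one-value-per-frame-pair convention.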






2.2. Model Architecture

| Layer | Input / Window | Operation & Parameters | Output Computation | Output Value | Purpose / Intuition |
|---|---|---|---|---|---|
| Input: raw frame-difference magnitudes between consecutive frames (for simplicity, 13 frames only, hence 12 values) | [100000,120000,450000,460000,110000,90000,95000,105000,115000,95000,90000,85000] | — | — | — | The overall pixel-intensity change between consecutive grayscale frames |
| Normalization | Raw magnitudes from the row above | Min-max scaling with min = 85000 and max = 460000 | (100000-85000)/(460000-85000)=0.04, (120000-85000)/(460000-85000)=0.09, etc. | [0.04,0.09,0.96,0.963,0.06,0.01,0.03,0.07,0.10,0.03,0.01,0] | Normalized pixel-intensity change between consecutive grayscale frames; used as the model input |
| Conv1D-1, window 1 | [0.04,0.09,0.96,0.963,0.06] | Filter w=[0.2,-0.1,0.5,0.4,-0.2], Bias=0.05, Activation=ReLU | 0.04*0.2+0.09*(-0.1)+0.96*0.5+0.963*0.4+0.06*(-0.2)+0.05 | 0.9022 | Captures local temporal changes in pixel intensity between consecutive frames |
| MaxPooling1D-1 | [0.9022,0.5223,0.1443,0.2413,0.116,...] | Pool size=2 | Take max in windows of 2 | [0.9022,0.5223,0.2413,...] | Reduces temporal dimension, keeps salient features |
| Conv1D-2, window 1 (other windows are not listed in this table) | [0.9022,0.5223,0.2413,0.116,0.144] | Filter w=[0.3,-0.2,0.5,0.1,0.2], Bias=0.05, Activation=ReLU | 0.9022*0.3+0.5223*(-0.2)+0.2413*0.5+0.116*0.1+0.144*0.2+0.05 | 0.3773 | Captures higher-order temporal patterns formed by combinations of local intensity variations |
| MaxPooling1D-2 | [0.3773,...] | Pool size=2 | Max over windows | [0.3773,...] | Summarizes features |
| Flatten | [0.3773,...] | — | Concatenate all feature values | [0.3773,...] | Prepare for Dense layer |
| Dense | [0.3773,...] | Units=1, Weight=[0.2], Bias=0.05, Activation=ReLU | 0.3773*0.2+0.05 | 0.1255 → ReLU=0.1255 | Combine features into tampering evidence |
| Output Sigmoid | 0.1255 | Activation=Sigmoid | 1/(1+exp(-0.1255)) | 0.5313 | Probability of tampering (above the 0.5 threshold → tampered) |
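The arithmetic behind a single Conv1D output in the table can be checked with a short pure-Python sketch. The helper names below are hypothetical and this is not the authors' code; it only evaluates the first Conv1D-1 window with the filter and bias listed in the table, and shows the sigmoid used at the output.

```python
import math


def relu(x):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, x)


def conv1d_window(window, weights, bias):
    """One Conv1D output: dot product of window and filter, plus bias, then ReLU."""
    return relu(sum(x * w for x, w in zip(window, weights)) + bias)


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Conv1D-1, window 1: values, filter, and bias copied from the table.
window = [0.04, 0.09, 0.96, 0.963, 0.06]
weights = [0.2, -0.1, 0.5, 0.4, -0.2]
first_activation = conv1d_window(window, weights, bias=0.05)
print(round(first_activation, 4))  # 0.9022

# The sigmoid crosses 0.5 exactly at a score of zero.
print(sigmoid(0.0))  # 0.5
```

The large activation for this window comes almost entirely from the two spike values (0.96 and 0.963) aligning with the two largest positive filter weights, which is the intuition the table gives for the first convolution.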
2.3. Dataset and Creating Custom Dataset
2.4. Software and Hardware Environment
- Software: Python 3.12.10, OpenCV, NumPy, scikit-learn, TensorFlow 2.20.0, and Keras.
- Hardware and operating system: Intel® Core™ i7 processor, 8 GB RAM, Windows 10, no GPU. All training and inference were performed on CPU.
3. Results
3.1. Detecting Frame Deletion Tampering

3.2. Detecting Frame Insertion Tampering
3.3. Detecting Frame Duplication Tampering
3.4. Multiclass Classification
3.5. Cross-Dataset Experiments
3.6. Ablation Study
3.6.1. Depth (Number of Conv1D Blocks) is the Dominant Factor Affecting Accuracy
3.6.2. Kernel Sizes 5 and 7 Outperform Kernel Size 3 Across All Datasets
- Frame deletion: kernel = 7 gives the strongest performance (test accuracy and F1 of 0.9903 ± 0.0017).
- Frame insertion: kernel = 5 yields the highest stability and accuracy (test accuracy and F1 of 0.9927 ± 0.0016).
- Frame duplication: kernel = 5 achieves the best F1 (0.9577 ± 0.0048).
- Multiclass: kernel = 5 achieves the most stable and accurate results (test F1 of 0.9824 ± 0.0010).
3.6.3. Dropout Moderately Influences Generalization, but no Extreme Sensitivity is Observed
- A dropout rate of 0.5 is more robust for deletion and multiclass detection.
- A dropout rate of 0.3 gives marginally better stability on insertion and duplication.
3.6.4. Efficiency is Stable Across Architectural Variations
3.6.5. Best Configurations Per Task
3.7. Efficiency Analysis
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| CNN | Convolutional Neural Network |
| 1D CNN | One-Dimensional Convolutional Neural Network |
| IoT | Internet of Things |
| ROC | Receiver Operating Characteristic |
| AUC | Area Under the Curve |
| FPS | Frames Per Second |
| PFOV | Pseudo Flow Orientation Variation |
| LBP | Local Binary Pattern |
| SIFT | Scale-Invariant Feature Transform |
| RANSAC | Random Sample Consensus |
| KPCA | Kernel Principal Component Analysis |
| GRU | Gated Recurrent Unit |
| LSTM | Long Short-Term Memory |
| DCT | Discrete Cosine Transform |
| IRB | Institutional Review Board |















| Task | Kernel | Blocks | Dropout | Runs | Test Accuracy (mean ± std) | Test F1 (mean ± std) | Inference Time (s, mean) |
|---|---|---|---|---|---|---|---|
| deletion | 7 | 2 | 0.5 | 3 | 0.9903 ± 0.0017 | 0.9903 ± 0.0017 | 0.0528 |
| duplication | 5 | 2 | 0.3 | 3 | 0.9581 ± 0.0048 | 0.9577 ± 0.0048 | 0.0703 |
| insertion | 5 | 2 | 0.3 | 3 | 0.9927 ± 0.0016 | 0.9927 ± 0.0016 | 0.0522 |
| multiclass | 5 | 2 | 0.5 | 3 | 0.9824 ± 0.0009 | 0.9824 ± 0.0010 | 0.0537 |

| Experiment | Test Videos | Avg. Preprocessing Time (s) | Avg. Inference Time (s ± SD) | FPS | Avg. Memory Overhead (MB) | Peak Memory Increase (MB) |
|---|---|---|---|---|---|---|
| Frame Deletion | 342 | 1.1651 | 0.0557 ± 0.0026 | 17.96 | 0.0161 | 0.4492 |
| Frame Insertion | 366 | 1.3652 | 0.0554 ± 0.0032 | 18.06 | 0.0129 | 0.4883 |
| Frame Duplication | 318 | 1.3499 | 0.0553 ± 0.0030 | 18.09 | 0.0188 | 1.8867 |
| Multiclass Classification | 1080 | 1.3588 | 0.0550 ± 0.0043 | 18.17 | 0.0882 | 0.3867 |
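The FPS column in the table above is close to the reciprocal of the mean inference time. That reading is an assumption, since the text does not define how FPS was computed. The short check below uses only values copied from the table, and the helper name is hypothetical.

```python
# Mean inference time (s) and reported FPS, copied from the efficiency table.
reported = {
    "Frame Deletion": (0.0557, 17.96),
    "Frame Insertion": (0.0554, 18.06),
    "Frame Duplication": (0.0553, 18.09),
    "Multiclass Classification": (0.0550, 18.17),
}


def throughput(seconds):
    """Items processed per second for a given per-item time (assumed definition)."""
    return 1.0 / seconds


for name, (mean_time, fps) in reported.items():
    print(f"{name}: 1 / {mean_time} = {throughput(mean_time):.2f} (reported {fps})")
```

The small residual differences are consistent with the reported FPS having been computed from unrounded inference times. Note that the preprocessing time in the same table (about 1.2 to 1.4 s) is not included in this figure.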
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).