Submitted: 09 September 2024
Posted: 10 September 2024
Abstract
Keywords:
1. Introduction
2. Background
2.1. Deepfake Generation
2.2. Time Domain and Frequency Domain
2.3. Literature Review
| Paper Title | Main Idea | Model | Language | Fakeness Type | Dataset |
|---|---|---|---|---|---|
| Arabic Audio Clips: Identification and Discrimination of Authentic Cantillations from Imitations | 1- Compare classifiers’ performance. 2- Compare recognition with human experts. 3- Authentic reciter identification. | 1- Classic 2- Deep learning | Arabic | Imitation | Arabic Diversified Audio |
| A Compressed Synthetic Speech Detection Method with Compression Feature Embedding | 1- Detect synthetic speech in compressed formats. 2- Multi-branch residual network. 3- Evaluate method on ASVspoof. 4- Compare with state-of-the-art. | Deep learning (DNN) | Not specified | Synthetic | ASVspoof |
| Learning Efficient Representations for Fake Speech Detection | 1- Develop efficient fake speech detection models. 2- Ensure accuracy. 3- Adapt to new forms and sources of fake speech. | 1- Machine learning (CNN, RNN, SVM, KNN) 2- Deep learning (CNNs, RNNs) | English | Imitation, Synthetic | ASVSpoof, RTVCSpoof |
| WaveFake: A Data Set for Audio Deepfake Detection | 1- Create a dataset of fake audio. 2- Provide baseline models for detection. 3- Test generalization on data from different techniques. 4- Test on real-life scenarios. | Fake audio synthesizing: MelGAN, Parallel WaveGAN, Multi-band MelGAN, Full-band MelGAN, HiFi-GAN, WaveGlow Baseline detection models: Gaussian Mixture Model (GMM), RawNet2 | English, Japanese | Synthetic | WaveFake |
| Attack Agnostic Dataset: Generalization and Stabilization of Audio Deepfake Detection | 1- Analyze and compare model generalization. 2- Combine multiple datasets. 3- Train on inaudible features. 4- Monitor training for stability. | LCNN, XceptionNet, MesoInception, RawNet2, and GMM | English | Synthetic, Imitation | WaveFake, FakeAVCeleb, ASVspoof |
| Exposing AI-Synthesized Human Voices Using Neural Vocoder Artifacts | 1- Utilize the identification of vocoders’ artifacts in audio to detect deepfakes. 2- Create a dataset of fake audio using six vocoders. | Detection model: RawNet2 Dataset construction: WaveNet, WaveRNN, Mel-GAN, Parallel WaveGAN, WaveGrad, DiffWave | English | Synthetic | LibriVoc, WaveFake, and ASVspoof2019 |
3. Approach
3.1. Datasets
- ASVspoof2019. ASVspoof (Automatic Speaker Verification Spoofing and Countermeasures) is an international challenge focused on spoofing detection in automatic speaker verification systems [4]. The ASVspoof2019 dataset contains three sub-datasets: logical access, physical access, and speech deepfake, each created with techniques suited to its task, such as text-to-speech (TTS) and voice conversion (VC) algorithms [4]. We use the training subset of the logical access dataset in our experiments.
- In the Wild is a dataset of fake and real audio from 58 celebrities and politicians, sourced from publicly available media. It was designed to evaluate deepfake detection models, particularly their ability to generalize to real-world data [5].
- FakeAVCeleb contains both deepfake and real audio and videos of celebrities. The fake audio was created using a text-to-speech service followed by manipulation with a voice cloning tool to mimic the celebrity’s voice [6]. We extracted the audio as WAV files from each video.
- The Fake-or-Real (FoR) dataset comprises 111,000 files of real speech and 87,000 files of fake speech, in both MP3 and WAV formats, and is offered in four distinct versions. The 'for-original' files are in their original state; 'for-norm' is the normalized version; 'for-2sec' is shortened to 2 seconds; and 'for-rerec' simulates re-recorded data, depicting a deepfake delivered over a phone call [13].
3.2. Data Preprocessing
- First, any silence in the audio is trimmed so that the model can focus on the speech portions of the signal.
- Second, all audio clips are standardized to a length of 4 seconds, either by trimming or by padding with a repetition of the clip itself; repeating the audio rather than padding with silence avoids introducing silence that could bias the model's learning.
- Third, the audio is downsampled to a rate of 16 kHz. Downsampling reduces the data size and computational load without significantly compromising the audio signal's quality or essential characteristics.
- Lastly, the audio is converted to a mono channel by channel averaging: stereo audio, which uses two channels (left and right), is averaged into a single channel, yielding a mono signal.
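The four steps above can be sketched as a single function. The amplitude threshold for the silence gate and the integer-factor decimation are illustrative simplifications, not the paper's exact implementation:

```python
import numpy as np

def preprocess(audio: np.ndarray, sr: int, target_sr: int = 16_000,
               target_len_s: float = 4.0, silence_thresh: float = 1e-3) -> np.ndarray:
    """Minimal sketch of the preprocessing pipeline described above."""
    # Mono conversion first so the silence gate sees a single channel
    # (the paper lists this step last; the result is equivalent).
    if audio.ndim == 2:
        audio = audio.mean(axis=0)
    # Trim leading/trailing silence with a simple amplitude gate.
    voiced = np.flatnonzero(np.abs(audio) > silence_thresh)
    if voiced.size:
        audio = audio[voiced[0]:voiced[-1] + 1]
    # Downsample to 16 kHz (naive decimation; a proper resampler such
    # as scipy.signal.resample_poly is preferable in practice).
    if sr > target_sr and sr % target_sr == 0:
        audio = audio[::sr // target_sr]
        sr = target_sr
    # Standardize length to 4 s: pad by repeating the clip, then cut.
    n = int(target_len_s * sr)
    reps = -(-n // max(len(audio), 1))  # ceiling division
    return np.tile(audio, reps)[:n]
```

In practice a library such as librosa handles the trimming and resampling with proper filtering; the sketch only illustrates the order and intent of the steps.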
3.3. Feature Extraction
3.3.1. Short-Time Power Spectrum
3.3.2. Constant-Q Transform
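The defining property of the constant-Q transform is that its filter-bank centre frequencies are geometrically spaced, so the ratio of centre frequency to bandwidth (the Q factor) is constant across bins. A minimal sketch, using common default parameter values rather than the paper's settings:

```python
import numpy as np

def cqt_bins(f_min: float = 32.70, bins_per_octave: int = 12, n_bins: int = 84):
    """Centre frequencies and constant Q factor of a CQT filter bank.

    f_min, bins_per_octave and n_bins here are common defaults
    (C1, semitone resolution, 7 octaves), not the paper's choices.
    """
    k = np.arange(n_bins)
    # Geometric spacing: each bin is 2^(1/b) times the previous one.
    freqs = f_min * 2.0 ** (k / bins_per_octave)
    # Q = f_k / bandwidth_k, identical for every bin by construction.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return freqs, q
```

Because the bins are geometrically spaced, the CQT offers finer frequency resolution at low frequencies and finer time resolution at high frequencies, unlike the uniform grid of a short-time Fourier transform.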
3.4. Sonic Sleuth Structure and Training Details
- Early Stopping: An early stopping technique was implemented with a patience of 10 epochs to prevent overfitting. This means the training process would be stopped if the validation loss did not improve for 10 consecutive epochs.
- Optimizer: The Adam optimizer was used to update the model’s weights during training.
- Loss Function: Binary cross-entropy loss was used as the loss function, which is suitable for binary classification problems.
- Class Weights: Class weights were calculated and applied during the training process to handle the class imbalance in the dataset. Class weights adjust the importance of each class during the loss calculation, helping the model learn from imbalanced datasets more effectively while avoiding bias.
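The two less standard pieces of this setup, early stopping and class weighting, can be sketched framework-independently. The inverse-frequency weighting shown is the "balanced" scheme used by scikit-learn; the paper does not state its exact formula, so that part is an assumption:

```python
import numpy as np

class EarlyStopping:
    """Stop when validation loss has not improved for `patience`
    consecutive epochs (patience = 10 in the training setup above)."""
    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def class_weights(labels: np.ndarray) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * count_c).
    Rarer classes get larger weights in the loss calculation."""
    classes, counts = np.unique(labels, return_counts=True)
    n = labels.size
    return {int(c): n / (classes.size * k) for c, k in zip(classes, counts)}
```

In a typical loop, `EarlyStopping.step` is called once per epoch after validation, and the weight for each sample's class scales its contribution to the binary cross-entropy loss.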
3.5. Evaluation Metrics
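The equal error rate (EER) reported in the results tables is the operating point at which the false-acceptance rate equals the false-rejection rate. A minimal sketch, assuming the convention that label 1 means fake and higher scores mean "more likely fake":

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER over candidate thresholds; labels: 1 = fake, 0 = real."""
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        pred = scores >= t
        fars.append(np.mean(pred[labels == 0]))    # real flagged as fake
        frrs.append(np.mean(~pred[labels == 1]))   # fake missed
    fars, frrs = np.array(fars), np.array(frrs)
    # Take the threshold where the two rates are closest and average.
    i = np.argmin(np.abs(fars - frrs))
    return float((fars[i] + frrs[i]) / 2.0)
```

A perfectly separable score distribution yields an EER of 0; accuracy and F1 are computed at a fixed threshold, whereas the EER summarizes the score distribution across thresholds.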
4. Results and Discussion
4.1. Test on External Dataset
4.2. Ensemble Approach
5. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
References
- Camastra, F.; Vinciarelli, A. Machine Learning for Audio, Image and Video Analysis: Theory and Applications; Springer: London, UK, 2015. [Google Scholar]
- Tenoudji, F.C. Analog and Digital Signal Analysis: From Basics to Applications; Springer International Publishing: Cham, Switzerland, 2018. [Google Scholar]
- Natsiou, A.; O’Leary, S. Audio Representations for Deep Learning in Sound Synthesis: A Review. arXiv, 2022, cs.SD. Available online: https://arxiv.org/abs/2201.02490 (accessed on Day Month Year).
- Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.; Juvela, L.; Alku, P.; Peng, Y.-H.; Hwang, H.-T.; Tsao, Y.; Wang, H.-M.; Le Maguer, S.; Becker, M.; Henderson, F.; Clark, R.; Zhang, Y.; Jia, Q.; Onuma, K.; Mushika, K.; Kaneda, T.; Jiang, Y.; Liu, L.-J.; Wu, Y.-C.; Huang, W.-C.; Toda, T.; Tanaka, K.; Kameoka, H.; Steiner, I.; Matrouf, D.; Bonastre, J.-F.; Govender, A.; Ronanki, S.; Zhang, J.-X.; Ling, Z.-H. ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech. arXiv, 2020, eess.AS. Available online: https://arxiv.org/abs/1911.01601 (accessed on Day Month Year).
- Müller, N.M.; Czempin, P.; Dieckmann, F.; Froghyar, A.; Böttinger, K. Does Audio Deepfake Detection Generalize? Interspeech, 2022.
- Khalid, H.; Tariq, S.; Kim, M.; Woo, S.S. FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. Available online: https://openreview.net/forum?id=TAXFsg6ZaOl (accessed on Day Month Year).
- Marcus, G. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. CoRR, 2020, abs/2002.06177. Available online: https://arxiv.org/abs/2002.06177 (accessed on Day Month Year).
- Frank, J.; Schönherr, L. WaveFake: A Data Set to Facilitate Audio Deepfake Detection. CoRR, 2021, abs/2111.02813. Available online: https://arxiv.org/abs/2111.02813 (accessed on Day Month Year).
- Kawa, P.; Plata, M.; Syga, P. Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection. In Proceedings of Interspeech 2022, ISCA, September 2022. DOI: 10.21437/interspeech.2022-10078.
- Sun, C.; Jia, S.; Hou, S.; AlBadawy, E.; Lyu, S. Exposing AI-Synthesized Human Voices Using Neural Vocoder Artifacts. arXiv, 2023, cs.SD. Available online: https://arxiv.org/abs/2302.09198 (accessed on Day Month Year).
- Zhang, C.; Zhang, C.; Zheng, S.; Zhang, M.; Qamar, M.; Bae, S.-H.; Kweon, I.S. A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI. arXiv, 2023, cs.SD. Available online: https://arxiv.org/abs/2303.13336 (accessed on Day Month Year).
- Gu, Y.; Chen, Q.; Liu, K.; Xie, L.; Kang, C. GAN-based Model for Residential Load Generation Considering Typical Consumption Patterns. In Proceedings of ISGT 2019, IEEE, November 2018. DOI: 10.1109/ISGT.2019.8791575.
- Abdeldayem, M. The Fake-or-Real Dataset. Kaggle Dataset, 2022. Available online: https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset (accessed on Day Month Year).
- Sahidullah, M.; Kinnunen, T.; Hanilçi, C. A Comparison of Features for Synthetic Speech Detection. In Proceedings of Interspeech 2015; 2015. Available online: https://doi.org/10.21437/Interspeech.2015-472 (accessed on Day Month Year).
- Frank, J.; Schönherr, L. WaveFake: A Data Set to Facilitate Audio Deepfake Detection. CoRR; 2021, Volume abs/2111.02813. Available online: https://arxiv.org/abs/2111.02813 (accessed on Day Month Year).
- Zheng, F.; Zhang, G. Integrating the energy information into MFCC. Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000); 2000, Volume 1, pp. 389-392. Available online: https://doi.org/10.21437/ICSLP.2000-96 (accessed on Day Month Year).
- Todisco, M.; Delgado, H.; Evans, N. Constant Q Cepstral Coefficients: A Spoofing Countermeasure for Automatic Speaker Verification. Computer Speech & Language 2017, 45, 516–535. Available online: https://doi.org/10.1016/j.csl.2017.01.001 (accessed on 14 February 2020).
- Deepfake (a portmanteau of "deep learning" and "fake"): images, videos, or audio edited or generated using artificial intelligence tools. Wikipedia, 2023. Available online: https://en.wikipedia.org/wiki/Deepfake (accessed on Day Month Year).



| Name | Size (real) | Size (fake) | Avg. Length | Sample Rate | File Format | URL |
|---|---|---|---|---|---|---|
| ASVspoof2019 | 2,580 files | 22,800 files | 3 s | - | FLAC | Link |
| In-the-Wild | 19,963 files | 11,816 files | 4 s | 16 kHz | WAV | Link |
| FakeAVCeleb | 10,209 files | 11,357 files | 5 s | - | MP4 | Link |
| The Fake-or-Real | 111,000 files | 87,000 files | 1–20 s | - | WAV / MP3 | Link |
| Feature | EER | Accuracy | F1 |
|---|---|---|---|
| CQT | 0.0757 | 94.15% | 95.76% |
| LFCC | 0.0160 | 98.27% | 98.65% |
| MFCC | 0.0185 | 98.04% | 98.45% |
| Feature | EER | Accuracy | F1 |
|---|---|---|---|
| CQT | 0.0942 | 82.51% | 83.19% |
| LFCC | 0.1718 | 75.28% | 72.63% |
| MFCC | 0.3165 | 61.24% | 54.94% |
| Feature | EER | Accuracy | F1 |
|---|---|---|---|
| CQT and LFCC | 0.0851 | 84.92% | 84.73% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).