Submitted: 02 June 2026
Posted: 03 June 2026
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Audio Deepfake Detection and Vocoder Fingerprints
2.2. Detector Representations for Synthetic Speech Detection
2.3. Adversarial Attacks and Transferability
3. Methodology
3.1. Overview
3.2. Problem Formulation
3.3. Universal Perturbation Setting
3.4. Perturbation Domains
3.4.1. Waveform-Domain Perturbation
3.4.2. STFT-Magnitude-Domain Perturbation
3.5. Attack Algorithms
3.6. Transferability Evaluation
3.7. Audio Quality and Intelligibility Metrics
4. Experimental Setup
4.1. Dataset and Vocoders
4.2. Detector Configuration
4.3. Attack Configuration
| Parameter | Waveform domain | STFT-magnitude domain |
|---|---|---|
| Sampling rate | Hz | Hz |
| Chunk length | 4 s / samples | 4 s / samples |
| STFT FFT size | – | 1024 |
| STFT hop length | – | 256 |
| STFT window length | – | 1024 |
| Perturbation budget | / | / |
| BIM step size | ||
| PGD step size | ||
| BIM/PGD steps | 5000 | 5000 |
| Batch size | 8 | 8 |
| CW steps | 5000 | 5000 |
| CW learning rate | ||
| CW confidence margin | ||
| CW loss weight | | |
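The STFT settings in the table fix the shape of the magnitude spectrogram for one chunk, and therefore the shape of a perturbation defined on it. The sketch below computes that shape in plain Python. It rests on two assumptions that the source does not confirm: the sampling rate is a placeholder (the value in the table was lost in extraction), and frames are counted with the centered-padding convention common in audio toolkits. The names `stft_shape` and `HYPOTHETICAL_SAMPLE_RATE` are illustrative only.

```python
def stft_shape(num_samples, n_fft=1024, hop_length=256, center=True):
    """Return (frequency_bins, frames) of a one-sided STFT.

    n_fft and hop_length default to the values listed in the table.
    center=True assumes the signal is padded by n_fft // 2 on both sides,
    which is an assumption about the implementation, not a reported detail.
    """
    freq_bins = n_fft // 2 + 1
    if center:
        frames = 1 + num_samples // hop_length
    else:
        frames = 1 + (num_samples - n_fft) // hop_length
    return freq_bins, frames


# Placeholder only: the sampling rate in the table did not survive extraction.
HYPOTHETICAL_SAMPLE_RATE = 16_000
CHUNK_SECONDS = 4  # chunk length from the table

chunk_samples = CHUNK_SECONDS * HYPOTHETICAL_SAMPLE_RATE
shape = stft_shape(chunk_samples)
print(shape)
```

Under these assumptions, a universal STFT-magnitude perturbation would have one value per frequency bin and frame, whereas a waveform-domain perturbation would have one value per sample of the chunk.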
4.4. Evaluation Protocol
5. Results
5.1. Frequency-Domain Attack Results
5.2. Waveform-Domain Attack Results
5.3. Cross-Domain Comparison
5.4. Black-Box Transferability Across Detectors
5.5. Audio Quality and Intelligibility
6. Discussion
6.1. Perturbation Domain and Detector Representation
6.2. Source-Model Success Does Not Guarantee Transferability
6.3. Quality-Aware Interpretation of Attack Strength
7. Conclusion
| Detector | Vocoder | Clean Acc. | PGD | FGSM | BIM | CW |
|---|---|---|---|---|---|---|
| ResNet+LFCC | HiFi-GAN | 99.3 | 99.99 | 6.00 | 2.00 | 6.00 |
| ResNet+LFCC | Fullband MelGAN | 96.8 | 99.99 | 3.00 | 4.00 | 6.00 |
| ResNet+LFCC | StyleMelGAN | 98.7 | 100.00 | 4.00 | 2.00 | 0.00 |
| ResNet+LFCC | Parallel WaveGAN | 100.0 | 99.99 | 0.00 | 0.00 | 0.00 |
| LCNN+CQCC | HiFi-GAN | 96.0 | 0.00 | 7.00 | 8.00 | 6.00 |
| LCNN+CQCC | Fullband MelGAN | 95.0 | 0.00 | 27.00 | 15.00 | 10.00 |
| LCNN+CQCC | StyleMelGAN | 100.0 | 0.00 | 2.00 | 2.00 | 1.00 |
| LCNN+CQCC | Parallel WaveGAN | 100.0 | 15.00 | 0.00 | 0.00 | 0.00 |
| BiLSTM | HiFi-GAN | 99.79 | 0.00 | 1.91 | 0.74 | 1.38 |
| BiLSTM | Fullband MelGAN | 99.47 | 0.00 | 4.03 | 1.70 | 1.17 |
| BiLSTM | StyleMelGAN | 100.0 | 0.00 | 0.11 | 0.42 | 0.00 |
| BiLSTM | Parallel WaveGAN | 100.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| AASIST | HiFi-GAN | 94.0 | 34.00 | 6.69 | 6.37 | 6.37 |
| AASIST | Fullband MelGAN | 93.0 | 45.00 | 8.60 | 8.07 | 8.28 |
| AASIST | StyleMelGAN | 97.0 | 26.00 | 3.40 | 3.08 | 3.29 |
| AASIST | Parallel WaveGAN | 94.0 | 53.00 | 6.79 | 6.37 | 6.69 |
| Detector | Vocoder | Clean Acc. | PGD | FGSM | BIM | CW |
|---|---|---|---|---|---|---|
| ResNet+LFCC | HiFi-GAN | 99.3 | 100.0 | 100.0 | 100.0 | 100.0 |
| ResNet+LFCC | Fullband MelGAN | 96.8 | 92.0 | 100.0 | 100.0 | 100.0 |
| ResNet+LFCC | StyleMelGAN | 98.7 | 100.0 | 1.0 | 100.0 | 100.0 |
| ResNet+LFCC | Parallel WaveGAN | 100.0 | 100.0 | 100.0 | 95.0 | 100.0 |
| LCNN+CQCC | HiFi-GAN | 96.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| LCNN+CQCC | Fullband MelGAN | 95.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| LCNN+CQCC | StyleMelGAN | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| LCNN+CQCC | Parallel WaveGAN | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| BiLSTM | HiFi-GAN | 99.79 | 99.79 | 100.0 | 100.0 | 100.0 |
| BiLSTM | Fullband MelGAN | 99.47 | 9.45 | 0.42 | 1.06 | 99.79 |
| BiLSTM | StyleMelGAN | 100.0 | 32.59 | 98.62 | 96.92 | 100.0 |
| BiLSTM | Parallel WaveGAN | 100.0 | 0.32 | 100.0 | 100.0 | 9.87 |
| AASIST | HiFi-GAN | 94.0 | 37.0 | 0.0 | 100.0 | 85.35 |
| AASIST | Fullband MelGAN | 93.0 | 39.0 | 0.0 | 100.0 | 89.60 |
| AASIST | StyleMelGAN | 97.0 | 24.0 | 0.0 | 100.0 | 71.66 |
| AASIST | Parallel WaveGAN | 94.0 | 38.0 | 0.0 | 100.0 | 87.69 |
| Domain | Detector | PGD | FGSM | BIM | CW |
|---|---|---|---|---|---|
| STFT-mag | ResNet+LFCC | 99.99 | 3.25 | 2.00 | 3.00 |
| STFT-mag | LCNN+CQCC | 3.75 | 9.00 | 6.25 | 4.25 |
| STFT-mag | BiLSTM | 0.00 | 1.51 | 0.72 | 0.64 |
| STFT-mag | AASIST | 39.50 | 6.37 | 5.97 | 6.16 |
| Waveform | ResNet+LFCC | 98.00 | 75.25 | 98.75 | 100.00 |
| Waveform | LCNN+CQCC | 100.00 | 100.00 | 100.00 | 100.00 |
| Waveform | BiLSTM | 35.54 | 74.76 | 74.50 | 77.42 |
| Waveform | AASIST | 34.50 | 0.00 | 100.00 | 83.58 |
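The per-detector entries in this cross-domain table appear to be unweighted means over the four vocoders of the corresponding per-vocoder results. The Python sketch below checks this for the waveform-domain BiLSTM rows. The averaging rule is inferred from the reported numbers, not stated in the extracted text, and the variable names are illustrative.

```python
from statistics import fmean

# Waveform-domain values (%) for the BiLSTM detector, copied from the
# per-vocoder table. Order: HiFi-GAN, Fullband MelGAN, StyleMelGAN,
# Parallel WaveGAN.
bilstm_waveform = {
    "PGD":  [99.79, 9.45, 32.59, 0.32],
    "FGSM": [100.0, 0.42, 98.62, 100.0],
    "BIM":  [100.0, 1.06, 96.92, 100.0],
    "CW":   [100.0, 99.79, 100.0, 9.87],
}

# Values reported for BiLSTM in the waveform rows of the cross-domain table.
reported = {"PGD": 35.54, "FGSM": 74.76, "BIM": 74.50, "CW": 77.42}

# Assumed aggregation: plain arithmetic mean over the four vocoders.
averages = {attack: fmean(vals) for attack, vals in bilstm_waveform.items()}

for attack, avg in averages.items():
    # Differences stay within the two-decimal rounding of the table.
    print(f"{attack}: mean={avg:.4f} reported={reported[attack]:.2f}")
```

The same check reproduces the other detector rows in both domains to within two-decimal rounding, which suggests that each vocoder contributes equally to the aggregate.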
| Domain | Attack | AASIST ASR | ResNet+LFCC | LCNN+CQCC | BiLSTM | Avg. Black-box ASR |
|---|---|---|---|---|---|---|
| STFT-mag | PGD | 39.50 | 99.99 | 3.75 | 0.00 | 34.58 |
| STFT-mag | FGSM | 6.37 | 3.25 | 9.00 | 1.51 | 4.59 |
| STFT-mag | BIM | 5.97 | 2.00 | 6.25 | 0.72 | 2.99 |
| STFT-mag | CW | 6.16 | 3.00 | 4.25 | 0.64 | 2.63 |
| Waveform | PGD | 34.50 | 98.00 | 100.00 | 35.54 | 77.85 |
| Waveform | FGSM | 0.00 | 75.25 | 100.00 | 74.76 | 83.34 |
| Waveform | BIM | 100.00 | 98.75 | 100.00 | 74.50 | 91.08 |
| Waveform | CW | 83.58 | 100.00 | 100.00 | 77.42 | 92.47 |
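The last column of this table is consistent with the arithmetic mean of the three non-AASIST detector columns. This suggests that AASIST is treated as the source model and the other detectors as black-box targets, although that reading is inferred from the column labels rather than confirmed by the extracted text. A minimal Python sketch of the computation, with illustrative names:

```python
from statistics import fmean

# Rows copied from the transferability table: (domain, attack) -> ASR (%).
rows = {
    ("STFT-mag", "PGD"):  {"AASIST": 39.50, "ResNet+LFCC": 99.99, "LCNN+CQCC": 3.75, "BiLSTM": 0.00},
    ("STFT-mag", "FGSM"): {"AASIST": 6.37, "ResNet+LFCC": 3.25, "LCNN+CQCC": 9.00, "BiLSTM": 1.51},
    ("STFT-mag", "BIM"):  {"AASIST": 5.97, "ResNet+LFCC": 2.00, "LCNN+CQCC": 6.25, "BiLSTM": 0.72},
    ("STFT-mag", "CW"):   {"AASIST": 6.16, "ResNet+LFCC": 3.00, "LCNN+CQCC": 4.25, "BiLSTM": 0.64},
    ("Waveform", "PGD"):  {"AASIST": 34.50, "ResNet+LFCC": 98.00, "LCNN+CQCC": 100.00, "BiLSTM": 35.54},
    ("Waveform", "FGSM"): {"AASIST": 0.00, "ResNet+LFCC": 75.25, "LCNN+CQCC": 100.00, "BiLSTM": 74.76},
    ("Waveform", "BIM"):  {"AASIST": 100.00, "ResNet+LFCC": 98.75, "LCNN+CQCC": 100.00, "BiLSTM": 74.50},
    ("Waveform", "CW"):   {"AASIST": 83.58, "ResNet+LFCC": 100.00, "LCNN+CQCC": 100.00, "BiLSTM": 77.42},
}

# "Avg. Black-box ASR" column as reported in the table.
reported_avg = {
    ("STFT-mag", "PGD"): 34.58, ("STFT-mag", "FGSM"): 4.59,
    ("STFT-mag", "BIM"): 2.99, ("STFT-mag", "CW"): 2.63,
    ("Waveform", "PGD"): 77.85, ("Waveform", "FGSM"): 83.34,
    ("Waveform", "BIM"): 91.08, ("Waveform", "CW"): 92.47,
}


def black_box_mean(row, source="AASIST"):
    """Mean ASR over all detectors except the assumed source model."""
    return fmean(v for name, v in row.items() if name != source)


computed = {key: black_box_mean(row) for key, row in rows.items()}
```

Note that the source-model ASR and the black-box average can diverge sharply. For waveform FGSM, the AASIST column is 0.00 while the black-box average is 83.34, so the two quantities are best read separately.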
| Domain | Attack | HiFi-GAN | Fullband MelGAN | StyleMelGAN | Parallel WaveGAN | Avg. WER |
|---|---|---|---|---|---|---|
| Clean | None | 3.74 | 3.99 | 4.12 | 3.97 | 3.96 |
| STFT-mag | FGSM | 3.72 | 3.97 | 4.03 | 3.99 | 3.93 |
| STFT-mag | BIM | 3.72 | 3.98 | 4.02 | 4.06 | 3.95 |
| STFT-mag | CW | 3.74 | 4.01 | 4.05 | 4.01 | 3.95 |
| STFT-mag | PGD | 4.65 | 5.50 | 5.51 | 5.28 | 5.24 |
| Waveform | FGSM | 16.28 | 17.83 | 16.05 | 19.94 | 17.53 |
| Waveform | BIM | 11.77 | 13.92 | 12.35 | 13.30 | 12.84 |
| Waveform | CW | 3.85 | 4.18 | 4.17 | 4.09 | 4.07 |
| Waveform | PGD | 3.84 | 4.12 | 4.16 | 3.98 | 4.03 |
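The "Avg. WER" column matches the unweighted mean over the four vocoders. Subtracting the clean average gives a simple measure of how much each attack degrades intelligibility. The Python sketch below reproduces the averages and computes that difference. The difference is an illustrative derived quantity, not a value taken from the source, and the names are hypothetical.

```python
from statistics import fmean

# WER values copied from the table. Order: HiFi-GAN, Fullband MelGAN,
# StyleMelGAN, Parallel WaveGAN.
wer = {
    ("Clean", "None"):    [3.74, 3.99, 4.12, 3.97],
    ("STFT-mag", "FGSM"): [3.72, 3.97, 4.03, 3.99],
    ("STFT-mag", "BIM"):  [3.72, 3.98, 4.02, 4.06],
    ("STFT-mag", "CW"):   [3.74, 4.01, 4.05, 4.01],
    ("STFT-mag", "PGD"):  [4.65, 5.50, 5.51, 5.28],
    ("Waveform", "FGSM"): [16.28, 17.83, 16.05, 19.94],
    ("Waveform", "BIM"):  [11.77, 13.92, 12.35, 13.30],
    ("Waveform", "CW"):   [3.85, 4.18, 4.17, 4.09],
    ("Waveform", "PGD"):  [3.84, 4.12, 4.16, 3.98],
}

# "Avg. WER" column as reported in the table.
reported_avg = {
    ("Clean", "None"): 3.96,
    ("STFT-mag", "FGSM"): 3.93, ("STFT-mag", "BIM"): 3.95,
    ("STFT-mag", "CW"): 3.95, ("STFT-mag", "PGD"): 5.24,
    ("Waveform", "FGSM"): 17.53, ("Waveform", "BIM"): 12.84,
    ("Waveform", "CW"): 4.07, ("Waveform", "PGD"): 4.03,
}

avg_wer = {key: fmean(vals) for key, vals in wer.items()}
clean_avg = avg_wer[("Clean", "None")]

# Change in average WER relative to clean audio (illustrative, derived here).
delta = {key: avg - clean_avg for key, avg in avg_wer.items() if key[0] != "Clean"}

worst = max(delta, key=delta.get)
print(worst, round(delta[worst], 2))
```

On these numbers, waveform-domain FGSM and BIM raise the average WER well above the clean baseline, while the remaining configurations stay within roughly 1.3 points of it.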
| Domain | Vocoder | FGSM | BIM | PGD | CW |
|---|---|---|---|---|---|
| STFT-mag | HiFi-GAN | | | | |
| STFT-mag | Fullband MelGAN | | | | |
| STFT-mag | StyleMelGAN | | | | |
| STFT-mag | Parallel WaveGAN | | | | |
| Waveform | HiFi-GAN | | | | |
| Waveform | Fullband MelGAN | | | | |
| Waveform | StyleMelGAN | | | | |
| Waveform | Parallel WaveGAN | | | | |
| Domain | Attack | Avg. WER | WER | Avg. SNR |
|---|---|---|---|---|
| STFT-mag | FGSM | |||
| STFT-mag | BIM | |||
| STFT-mag | PGD | |||
| STFT-mag | CW | |||
| Waveform | FGSM | |||
| Waveform | BIM | |||
| Waveform | PGD | |||
| Waveform | CW |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.