Preprint
Article

This version is not peer-reviewed.

Evaluating Adversarial Robustness of Deepfake Audio Detectors and Vocoder Fingerprint Detectors Against Universal Adversarial Perturbations

Submitted: 02 June 2026

Posted: 03 June 2026


Abstract
Audio deepfake and vocoder fingerprint detectors are increasingly used to identify synthetic speech and attribute it to its generating model. However, their robustness against adversarial perturbations remains unclear across different attack algorithms, perturbation domains, and detector representations. This paper presents a comparative study of four adversarial attacks, namely Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and Carlini–Wagner (CW), against audio deepfake and vocoder fingerprint detectors. Each attack is implemented in both waveform-domain and STFT-magnitude-domain settings. All attacks are optimized against AASIST using a targeted fake-to-real objective and are evaluated on synthetic speech generated by HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN. Black-box transferability is assessed across multiple detector families, including AASIST, ResNet with LFCC features, LCNN with CQCC features, and a BiLSTM-based detector. The results show that adversarial effectiveness depends strongly on perturbation domain and detector representation. STFT-magnitude PGD transfers strongly to LFCC-based ResNet detectors but has limited effect on CQCC-based and recurrent detectors. In contrast, waveform-domain attacks produce broader transferability, although FGSM and BIM substantially degrade audio quality. To distinguish effective adversarial perturbations from destructive signal degradation, we evaluate audio quality and intelligibility using word error rate (WER) and signal-to-noise ratio (SNR). Overall, the findings show that robustness claims in audio deepfake detection are of limited value unless they are tested across attack algorithms, perturbation domains, and detector representations.

1. Introduction

Neural vocoders have become a central component of modern speech synthesis systems, enabling the generation of highly natural speech waveforms from intermediate acoustic representations such as mel-spectrograms. Models such as HiFi-GAN, MelGAN variants, StyleMelGAN, and Parallel WaveGAN [1,2,3,4] can produce synthetic speech with high perceptual quality, making synthetic audio increasingly difficult to distinguish from genuine recordings. While these systems support legitimate applications in text-to-speech and voice conversion, they also increase the risk of audio deepfakes being used for impersonation, misinformation, and fraud.
To address this risk, audio deepfake detection methods aim to distinguish synthetic speech from real speech. Beyond binary real-versus-fake detection, a related forensic task is vocoder fingerprint detection, where the objective is to identify the vocoder or synthesis model responsible for generating a speech sample. The motivation behind vocoder fingerprinting is that different neural vocoders may leave model-specific artifacts in the generated waveform [29]. These artifacts can appear in spectral structure, cepstral patterns, temporal dynamics, or other signal characteristics learned by the detector. If reliable, vocoder fingerprints can support both synthetic speech detection and source attribution.
However, the reliability of vocoder fingerprint detection depends on whether these artifacts remain stable under perturbation. A detector may achieve high accuracy on clean synthetic speech but fail when the input is intentionally modified to disrupt the fingerprint cues used for classification. This is particularly important in adversarial settings, where an attacker may attempt to preserve the perceived speech content while causing the detector to misclassify the sample. Therefore, evaluating clean accuracy alone is insufficient for assessing the practical robustness of vocoder fingerprint detectors.
Adversarial attacks provide a direct way to examine this vulnerability [15,16]. In audio, adversarial perturbations can be applied in different domains. Waveform-domain attacks modify the raw audio samples directly, while frequency-domain attacks modify a time-frequency representation such as the STFT magnitude before reconstructing the waveform. These perturbation domains may interact differently with detector architectures. For example, waveform-based detectors such as AASIST [14] may rely on raw temporal cues, while feature-based detectors using LFCC or CQCC [12,13] representations may be more sensitive to spectral and cepstral distortions. As a result, an attack that is effective against one detector may not transfer to another.
This transferability problem is central to realistic adversarial evaluation [17]. In practice, an attacker may not know the exact detector used by a forensic system. If adversarial audio optimized against one detector also fools other detectors, this indicates a broader weakness across detector families. Conversely, if transferability is limited, this suggests that robustness may depend strongly on the detector representation. Existing evaluations [17] often focus on a single detector, feature type, attack algorithm, or perturbation domain, making it difficult to determine whether observed vulnerabilities generalize across vocoder fingerprint detectors.
This paper presents a comparative study of adversarial attacks against audio deepfake and vocoder fingerprint detectors. We evaluate four representative attacks: Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and Carlini–Wagner (CW). Each attack is implemented in both waveform-domain and STFT-magnitude-domain settings. All attacks are optimized against AASIST using a targeted fake-to-real objective, and the resulting adversarial audio is evaluated for black-box transferability across multiple detector families, including ResNet with LFCC features, LCNN with CQCC features, a BiLSTM-based detector, and AASIST itself. Experiments are conducted on synthetic speech generated by HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN.
The contributions of this paper are threefold. First, we provide a controlled comparison of FGSM, BIM, PGD, and CW attacks under a unified AASIST-source optimization setting. Second, we compare waveform-domain and STFT-magnitude-domain perturbations to examine how perturbation domain affects attack success and black-box transferability. Third, we evaluate transferability across detectors with different input representations and architectures, while also measuring audio quality and intelligibility using word error rate (WER) and signal-to-noise ratio (SNR). Through this analysis, we show that adversarial robustness in vocoder fingerprint detection depends jointly on attack algorithm, perturbation domain, and detector representation.

3. Methodology

3.1. Overview

This study compares the robustness of audio deepfake and vocoder fingerprint detectors under multiple adversarial attack settings. Four attacks are evaluated: FGSM, BIM, PGD, and CW. Each attack is implemented in two perturbation domains: the waveform domain and the STFT-magnitude domain.
All attacks are optimized against AASIST using a targeted fake-to-real objective. The learned adversarial perturbations are then applied to synthetic speech generated by multiple vocoders and evaluated on both the source detector and independent target detectors. This design allows us to examine source-detector attack success, black-box transferability, and the effect of perturbation domain on detector robustness.

3.2. Problem Formulation

Let $x \in [-1, 1]^{T}$ denote a clean synthetic speech waveform of length $T$. The source detector is denoted as $f_s(\cdot)$, where $f_s$ corresponds to AASIST in this study. AASIST is treated as a binary classifier with two classes:
$$0 = \text{fake}, \qquad 1 = \text{real}.$$
For a synthetic input $x$, the true label is $y = 0$, and the target adversarial label is $y_t = 1$. The objective is to generate an adversarial waveform $x_{\text{adv}}$ such that the source detector predicts the target class while the audio remains close to the original waveform:
$$x_{\text{adv}} = A(x; \delta),$$
where $A(\cdot)$ denotes the attack transformation and $\delta$ denotes the adversarial perturbation. The form of $\delta$ depends on the perturbation domain. In the waveform domain, $\delta$ is added directly to the raw audio samples. In the STFT-magnitude domain, $\delta$ is applied to the magnitude spectrum before waveform reconstruction.

3.3. Universal Perturbation Setting

All attacks are implemented as universal perturbations. Instead of optimizing a separate perturbation for each utterance, each attack learns one reusable perturbation from a source vocoder dataset:
$$\mathcal{D}_{\text{src}} = \{x_1, x_2, \ldots, x_N\}.$$
The universal perturbation is optimized as:
$$\delta^{*} = \arg\min_{\delta \in \mathcal{S}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_s(A(x_i; \delta)),\, y_t\big),$$
where $\delta^{*}$ is the optimized perturbation, $\mathcal{S}$ is the allowed perturbation set, and $\mathcal{L}(\cdot)$ is the attack loss. For FGSM, BIM, and PGD, cross-entropy loss toward the target real class is used. For CW, a logit-margin loss is used.
After optimization, the same perturbation $\delta^{*}$ is applied to evaluation audio without re-optimization:
$$x_{\text{adv}} = A(x; \delta^{*}).$$
This universal setting creates a stricter transferability evaluation than per-sample optimization because the perturbation must generalize across utterances and vocoders.
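The universal objective can be illustrated with a minimal sketch. The linear `toy_detector_logits` function below is a hypothetical stand-in for the source detector (it is not AASIST), and the waveform-domain form of the attack transformation is assumed. The point is only that one shared perturbation is scored by its mean targeted loss over the whole source set.

```python
import numpy as np

def toy_detector_logits(x, w):
    """Hypothetical stand-in for f_s: returns [fake_logit, real_logit]."""
    s = float(np.dot(w, x))
    return np.array([-s, s])

def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class label."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

def universal_loss(delta, dataset, w, target=1):
    """Mean targeted loss when the same delta is applied to every utterance."""
    losses = []
    for x in dataset:
        x_adv = np.clip(x + delta, -1.0, 1.0)  # assumed waveform-domain A(x; delta)
        losses.append(cross_entropy(toy_detector_logits(x_adv, w), target))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
w = rng.normal(size=64)                                   # toy detector weights
dataset = [rng.uniform(-0.5, 0.5, size=64) for _ in range(16)]
loss_clean = universal_loss(np.zeros(64), dataset, w)
loss_push = universal_loss(0.01 * np.sign(w), dataset, w)  # one delta for all x_i
```

In this toy setting the shared perturbation raises the real logit of every utterance, so `loss_push` is lower than `loss_clean`. A real detector is nonlinear, so a single perturbation generally cannot help every utterance equally.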

3.4. Perturbation Domains

3.4.1. Waveform-Domain Perturbation

In the waveform-domain setting, the perturbation is added directly to the raw waveform:
$$x_{\text{adv}} = \operatorname{clip}(x + \delta_w, -1, 1),$$
where $\delta_w$ is the waveform-domain perturbation. The clipping operation ensures that the adversarial waveform remains within the normalized amplitude range. The perturbation is constrained by an $\ell_\infty$ bound:
$$\|\delta_w\|_\infty \le \epsilon.$$
This setting evaluates whether direct sample-level perturbations can disrupt detectors that operate on raw waveforms or on features extracted from waveforms.
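A minimal sketch of the waveform-domain transformation, assuming NumPy arrays of equal length for the waveform and the perturbation. The function names are illustrative and are not taken from the authors' code.

```python
import numpy as np

def project_linf(delta, eps):
    """Project a perturbation onto the L-infinity ball of radius eps."""
    return np.clip(delta, -eps, eps)

def apply_waveform_perturbation(x, delta, eps):
    """Add a bounded perturbation and keep the result in [-1, 1]."""
    delta = project_linf(delta, eps)
    return np.clip(x + delta, -1.0, 1.0)

x = np.array([0.999, -0.5, 0.0])
delta = np.array([0.01, -0.01, 0.002])
x_adv = apply_waveform_perturbation(x, delta, eps=0.003)
```

Amplitude clipping can only shrink the effective perturbation, so the difference between `x_adv` and `x` never exceeds the budget.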

3.4.2. STFT-Magnitude-Domain Perturbation

In the STFT-magnitude-domain setting, the perturbation is applied to the magnitude spectrum. Given a waveform x, its short-time Fourier transform is:
$$X = \operatorname{STFT}(x).$$
The complex STFT is decomposed into magnitude and phase:
$$M = |X|, \qquad P = \frac{X}{|X| + \eta},$$
where $\eta$ is a small constant used to avoid division by zero. The adversarial magnitude is computed as:
$$M_{\text{adv}} = M + \delta_M,$$
where $\delta_M$ is the STFT-magnitude perturbation. The adversarial waveform is reconstructed using the original phase:
$$x_{\text{adv}} = \operatorname{ISTFT}(M_{\text{adv}} \cdot P).$$
Since STFT magnitudes are not bounded within a fixed amplitude range, a relative perturbation budget is used:
$$|\delta_M| \le \epsilon_{\text{rel}} \cdot M_{\text{ref}},$$
where $M_{\text{ref}}$ is a reference magnitude template computed from the source training set, and $\epsilon_{\text{rel}}$ controls the perturbation strength.
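The magnitude-domain pipeline above can be sketched with a small NumPy STFT. Several details are assumptions made only for illustration: a periodic Hann window, reflect padding, overlap-add inversion, and a per-frequency-bin mean magnitude as the reference template. The source does not specify any of these.

```python
import numpy as np

N_FFT, HOP = 1024, 256

def stft(x, n_fft=N_FFT, hop=HOP):
    window = np.hanning(n_fft + 1)[:-1]          # periodic Hann (assumed window)
    pad = n_fft // 2
    xp = np.pad(x, (pad, pad), mode="reflect")   # assumed centre padding
    n_frames = 1 + (len(xp) - n_fft) // hop
    frames = np.stack([xp[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)           # shape (frames, n_fft // 2 + 1)

def istft(X, length, n_fft=N_FFT, hop=HOP):
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(X, n=n_fft, axis=1) * window
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):           # weighted overlap-add
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-8)
    pad = n_fft // 2
    return out[pad:pad + length]

def perturb_magnitude(x, delta_m, m_ref, eps_rel, eta=1e-8):
    X = stft(x)
    M = np.abs(X)
    P = X / (M + eta)                            # phase term kept from the clean signal
    bound = eps_rel * m_ref                      # relative per-bin budget
    delta_m = np.clip(delta_m, -bound, bound)    # enforce |delta_M| <= eps_rel * M_ref
    return istft((M + delta_m) * P, len(x))

rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=4096)
m_ref = np.abs(stft(x)).mean(axis=0)             # assumed per-bin template
x_same = perturb_magnitude(x, 0.0, m_ref, eps_rel=0.03)
x_adv = perturb_magnitude(x, 1e9, m_ref, eps_rel=0.03)
```

With a zero perturbation the round trip returns the input up to numerical error. Note that a negative perturbation larger than the local magnitude makes `M + delta_m` negative, which is equivalent to flipping the phase of that bin, so an implementation may choose to clamp the perturbed magnitude at zero.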

3.5. Attack Algorithms

The four attacks differ in how they optimize the universal perturbation.
FGSM performs a single targeted gradient-sign update:
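$$\delta^{*} = -\epsilon \cdot \operatorname{sign}\Big(\nabla_{\delta}\, \mathcal{L}_{\text{CE}}\big(f_s(A(x; \delta)),\, y_t\big)\Big).$$
BIM extends FGSM by applying iterative projected updates:
$$\delta_{k+1} = \Pi_{\mathcal{S}}\Big(\delta_k - \alpha \cdot \operatorname{sign}\big(\nabla_{\delta}\, \mathcal{L}_{\text{CE}}(f_s(A(x; \delta_k)),\, y_t)\big)\Big),$$
where $\alpha$ is the step size and $\Pi_{\mathcal{S}}(\cdot)$ projects the perturbation back into the allowed set $\mathcal{S}$.
PGD follows the same iterative update rule as BIM but starts from a random initialization within the allowed perturbation set:
$$\delta_0 \sim \mathcal{U}(\mathcal{S}).$$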
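The three gradient-sign attacks can be sketched on a toy problem. The linear detector and its analytic gradient below are hypothetical stand-ins chosen so that the example runs without a deep-learning framework. Amplitude clipping of the waveform is omitted for brevity.

```python
import numpy as np

def targeted_ce_and_grad(delta, batch, w):
    """Mean cross-entropy toward 'real' for a toy linear detector, plus d(loss)/d(delta)."""
    s = (batch + delta) @ w                      # logits are [-s, s] per utterance
    loss = np.logaddexp(0.0, -2.0 * s)           # -log softmax([-s, s])[real]
    dloss_ds = -2.0 / (1.0 + np.exp(2.0 * s))
    grad = np.mean(dloss_ds[:, None] * w[None, :], axis=0)
    return float(loss.mean()), grad

def fgsm(batch, w, eps):
    """Single targeted gradient-sign step from a zero perturbation."""
    _, grad = targeted_ce_and_grad(np.zeros(batch.shape[1]), batch, w)
    return -eps * np.sign(grad)

def iterative_attack(batch, w, eps, alpha, steps, random_init, rng):
    """BIM when random_init is False, PGD when it is True."""
    dim = batch.shape[1]
    delta = rng.uniform(-eps, eps, size=dim) if random_init else np.zeros(dim)
    for _ in range(steps):
        _, grad = targeted_ce_and_grad(delta, batch, w)
        delta = delta - alpha * np.sign(grad)    # descend the targeted loss
        delta = np.clip(delta, -eps, eps)        # projection onto the L-inf ball
    return delta

rng = np.random.default_rng(0)
batch = rng.uniform(-0.5, 0.5, size=(8, 32))     # 8 toy "utterances"
w = rng.normal(size=32)
eps, alpha = 0.003, 0.0003
d_fgsm = fgsm(batch, w, eps)
d_bim = iterative_attack(batch, w, eps, alpha, 50, False, rng)
d_pgd = iterative_attack(batch, w, eps, alpha, 50, True, rng)
loss_clean, _ = targeted_ce_and_grad(np.zeros(32), batch, w)
loss_bim, _ = targeted_ce_and_grad(d_bim, batch, w)
```

For a linear detector the gradient sign never changes, so all three attacks end at the same corner of the budget. The differences between FGSM, BIM, and PGD only appear with nonlinear detectors, where the gradient direction changes between steps.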
CW is implemented as an optimization-based targeted attack using a logit-margin objective. Let $z_{\text{fake}}$ and $z_{\text{real}}$ denote the fake and real logits produced by AASIST. The CW loss is:
$$\mathcal{L}_{\text{CW}} = \max\big(z_{\text{fake}} - z_{\text{real}} + \kappa,\, 0\big),$$
where $\kappa$ is a confidence margin. The total CW objective is:
$$\mathcal{L}_{\text{total}} = D(\delta) + c \cdot \mathcal{L}_{\text{CW}},$$
where $D(\delta)$ penalizes perturbation size and $c$ controls the weight of the classification loss.
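A short sketch of the CW objective follows. The source does not state the form of the distortion penalty, so a squared L2 norm is assumed here purely for illustration, and the margin loss is averaged over a batch.

```python
import numpy as np

def cw_margin_loss(z_fake, z_real, kappa=0.0):
    """Hinge on the logit margin: zero once real exceeds fake by at least kappa."""
    return np.maximum(z_fake - z_real + kappa, 0.0)

def cw_total(delta, z_fake, z_real, c=1.0, kappa=0.0):
    distortion = float(np.sum(delta ** 2))       # assumed squared-L2 form of D(delta)
    margin = float(np.mean(cw_margin_loss(z_fake, z_real, kappa)))
    return distortion + c * margin

still_fake = cw_margin_loss(2.0, 0.5)            # detector still prefers "fake"
already_real = cw_margin_loss(0.5, 2.0)          # detector already prefers "real"
total = cw_total(np.array([0.1, -0.1]), np.array([2.0]), np.array([0.5]))
```

Once the real logit exceeds the fake logit by the margin, the classification term vanishes and only the distortion penalty continues to shrink the perturbation.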

3.6. Transferability Evaluation

The optimized adversarial audio is evaluated under two criteria: source-detector evasion and black-box transferability. For AASIST, attack success is defined as fake audio being classified as real:
$$\text{ASR}_{\text{AASIST}} = \frac{N_{\text{fake} \to \text{real}}}{N},$$
where $N_{\text{fake} \to \text{real}}$ is the number of fake samples predicted as real, and $N$ is the total number of evaluated samples.
For multi-class vocoder attribution detectors, attack success is defined as incorrect vocoder attribution:
$$\text{ASR}_{\text{multi}} = 1 - \frac{N_{\text{correct}}}{N},$$
where $N_{\text{correct}}$ is the number of attacked samples still classified as their true vocoder class. This definition treats any incorrect vocoder prediction as a successful disruption of source attribution.
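Both definitions reduce to simple counting. The sketch below assumes integer class predictions, with label 1 meaning real for the binary detector; the function names are illustrative.

```python
def asr_binary(predictions, real_label=1):
    """Fraction of attacked fake samples that the detector predicts as real."""
    return sum(p == real_label for p in predictions) / len(predictions)

def asr_multiclass(predictions, true_labels):
    """One minus the fraction of attacked samples still attributed correctly."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return 1.0 - correct / len(true_labels)

binary_asr = asr_binary([1, 0, 1, 1])                     # 3 of 4 fake clips pass as real
multi_asr = asr_multiclass([0, 2, 2, 3], [0, 1, 2, 3])    # 1 of 4 clips misattributed
```

The two rates are not directly comparable: the binary rate requires a specific targeted error, while the multi-class rate counts any wrong vocoder label as a success.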

3.7. Audio Quality and Intelligibility Metrics

Attack success alone is insufficient because a detector may fail when the audio is heavily degraded. Therefore, adversarial effectiveness is evaluated together with speech intelligibility and signal distortion.
Word error rate (WER) is used to measure transcription-level intelligibility:
$$\text{WER} = \frac{S + D + I}{N_w},$$
where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions, respectively, and $N_w$ is the number of words in the reference transcript. The WER increase is defined as:
$$\Delta \text{WER} = \text{WER}_{\text{adv}} - \text{WER}_{\text{clean}}.$$
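WER can be computed with a word-level edit distance. The sketch below assumes whitespace tokenization and a non-empty reference. Practical evaluations usually also normalize case and punctuation before scoring, which is omitted here.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "the cat sat on the mat"
wer_clean = wer(reference, "the cat sat on the mat")
wer_adv = wer(reference, "the cat sit on mat")   # one substitution, one deletion
delta_wer = wer_adv - wer_clean
```

In this example the adversarial transcript contains two errors against a six-word reference, so the WER increase is one third.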
Signal-to-noise ratio (SNR) is used to measure signal-level distortion between clean and adversarial waveforms:
$$\text{SNR} = 10 \log_{10} \frac{\|x\|_2^2}{\|x - x_{\text{adv}}\|_2^2 + \eta},$$
where $\eta$ is a small constant used to avoid division by zero. Higher SNR indicates that the adversarial audio remains closer to the clean waveform.
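The SNR definition translates directly into a few lines of NumPy. The value of the stabilizing constant is an assumption, since the source does not give it.

```python
import numpy as np

def snr_db(x, x_adv, eta=1e-12):
    """Signal-to-noise ratio in dB between a clean and an adversarial waveform."""
    signal = np.sum(x ** 2)
    noise = np.sum((x - x_adv) ** 2)
    return 10.0 * np.log10(signal / (noise + eta))

x = np.ones(100)
x_adv = x + 0.1          # constant offset of 10% of the signal amplitude
snr = snr_db(x, x_adv)
```

Here the noise energy is one hundredth of the signal energy, which corresponds to an SNR of about 20 dB.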

4. Experimental Setup

4.1. Dataset and Vocoders

Experiments are conducted using synthetic speech derived from the LJSpeech corpus [28]. LJSpeech contains approximately 13,100 English utterances from a single female speaker recorded at 22.05 kHz. This provides a controlled single-speaker setting for evaluating vocoder-generated speech and adversarial perturbations.
Four neural vocoders are evaluated: HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN. HiFi-GAN is used as the source vocoder for adversarial optimization, while all four vocoders are used during evaluation. This setup allows the experiment to measure both same-vocoder performance and cross-vocoder transferability. After an attack is optimized on HiFi-GAN-generated speech, the resulting universal perturbation is applied to all vocoder outputs without re-optimization.
All adversarial audio is generated at 22.05 kHz. For detectors requiring a different sampling rate, such as AASIST, audio is resampled internally during evaluation. The attacks are applied after waveform synthesis, and no vocoder model weights are modified.

4.2. Detector Configuration

Four detector configurations are used: AASIST, ResNet+LFCC, LCNN+CQCC, and BiLSTM. These detectors are selected to cover different input representations and model architectures.
AASIST is used as the source detector for attack optimization. It is treated as a binary fake-versus-real classifier, where the target attack objective is to classify synthetic speech as real. The remaining detectors are used to evaluate black-box transferability.
The ResNet+LFCC detector is a multi-class vocoder attribution model referenced from baseline work [29]. Each waveform is converted into linear frequency cepstral coefficient (LFCC) features before classification. The LCNN+CQCC detector follows the same multi-class attribution setting but uses constant-Q cepstral coefficient (CQCC) features with a light convolutional neural network. The BiLSTM detector is included as a temporal sequence-based model to test whether attacks optimized against AASIST transfer to recurrent architectures.
For the multi-class detectors, the true label corresponds to the source vocoder of the evaluated sample. An attack is considered successful if the detector no longer predicts the correct vocoder class. For AASIST, an attack is considered successful if synthetic speech is classified as real.

4.3. Attack Configuration

Four adversarial attacks are evaluated: FGSM, BIM, PGD, and CW. Each attack is implemented in both waveform-domain and STFT-magnitude-domain settings. All attacks are optimized against AASIST using a targeted fake-to-real objective and saved as universal perturbation checkpoints.
Table 1. Attack hyperparameters used for waveform-domain and STFT-magnitude-domain attacks.

| Parameter | Waveform domain | STFT-magnitude domain |
|---|---|---|
| Sampling rate | 22,050 Hz | 22,050 Hz |
| Chunk length | 4 s / 88,200 samples | 4 s / 88,200 samples |
| STFT $n_{\text{fft}}$ | n/a | 1024 |
| STFT hop length | n/a | 256 |
| STFT window length | n/a | 1024 |
| Perturbation budget | $\epsilon = 0.003$ / $0.006$ | $\epsilon_{\text{rel}} = 0.03$ / $0.05$ |
| BIM step size | $\alpha = 0.0003$ | $\alpha_{\text{rel}} = 0.003$ |
| PGD step size | $\alpha = 0.0003$ | $\alpha_{\text{rel}} = 0.003$ |
| BIM/PGD steps | 5000 | 5000 |
| Batch size | 8 | 8 |
| CW steps | 5000 | 5000 |
| CW learning rate | 0.001 | 0.001 |
| CW confidence margin | $\kappa = 0$ | $\kappa = 0$ |
| CW loss weight | $c = 1.0$ | $c = 1.0$ |
For waveform-domain attacks, the perturbation is added directly to the audio waveform. The universal perturbation is optimized over fixed-length chunks of 4 seconds. Since the sampling rate is 22,050 Hz, each waveform perturbation contains:
$$T_{\delta} = 4 \times 22{,}050 = 88{,}200$$
samples. The adversarial waveform is clipped to the valid amplitude range $[-1, 1]$ after perturbation.
For STFT-magnitude-domain attacks, the perturbation is applied to the magnitude spectrum while preserving the original phase for inverse STFT reconstruction. The STFT uses $n_{\text{fft}} = 1024$, hop length 256, and window length 1024. The magnitude perturbation is constrained using a relative perturbation budget based on a reference magnitude template computed from the source data.
FGSM is implemented as a one-step universal attack. BIM and PGD are iterative projected-gradient attacks, with PGD using random initialization within the allowed perturbation range. CW is implemented as an optimization-based attack using a targeted logit-margin objective and Adam optimization. During iterative attack optimization, random cropping is used so that the universal perturbation is exposed to different temporal regions of the source utterances.
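The random-cropping step can be sketched as follows. Zero-padding of utterances shorter than one chunk and uniform sampling of utterances are assumptions, since the source does not describe how short clips or batch selection are handled.

```python
import numpy as np

SAMPLE_RATE = 22050
CHUNK = 4 * SAMPLE_RATE          # 88,200 samples, the length of the perturbation

def random_crop(x, rng, length=CHUNK):
    """Return a random fixed-length window of x."""
    if len(x) < length:
        return np.pad(x, (0, length - len(x)))   # assumed: zero-pad short utterances
    start = int(rng.integers(0, len(x) - length + 1))
    return x[start:start + length]

def sample_batch(utterances, rng, batch_size=8):
    """Draw utterances with replacement and crop each to one chunk."""
    idx = rng.integers(0, len(utterances), size=batch_size)
    return np.stack([random_crop(utterances[i], rng) for i in idx])

rng = np.random.default_rng(0)
utterances = [np.zeros(100000), np.ones(50000)]  # placeholder waveforms
batch = sample_batch(utterances, rng)
```

Because each optimization step sees a different window, the same perturbation sample is aligned with different speech content over time, which discourages it from fitting one particular temporal position.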
The perturbation budgets are domain-specific because waveform-domain and STFT-magnitude-domain perturbations operate on different signal representations and are not directly comparable by their raw ϵ values. Therefore, cross-domain comparisons are interpreted together with WER and SNR results, which provide a common quality-based reference for the amount of audio degradation introduced by each attack.

4.4. Evaluation Protocol

The evaluation consists of three stages. First, each attack is optimized using HiFi-GAN-generated speech and AASIST as the source detector. Second, the saved universal perturbation checkpoint is applied to HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN audio without additional optimization. Third, the clean and adversarial audio are evaluated on AASIST, ResNet+LFCC, LCNN+CQCC, and BiLSTM.
For each detector and vocoder, clean accuracy is first recorded to verify that the detector performs reliably before attack. Attack success rate (ASR) is then computed using the definitions from Section 3.6. For AASIST, ASR measures fake-to-real misclassification. For multi-class detectors, ASR measures incorrect vocoder attribution.
Audio quality and intelligibility are evaluated using Whisper [30] word error rate (WER) and signal-to-noise ratio (SNR). These metrics are reported alongside ASR to distinguish quality-preserving adversarial examples from attacks that mainly succeed through signal degradation.
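The second and third stages of this protocol amount to a nested loop over vocoders and detectors. The sketch below uses hypothetical stub detectors and toy clips in place of real models and audio. Only the control flow is meant to reflect the protocol.

```python
def evaluate(perturb, audio_by_vocoder, detectors):
    """Return {(detector, vocoder): (clean_accuracy, asr)} for one saved perturbation."""
    results = {}
    for vocoder, (clips, label) in audio_by_vocoder.items():
        adv_clips = [perturb(x) for x in clips]   # same perturbation, no re-optimization
        for name, (predict, is_binary) in detectors.items():
            truth = 0 if is_binary else label     # binary detectors: 0 = fake
            clean = [predict(x) for x in clips]
            adv = [predict(x) for x in adv_clips]
            clean_acc = sum(p == truth for p in clean) / len(clips)
            if is_binary:
                asr = sum(p == 1 for p in adv) / len(adv)            # fake -> real
            else:
                asr = 1.0 - sum(p == truth for p in adv) / len(adv)  # misattribution
            results[(name, vocoder)] = (clean_acc, asr)
    return results

# Toy data: each clip is a short list of samples; labels are vocoder class ids.
audio = {"hifigan": ([[0.0, 0.1], [0.2, 0.1]], 0),
         "pwg": ([[0.3, 0.2], [0.4, 0.0]], 1)}
detectors = {
    "aasist_stub": (lambda x: 1 if max(x) > 0.45 else 0, True),
    "attrib_stub": (lambda x: 0 if sum(x) < 0.35 else 1, False),
}
results = evaluate(lambda x: [v + 0.5 for v in x], audio, detectors)
```

In this toy run the perturbation pushes every clip toward class 1, so the attribution stub is fully disrupted for the first vocoder and unaffected for the second. This illustrates why ASR is reported per detector and per vocoder.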

5. Results

5.1. Frequency-Domain Attack Results

Table 2 reports the ASR of STFT-magnitude-domain attacks across the four detectors and four vocoders. The strongest result is observed for PGD on ResNet+LFCC, where ASR reaches approximately 100% across all vocoders. Since the attack is optimized against AASIST rather than ResNet+LFCC, this indicates strong black-box transfer from the AASIST source detector to the LFCC-based attribution detector.
However, this transfer does not generalize uniformly across detectors. The same STFT-magnitude PGD perturbation has limited effect on LCNN+CQCC and no measurable effect on BiLSTM. For AASIST, PGD achieves moderate source-detector ASR, ranging from 26% to 53% across vocoders. FGSM, BIM, and CW remain weak in the STFT-magnitude setting, with low ASR on both AASIST and most black-box detectors. Overall, the frequency-domain results show that STFT-magnitude PGD is effective mainly against LFCC-based attribution, while the other attacks are largely ineffective under the tested budget.

5.2. Waveform-Domain Attack Results

Table 3 reports the ASR of waveform-domain attacks. Compared with the STFT-magnitude setting, waveform-domain perturbations produce broader transferability across feature-based detectors. LCNN+CQCC is the most vulnerable target, with all four waveform attacks reaching 100% ASR across all vocoders. ResNet+LFCC is also highly vulnerable, with most attacks producing near-complete attribution failure.
The BiLSTM detector shows a more mixed pattern. It is highly vulnerable for HiFi-GAN, but its response varies substantially for the other vocoders. For example, Fullband MelGAN remains relatively robust to PGD, FGSM, and BIM, while CW reaches high ASR. This indicates that BiLSTM transferability depends more strongly on the interaction between attack type and the vocoder that generated the evaluated audio.
For AASIST, waveform-domain BIM and CW are the strongest attacks. BIM reaches 100% ASR across all vocoders, while CW achieves consistently high ASR. PGD produces moderate ASR, whereas FGSM fails completely against AASIST. However, FGSM still transfers strongly to ResNet+LFCC and LCNN+CQCC, showing that source-detector success does not necessarily predict black-box transferability.

5.3. Cross-Domain Comparison

Table 4 summarizes the average ASR across vocoders for both perturbation domains. The results show a clear domain effect. In the STFT-magnitude domain, high ASR is concentrated mainly in PGD against ResNet+LFCC. In the waveform domain, high ASR is distributed more broadly across attacks and detectors, especially for ResNet+LFCC and LCNN+CQCC.
The ranking of attacks also changes across domains. For AASIST, PGD is the strongest attack in the STFT-magnitude domain, whereas BIM and CW are the strongest in the waveform domain and reach much higher ASR. For BiLSTM, STFT-magnitude attacks have almost no effect, whereas waveform-domain attacks produce higher but more variable ASR. These results show that perturbation domain is a major factor in adversarial effectiveness and transferability.

5.4. Black-Box Transferability Across Detectors

Table 5 compares source-detector ASR with black-box transfer ASR averaged across vocoders. Since all attacks are optimized against AASIST, the AASIST column represents source-detector performance, while ResNet+LFCC, LCNN+CQCC, and BiLSTM represent black-box target detectors.
In the STFT-magnitude domain, PGD achieves the strongest source-detector ASR and transfers almost completely to ResNet+LFCC, but it transfers poorly to LCNN+CQCC and BiLSTM. The other STFT-magnitude attacks show weak source and transfer performance. In contrast, waveform-domain BIM and CW achieve both strong source ASR and high average black-box ASR. Waveform-domain FGSM is the main exception: it achieves 0.00% ASR on AASIST but high average black-box ASR, indicating that black-box transferability must be evaluated separately from source-model success.

5.5. Audio Quality and Intelligibility

Table 6 reports Whisper WER, Table 7 reports SNR, and Table 8 summarizes the average WER and SNR across vocoders. The clean vocoded audio has an average WER of 3.96%, providing the baseline for intelligibility comparison.
For STFT-magnitude attacks, FGSM, BIM, and CW preserve audio quality well. Their WER remains close to the clean baseline and their SNR values are high, but their ASR values are low in most detector settings. STFT-magnitude PGD produces stronger transfer to ResNet+LFCC, but it increases average WER to 5.24% and reduces average SNR to 11.20 dB. This indicates a stronger distortion–effectiveness trade-off for PGD in the frequency domain.
For waveform-domain attacks, FGSM and BIM achieve high ASR on several detectors but substantially degrade audio quality. FGSM increases average WER to 17.53% with an average SNR of 3.33 dB, while BIM increases average WER to 12.84% with an average SNR of 6.59 dB. These results suggest that their high ASR should be interpreted cautiously. In contrast, waveform-domain PGD and CW keep WER close to the clean baseline, with average WER values of 4.03% and 4.07%, respectively. Together with their strong ASR results, waveform-domain PGD and CW provide the strongest quality-aware attack performance.

6. Discussion

The results show that adversarial robustness in vocoder fingerprint detection is not determined by the detector alone. Instead, it depends on the interaction between the attack algorithm, the perturbation domain, and the detector representation. This is important because a detector that appears robust under one attack setting may still be vulnerable under another.

6.1. Perturbation Domain and Detector Representation

The comparison between waveform-domain and STFT-magnitude-domain attacks shows that perturbation domain has a major effect on adversarial behavior. STFT-magnitude attacks mainly expose representation-specific vulnerability. In particular, STFT-magnitude PGD transfers strongly to ResNet+LFCC but has limited impact on LCNN+CQCC and BiLSTM. This suggests that the perturbation disrupts spectral structures that are highly relevant to LFCC-based attribution, but less damaging to CQCC-based or recurrent detectors.
Waveform-domain attacks produce a different pattern. They transfer more broadly to feature-based detectors, especially ResNet+LFCC and LCNN+CQCC. This indicates that direct waveform perturbations can disrupt downstream feature extraction even when the target detector itself does not operate directly on raw waveform input. Therefore, feature-based detectors should not be assumed to be protected from waveform-level adversarial manipulation.
These findings suggest that robustness evaluation should include more than one perturbation domain. A detector may appear robust against STFT-magnitude perturbations while remaining vulnerable to waveform-domain attacks, or vice versa. The results therefore support a domain-aware view of adversarial robustness, where attack effectiveness must be interpreted in relation to the signal representation being perturbed and the representation used by the target detector.

6.2. Source-Model Success Does Not Guarantee Transferability

All attacks in this study are optimized against AASIST, but the strongest transfer effects do not always correspond to the strongest source-model attacks. The clearest example is waveform-domain FGSM. Although it fails to achieve fake-to-real evasion on AASIST, it still causes substantial attribution failure on black-box feature-based detectors. This shows that source-detector ASR is not sufficient for evaluating the broader risk of an attack.
This result has practical implications for black-box adversarial evaluation. If evaluation only considers the source detector, an attack such as waveform-domain FGSM would appear ineffective. However, when evaluated against independent target detectors, the same perturbation produces strong attribution failure. Therefore, black-box transferability should be evaluated explicitly across independent detectors rather than inferred from source-model performance.
The detector-specific transfer patterns also suggest that different detector families rely on different fingerprint cues. LFCC-based, CQCC-based, recurrent, and raw-waveform detectors do not respond identically to the same adversarial perturbation. This means that robustness claims based on a single detector architecture are limited. A more reliable evaluation should include heterogeneous target detectors with different front-end representations and model structures.

6.3. Quality-Aware Interpretation of Attack Strength

The audio quality results show that high ASR alone is not enough to identify a strong adversarial attack. Some attacks achieve high detector disruption only at the cost of substantial signal degradation. Waveform-domain FGSM and BIM are the main examples: they produce strong black-box ASR, but they also substantially increase WER and produce very low SNR. Their effectiveness should therefore be interpreted cautiously, because part of their success may come from destructive distortion rather than subtle adversarial manipulation.
Waveform-domain PGD and CW provide a stronger trade-off. Both maintain WER close to the clean baseline while still achieving strong attack success across several detector settings. This makes them more convincing as quality-preserving adversarial attacks. In contrast, STFT-magnitude FGSM, BIM, and CW preserve audio quality but are weak in terms of ASR. STFT-magnitude PGD is more effective, especially against ResNet+LFCC, but its low SNR indicates stronger spectral distortion.
These results show why adversarial audio evaluation should report both attack effectiveness and audio quality. Without WER and SNR, waveform-domain FGSM and BIM would appear stronger than they actually are. With quality metrics included, waveform-domain PGD and CW emerge as more meaningful attacks because they better balance detector disruption and speech preservation.

7. Conclusion

This paper presented a comparative study of adversarial attacks against audio deepfake and vocoder fingerprint detectors. Four attacks, FGSM, BIM, PGD, and CW, were evaluated in both waveform-domain and STFT-magnitude-domain settings. All attacks were optimized against AASIST using a targeted fake-to-real objective and were then evaluated for black-box transferability across detectors with different input representations, including ResNet+LFCC, LCNN+CQCC, BiLSTM, and AASIST.
The results show that adversarial effectiveness depends strongly on the interaction between attack algorithm, perturbation domain, and detector representation. STFT-magnitude PGD transfers strongly to ResNet+LFCC but has limited effect on LCNN+CQCC and BiLSTM, indicating that frequency-domain vulnerability is representation-specific. In contrast, waveform-domain attacks produce broader transferability across feature-based detectors, particularly ResNet+LFCC and LCNN+CQCC. However, source-detector success does not always predict black-box transferability, as shown by waveform-domain FGSM, which fails against AASIST but transfers strongly to several black-box detectors.
The quality evaluation further shows that high ASR alone is insufficient for assessing adversarial audio. Waveform-domain FGSM and BIM achieve strong detector disruption but substantially degrade WER and SNR, while waveform-domain PGD and CW preserve intelligibility more effectively while maintaining strong attack performance. These findings highlight the need to evaluate adversarial attacks using both detection metrics and audio quality metrics.
Overall, this study demonstrates that robustness claims in audio deepfake and vocoder fingerprint detection should not be based on a single attack, detector, or perturbation domain. Future work should extend this comparison to additional source detectors, larger and more diverse datasets, additional vocoder families, and defended detection pipelines. Further work should also examine adaptive attacks and defense strategies such as preprocessing-based defenses, adversarial training, detector ensembling, and representation-level regularization.

References

  1. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033. [Google Scholar]
  2. Mustafa, A.; Pia, N.; Fuchs, G. StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 6034–6038. [Google Scholar]
  3. Yang, G.; Yang, S.; Liu, K.; Fang, P.; Chen, W.; Xie, L. Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop, Shenzhen, China, 19–22 January 2021; pp. 492–498. [Google Scholar]
  4. Yamamoto, R.; Song, E.; Kim, J.-M. Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6199–6203. [Google Scholar]
  5. Li, M.; Ahmadiadli, Y.; Zhang, X.-P. A Survey on Speech Deepfake Detection. ACM Comput. Surv. 2025, 57, 1–38. [Google Scholar] [CrossRef]
  6. Sisman, B.; Yamagishi, J.; King, S.; Li, H. An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 132–157. [Google Scholar] [CrossRef]
  7. Tan, X.; Qin, T.; Soong, F.; Liu, T.-Y. A Survey on Neural Speech Synthesis. arXiv 2021, arXiv:2106.15561. [Google Scholar] [CrossRef]
  8. Yi, J.; Wang, C.; Tao, J.; Zhang, X.; Zhang, C.Y.; Zhao, Y. Audio Deepfake Detection: A Survey. arXiv 2023, arXiv:2308.14970. [Google Scholar] [CrossRef]
  9. Hasanabadi, M.R. An Overview of Text-to-Speech Systems and Media Applications. arXiv 2023, arXiv:2310.14301. [Google Scholar]
  10. Spanias, A.S. Speech Coding: A Tutorial Review. Proc. IEEE 1994, 82, 1541–1582. [Google Scholar] [CrossRef]
  11. Shaaban, O.A.; Yildirim, R.; Alguttar, A.A. Audio Deepfake Approaches. IEEE Access 2023, 11, 132652–132682. [Google Scholar] [CrossRef]
  12. Zhou, X.; Garcia-Romero, D.; Duraiswami, R.; Espy-Wilson, C.; Shamma, S. Linear versus Mel Frequency Cepstral Coefficients for Speaker Recognition. In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011; pp. 559–564. [Google Scholar]
  13. Todisco, M.; Delgado, H.; Evans, N. A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral Coefficients. In Proceedings of the Odyssey 2016: The Speaker and Language Recognition Workshop, Bilbao, Spain, 21–24 June 2016; pp. 283–290. [Google Scholar]
  14. Jung, J.-w.; Heo, H.-S.; Tak, H.; Shim, H.-j.; Chung, J.S.; Lee, B.-J.; Yu, H.-J.; Evans, N. AASIST: Audio Anti-Spoofing Using Integrated Spectro-Temporal Graph Attention Networks. arXiv 2021, arXiv:2110.01200. [Google Scholar]
  15. Wang, X.; He, K. Enhancing the Transferability of Adversarial Attacks through Variance Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1924–1933. [Google Scholar]
  16. Chakraborty, A.; Alam, M.; Dey, V.; Chattopadhyay, A.; Mukhopadhyay, D. Adversarial Attacks and Defences: A Survey. arXiv 2018, arXiv:1810.00069. [Google Scholar] [CrossRef]
  17. Demontis, A.; Melis, M.; Pintor, M.; Jagielski, M.; Biggio, B.; Oprea, A.; Nita-Rotaru, C.; Roli, F. Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks. In Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 14–16 August 2019; pp. 321–338. [Google Scholar]
  18. Żelasko, P.; Joshi, S.; Shao, Y.; Villalba, J.; Trmal, J.; Dehak, N.; Khudanpur, S. Adversarial Attacks and Defenses for Speech Recognition Systems. arXiv 2021, arXiv:2103.17122. [Google Scholar] [CrossRef]
  19. Sun, C.; Jia, S.; Hou, S.; Lyu, S. AI-Synthesized Voice Detection Using Neural Vocoder Artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 904–912. [Google Scholar]
  20. Deng, J.; Ren, Y.; Zhang, T.; Zhu, H.; Sun, Z. VFD-Net: Vocoder Fingerprints Detection for Fake Audio. In Proceedings of the ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, 14–19 April 2024; pp. 12151–12155. [Google Scholar]
  21. Li, F.; Chen, Y.; Liu, H.; Zhao, Z.; Yao, Y.; Liao, X. Vocoder Detection of Spoofing Speech Based on GAN Fingerprints and Domain Generalization. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–20. [Google Scholar] [CrossRef]
  22. Costa, J.C.; Roxo, T.; Proença, H.; Inacio, P.R.M. How Deep Learning Sees the World: A Survey on Adversarial Attacks and Defenses. IEEE Access 2024, 12, 61113–61136. [Google Scholar] [CrossRef]
  23. Kurakin, A.; Goodfellow, I.; Bengio, S. Adversarial Machine Learning at Scale. arXiv 2016, arXiv:1611.01236. [Google Scholar]
  24. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  25. Carlini, N.; Wagner, D. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Dallas, TX, USA, 3 November 2017; pp. 3–14. [Google Scholar]
  26. Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial Examples in the Physical World. In Artificial Intelligence Safety and Security; Yampolskiy, R.V., Ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018; pp. 99–112. [Google Scholar]
  27. Papernot, N.; McDaniel, P.; Goodfellow, I. Transferability in Machine Learning: From Phenomena to Black-Box Attacks Using Adversarial Samples. arXiv 2016, arXiv:1605.07277. [Google Scholar] [CrossRef]
  28. Ito, K.; Johnson, L. The LJ Speech Dataset. Available online: https://keithito.com/LJ-Speech-Dataset/ (accessed on 25 May 2026).
  29. Yan, X.; Yi, J.; Tao, J.; Wang, C.; Ma, H.; Wang, T.; Wang, S.; Fu, R. An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio. In Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, Lisboa, Portugal, 14 October 2022; pp. 61–68. [Google Scholar]
  30. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
Table 2. Attack success rate (%) of STFT-magnitude-domain attacks across detectors and vocoders.

| Detector | Vocoder | Clean Acc. (%) | PGD | FGSM | BIM | CW |
|---|---|---|---|---|---|---|
| ResNet+LFCC | HiFi-GAN | 99.3 | 99.99 | 6.00 | 2.00 | 6.00 |
| ResNet+LFCC | Fullband MelGAN | 96.8 | 99.99 | 3.00 | 4.00 | 6.00 |
| ResNet+LFCC | StyleMelGAN | 98.7 | 100.00 | 4.00 | 2.00 | 0.00 |
| ResNet+LFCC | Parallel WaveGAN | 100.0 | 99.99 | 0.00 | 0.00 | 0.00 |
| LCNN+CQCC | HiFi-GAN | 96.0 | 0.00 | 7.00 | 8.00 | 6.00 |
| LCNN+CQCC | Fullband MelGAN | 95.0 | 0.00 | 27.00 | 15.00 | 10.00 |
| LCNN+CQCC | StyleMelGAN | 100.0 | 0.00 | 2.00 | 2.00 | 1.00 |
| LCNN+CQCC | Parallel WaveGAN | 100.0 | 15.00 | 0.00 | 0.00 | 0.00 |
| BiLSTM | HiFi-GAN | 99.79 | 0.00 | 1.91 | 0.74 | 1.38 |
| BiLSTM | Fullband MelGAN | 99.47 | 0.00 | 4.03 | 1.70 | 1.17 |
| BiLSTM | StyleMelGAN | 100.0 | 0.00 | 0.11 | 0.42 | 0.00 |
| BiLSTM | Parallel WaveGAN | 100.0 | 0.00 | 0.00 | 0.00 | 0.00 |
| AASIST | HiFi-GAN | 94.0 | 34.00 | 6.69 | 6.37 | 6.37 |
| AASIST | Fullband MelGAN | 93.0 | 45.00 | 8.60 | 8.07 | 8.28 |
| AASIST | StyleMelGAN | 97.0 | 26.00 | 3.40 | 3.08 | 3.29 |
| AASIST | Parallel WaveGAN | 94.0 | 53.00 | 6.79 | 6.37 | 6.69 |

Table 3. Attack success rate (%) of waveform-domain attacks across detectors and vocoders.

| Detector | Vocoder | Clean Acc. (%) | PGD | FGSM | BIM | CW |
|---|---|---|---|---|---|---|
| ResNet+LFCC | HiFi-GAN | 99.3 | 100.0 | 100.0 | 100.0 | 100.0 |
| ResNet+LFCC | Fullband MelGAN | 96.8 | 92.0 | 100.0 | 100.0 | 100.0 |
| ResNet+LFCC | StyleMelGAN | 98.7 | 100.0 | 1.0 | 100.0 | 100.0 |
| ResNet+LFCC | Parallel WaveGAN | 100.0 | 100.0 | 100.0 | 95.0 | 100.0 |
| LCNN+CQCC | HiFi-GAN | 96.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| LCNN+CQCC | Fullband MelGAN | 95.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| LCNN+CQCC | StyleMelGAN | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| LCNN+CQCC | Parallel WaveGAN | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| BiLSTM | HiFi-GAN | 99.79 | 99.79 | 100.0 | 100.0 | 100.0 |
| BiLSTM | Fullband MelGAN | 99.47 | 9.45 | 0.42 | 1.06 | 99.79 |
| BiLSTM | StyleMelGAN | 100.0 | 32.59 | 98.62 | 96.92 | 100.0 |
| BiLSTM | Parallel WaveGAN | 100.0 | 0.32 | 100.0 | 100.0 | 9.87 |
| AASIST | HiFi-GAN | 94.0 | 37.0 | 0.0 | 100.0 | 85.35 |
| AASIST | Fullband MelGAN | 93.0 | 39.0 | 0.0 | 100.0 | 89.60 |
| AASIST | StyleMelGAN | 97.0 | 24.0 | 0.0 | 100.0 | 71.66 |
| AASIST | Parallel WaveGAN | 94.0 | 38.0 | 0.0 | 100.0 | 87.69 |

Table 4. Average ASR (%) across vocoders for STFT-magnitude and waveform-domain attacks.

| Domain | Detector | PGD | FGSM | BIM | CW |
|---|---|---|---|---|---|
| STFT-mag | ResNet+LFCC | 99.99 | 3.25 | 2.00 | 3.00 |
| STFT-mag | LCNN+CQCC | 3.75 | 9.00 | 6.25 | 4.25 |
| STFT-mag | BiLSTM | 0.00 | 1.51 | 0.72 | 0.64 |
| STFT-mag | AASIST | 39.50 | 6.37 | 5.97 | 6.16 |
| Waveform | ResNet+LFCC | 98.00 | 75.25 | 98.75 | 100.00 |
| Waveform | LCNN+CQCC | 100.00 | 100.00 | 100.00 | 100.00 |
| Waveform | BiLSTM | 35.54 | 74.76 | 74.50 | 77.42 |
| Waveform | AASIST | 34.50 | 0.00 | 100.00 | 83.58 |

Table 5. Source-detector ASR (%) and black-box transfer ASR (%) averaged across vocoders. AASIST is the source detector, while ResNet+LFCC, LCNN+CQCC, and BiLSTM are black-box target detectors.

| Domain | Attack | AASIST ASR | ResNet+LFCC | LCNN+CQCC | BiLSTM | Avg. Black-box ASR |
|---|---|---|---|---|---|---|
| STFT-mag | PGD | 39.50 | 99.99 | 3.75 | 0.00 | 34.58 |
| STFT-mag | FGSM | 6.37 | 3.25 | 9.00 | 1.51 | 4.59 |
| STFT-mag | BIM | 5.97 | 2.00 | 6.25 | 0.72 | 2.99 |
| STFT-mag | CW | 6.16 | 3.00 | 4.25 | 0.64 | 2.63 |
| Waveform | PGD | 34.50 | 98.00 | 100.00 | 35.54 | 77.85 |
| Waveform | FGSM | 0.00 | 75.25 | 100.00 | 74.76 | 83.34 |
| Waveform | BIM | 100.00 | 98.75 | 100.00 | 74.50 | 91.08 |
| Waveform | CW | 83.58 | 100.00 | 100.00 | 77.42 | 92.47 |

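The "Avg. Black-box ASR" column in Table 5 is consistent with an unweighted mean over the three black-box detectors. The following Python sketch reproduces a few entries under that assumption; the averaging rule is inferred from the reported numbers rather than stated explicitly, and the function name `avg_black_box_asr` is purely illustrative.

```python
from statistics import mean

# ASR (%) per black-box detector, copied from Table 5.
# Order: ResNet+LFCC, LCNN+CQCC, BiLSTM.
TABLE5 = {
    ("STFT-mag", "PGD"): (99.99, 3.75, 0.00),
    ("Waveform", "FGSM"): (75.25, 100.00, 74.76),
    ("Waveform", "CW"): (100.00, 100.00, 77.42),
}

def avg_black_box_asr(asrs):
    """Unweighted mean ASR over the black-box detectors (assumed rule)."""
    return round(mean(asrs), 2)

for (domain, attack), asrs in TABLE5.items():
    print(domain, attack, avg_black_box_asr(asrs))
```

Under this rule, a single highly vulnerable detector can dominate the average: STFT-magnitude PGD scores 34.58% on average almost entirely because of ResNet+LFCC, so the per-detector columns remain the more informative view.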
Table 6. Whisper WER (%) of clean and adversarial audio across vocoders.

| Domain | Attack | HiFi-GAN | Fullband MelGAN | StyleMelGAN | Parallel WaveGAN | Avg. WER |
|---|---|---|---|---|---|---|
| Clean | None | 3.74 | 3.99 | 4.12 | 3.97 | 3.96 |
| STFT-mag | FGSM | 3.72 | 3.97 | 4.03 | 3.99 | 3.93 |
| STFT-mag | BIM | 3.72 | 3.98 | 4.02 | 4.06 | 3.95 |
| STFT-mag | CW | 3.74 | 4.01 | 4.05 | 4.01 | 3.95 |
| STFT-mag | PGD | 4.65 | 5.50 | 5.51 | 5.28 | 5.24 |
| Waveform | FGSM | 16.28 | 17.83 | 16.05 | 19.94 | 17.53 |
| Waveform | BIM | 11.77 | 13.92 | 12.35 | 13.30 | 12.84 |
| Waveform | CW | 3.85 | 4.18 | 4.17 | 4.09 | 4.07 |
| Waveform | PGD | 3.84 | 4.12 | 4.16 | 3.98 | 4.03 |

Table 7. Signal-to-noise ratio (SNR, dB) of adversarial audio across vocoders. Values are reported as mean ± standard deviation.

| Domain | Vocoder | FGSM | BIM | PGD | CW |
|---|---|---|---|---|---|
| STFT-mag | HiFi-GAN | 41.69 ± 1.24 | 46.31 ± 1.24 | 11.00 ± 1.28 | 49.40 ± 1.29 |
| STFT-mag | Fullband MelGAN | 41.80 ± 1.27 | 46.50 ± 1.27 | 10.90 ± 1.28 | 49.50 ± 1.33 |
| STFT-mag | StyleMelGAN | 42.80 ± 1.29 | 47.40 ± 1.28 | 11.90 ± 1.25 | 50.48 ± 1.37 |
| STFT-mag | Parallel WaveGAN | 42.00 ± 1.28 | 46.50 ± 1.28 | 11.00 ± 1.28 | 49.70 ± 1.34 |
| Waveform | HiFi-GAN | 2.95 ± 1.22 | 6.20 ± 1.23 | 32.90 ± 1.22 | 23.90 ± 1.24 |
| Waveform | Fullband MelGAN | 3.13 ± 1.26 | 6.39 ± 1.27 | 33.10 ± 1.26 | 24.10 ± 1.28 |
| Waveform | StyleMelGAN | 4.09 ± 1.25 | 7.35 ± 1.28 | 34.10 ± 1.21 | 25.00 ± 1.23 |
| Waveform | Parallel WaveGAN | 3.16 ± 1.27 | 6.42 ± 1.28 | 33.20 ± 1.26 | 24.10 ± 1.28 |

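For readers who want to relate the SNR values in Table 7 to perturbation size, the sketch below computes SNR in dB under the conventional definition, treating the difference between the adversarial and clean waveforms as noise. This is an assumption: the exact computation used for Table 7 may differ (for example in framing or normalization), and the function name `snr_db` and the toy signal are hypothetical.

```python
import math

def snr_db(clean, adversarial):
    """SNR in dB, treating (adversarial - clean) as the noise signal."""
    if len(clean) != len(adversarial):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in clean)
    noise_power = sum((a - x) ** 2 for x, a in zip(clean, adversarial))
    if noise_power == 0.0:
        return math.inf   # identical signals: no perturbation
    if signal_power == 0.0:
        return -math.inf  # silent reference: any perturbation dominates
    return 10.0 * math.log10(signal_power / noise_power)

# Toy example (not data from this study): one second of a 440 Hz tone at
# 16 kHz, perturbed by +/-0.01 on alternating samples.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
adv = [x + (0.01 if n % 2 == 0 else -0.01) for n, x in enumerate(clean)]
print(f"{snr_db(clean, adv):.2f} dB")
```

Because the scale is logarithmic, every 10 dB drop corresponds to a tenfold increase in noise power relative to the signal. Under this definition, the roughly 30 dB gap between waveform-domain PGD and FGSM in Table 7 would correspond to about a thousandfold difference in relative perturbation power.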
Table 8. Average WER and SNR across vocoders.

| Domain | Attack | Avg. WER (%) | Δ WER | Avg. SNR (dB) |
|---|---|---|---|---|
| STFT-mag | FGSM | 3.93 | 0.03 | 42.07 |
| STFT-mag | BIM | 3.95 | 0.01 | 46.68 |
| STFT-mag | PGD | 5.24 | 1.28 | 11.20 |
| STFT-mag | CW | 3.95 | 0.00 | 49.77 |
| Waveform | FGSM | 17.53 | 13.57 | 3.33 |
| Waveform | BIM | 12.84 | 8.88 | 6.59 |
| Waveform | PGD | 4.03 | 0.07 | 33.33 |
| Waveform | CW | 4.07 | 0.12 | 24.28 |

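The Δ WER column appears to be the change in average WER relative to the clean baseline, computed from the per-vocoder values in Table 6 before rounding. The Python sketch below reproduces several entries under that assumption; the rule is inferred from the numbers, and the name `delta_wer` is illustrative. The sketch returns a signed difference, so STFT-magnitude FGSM comes out slightly negative (-0.03), whereas Table 8 lists the magnitude (0.03).

```python
from statistics import mean

# Per-vocoder Whisper WER (%) copied from Table 6.
# Order: HiFi-GAN, Fullband MelGAN, StyleMelGAN, Parallel WaveGAN.
CLEAN_WER = (3.74, 3.99, 4.12, 3.97)
ATTACK_WER = {
    ("STFT-mag", "FGSM"): (3.72, 3.97, 4.03, 3.99),
    ("STFT-mag", "PGD"): (4.65, 5.50, 5.51, 5.28),
    ("Waveform", "FGSM"): (16.28, 17.83, 16.05, 19.94),
    ("Waveform", "CW"): (3.85, 4.18, 4.17, 4.09),
}

def delta_wer(attack_wer, clean_wer=CLEAN_WER):
    """Signed change in average WER (percentage points) versus clean audio."""
    return round(mean(attack_wer) - mean(clean_wer), 2)

for (domain, attack), wer in ATTACK_WER.items():
    print(domain, attack, delta_wer(wer))
```

Averaging before rounding matters here: subtracting the rounded averages for waveform-domain CW (4.07 - 3.96) gives 0.11, while the unrounded means give the 0.12 reported in Table 8.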
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.