Evaluating Adversarial Robustness of Deepfake Audio Detectors and Vocoder Fingerprint Detectors Against Universal Adversarial Perturbations

Quang Minh Tran; Wei Zong; Yang-Wai Chow; Willy Susilo

doi:10.20944/preprints202606.0272.v1

Submitted: 02 June 2026

Posted: 03 June 2026


Abstract

Audio deepfake and vocoder fingerprint detectors are increasingly used to identify synthetic speech and attribute it to its generating model. However, their robustness against adversarial perturbations remains unclear across different attack algorithms, perturbation domains, and detector representations. This paper presents a comparative study of four adversarial attacks, namely Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and Carlini–Wagner (CW), against audio deepfake and vocoder fingerprint detectors. Each attack is implemented in both waveform-domain and STFT-magnitude-domain settings. All attacks are optimized against AASIST using a targeted fake-to-real objective and are evaluated on synthetic speech generated by HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN. Black-box transferability is assessed across multiple detector families, including AASIST, ResNet with LFCC features, LCNN with CQCC features, and a BiLSTM-based detector. The results show that adversarial effectiveness depends strongly on perturbation domain and detector representation. STFT-magnitude PGD transfers strongly to LFCC-based ResNet detectors but has limited effect on CQCC-based and recurrent detectors. In contrast, waveform-domain attacks produce broader transferability, although FGSM and BIM substantially degrade audio quality. To distinguish effective adversarial perturbations from destructive signal degradation, we evaluate audio quality and intelligibility using word error rate (WER) and signal-to-noise ratio (SNR). Overall, the findings show that robustness claims for audio deepfake detectors are incomplete unless they are tested against adversarial perturbations across attack algorithms, perturbation domains, and detector representations.

Keywords: adversarial attacks; audio deepfake detection; vocoder fingerprints; black-box transferability; STFT-magnitude perturbation; waveform-domain perturbation; speech anti-spoofing

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Neural vocoders have become a central component of modern speech synthesis systems, enabling the generation of highly natural speech waveforms from intermediate acoustic representations such as mel-spectrograms. Models such as HiFi-GAN, MelGAN variants, StyleMelGAN, and Parallel WaveGAN [1,2,3,4] can produce synthetic speech with high perceptual quality, making synthetic audio increasingly difficult to distinguish from genuine recordings. While these systems support legitimate applications in text-to-speech and voice conversion, they also increase the risk of audio deepfakes being used for impersonation, misinformation, and fraud.

To address this risk, audio deepfake detection methods aim to distinguish synthetic speech from real speech. Beyond binary real-versus-fake detection, a related forensic task is vocoder fingerprint detection, where the objective is to identify the vocoder or synthesis model responsible for generating a speech sample. The motivation behind vocoder fingerprinting is that different neural vocoders may leave model-specific artifacts in the generated waveform [29]. These artifacts can appear in spectral structure, cepstral patterns, temporal dynamics, or other signal characteristics learned by the detector. If reliable, vocoder fingerprints can support both synthetic speech detection and source attribution.

However, the reliability of detecting vocoder fingerprints depends on whether these artifacts remain stable under perturbation. A detector may achieve high accuracy on clean synthetic speech but fail when the input is intentionally modified to disrupt the fingerprint cues used for classification. This is particularly important in adversarial settings, where an attacker may attempt to preserve the perceived speech content while causing the detector to misclassify the sample. Therefore, evaluating clean accuracy alone is insufficient for assessing the practical robustness of vocoder fingerprint detectors.

Adversarial attacks provide a direct way to examine this vulnerability [15,16]. In audio, adversarial perturbations can be applied in different domains. Waveform-domain attacks modify the raw audio samples directly, while frequency-domain attacks modify a time-frequency representation such as the STFT magnitude before reconstructing the waveform. These perturbation domains may interact differently with detector architectures. For example, waveform-based detectors such as AASIST [14] may rely on raw temporal cues, while feature-based detectors using LFCC or CQCC [12,13] representations may be more sensitive to spectral and cepstral distortions. As a result, an attack that is effective against one detector may not be able to transfer to another.

This transferability problem is central to realistic adversarial evaluation [17]. In practice, an attacker may not know the exact detector used by a forensic system. If adversarial audio optimized against one detector also fools other detectors, this indicates a broader weakness across detector families. Conversely, if transferability is limited, this suggests that robustness may depend strongly on the detector representation. Existing evaluations [17] often focus on a single detector, feature type, attack algorithm, or perturbation domain, making it difficult to determine whether observed vulnerabilities generalize across vocoder fingerprint detectors.

This paper presents a comparative study of adversarial attacks against audio deepfake and vocoder fingerprint detectors. We evaluate four representative attacks: Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and Carlini–Wagner (CW). Each attack is implemented in both waveform-domain and STFT-magnitude-domain settings. All attacks are optimized against AASIST using a targeted fake-to-real objective, and the resulting adversarial audio is evaluated for black-box transferability across multiple detector families, including ResNet with LFCC features, LCNN with CQCC features, a BiLSTM-based detector, and AASIST itself. Experiments are conducted on synthetic speech generated by HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN.

The contributions of this paper are threefold. First, we provide a controlled comparison of FGSM, BIM, PGD, and CW attacks under a unified AASIST-source optimization setting. Second, we compare waveform-domain and STFT-magnitude-domain perturbations to examine how perturbation domain affects attack success and black-box transferability. Third, we evaluate transferability across detectors with different input representations and architectures, while also measuring audio quality and intelligibility using word error rate and signal-to-noise ratio. Through this analysis, we show that adversarial robustness in vocoder fingerprint detection depends jointly on attack algorithm, perturbation domain, and detector representation.

2. Related Work

2.1. Audio Deepfake Detection and Vocoder Fingerprints

Audio deepfake detection aims to determine whether a speech signal is genuine or synthetically generated. Existing detectors commonly approach this task either as binary real-versus-fake classification or as a more fine-grained source attribution problem. In the binary setting, the detector only decides whether an utterance is real or synthetic [14,19]. In the source attribution setting, the detector attempts to identify the synthesis model or vocoder that produced the speech. The latter task is closely related to vocoder fingerprint detection.

Vocoder fingerprints refer to model-specific traces that remain in generated speech after waveform synthesis [19,20,21,29]. These traces may arise from the architecture, training objective, upsampling strategy, spectral reconstruction loss, or adversarial training procedure used by a vocoder. For example, GAN-based vocoders such as HiFi-GAN, MelGAN variants, StyleMelGAN, and Parallel WaveGAN reconstruct waveform details using different generator and discriminator designs. As a result, their outputs may contain distinguishable spectral or temporal artifacts that can be exploited by fingerprint detectors.

The existence of vocoder fingerprints makes source attribution possible, but it also raises an important robustness question. If the detector relies on subtle synthesis artifacts, then small adversarial perturbations may be able to disrupt those cues while preserving the apparent speech content. Therefore, clean classification accuracy alone does not establish that vocoder fingerprints are stable or reliable in adversarial settings. This motivates evaluating whether fingerprint detectors remain effective when synthetic audio is deliberately perturbed.

2.2. Detector Representations for Synthetic Speech Detection

Audio deepfake detectors differ in the representations they use to analyze speech. Feature-based detectors first transform the waveform into an acoustic feature representation and then classify the extracted features. Common front-end features include mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCCs), and constant-Q cepstral coefficients (CQCCs). These features compress spectral information into lower-dimensional representations that can be used by convolutional or recurrent classifiers.

LFCC and CQCC features are particularly relevant to vocoder fingerprint detection because they capture different frequency structures. LFCCs [12] use linearly spaced filters, which can preserve high-frequency and narrowband artifacts that may be useful for detecting synthetic speech. CQCCs [13] are based on the constant-Q transform and use a logarithmic frequency structure, which may emphasize different spectral patterns. As a result, perturbations that disrupt LFCC-based detectors may not necessarily transfer to CQCC-based detectors.

End-to-end detectors follow a different approach by operating directly on raw waveform inputs or minimally processed representations. These models learn task-specific features from the signal rather than relying on a fixed cepstral front end. AASIST is an example of this direction and is designed for speech anti-spoofing using raw waveform input. Compared with fixed-feature detectors, end-to-end detectors may exploit temporal, spectral, or phase-related cues that are not explicitly represented in hand-crafted features.

These differences in detector representation are central to adversarial transferability. An adversarial perturbation optimized against a raw-waveform detector may not affect a cepstral-feature detector in the same way, and an attack that disrupts LFCC features may not generalize to CQCC features or recurrent temporal models. For this reason, the present study evaluates multiple detector families, including AASIST, ResNet+LFCC, LCNN+CQCC, and a BiLSTM-based detector.

2.3. Adversarial Attacks and Transferability

Adversarial attacks modify an input signal to cause a machine learning model to make an incorrect prediction [16,22]. In audio deepfake detection, this means perturbing a speech waveform so that a detector fails to recognize it as synthetic or fails to attribute it to the correct vocoder. Such attacks are especially concerning when the perturbation preserves speech intelligibility while disrupting detector-relevant artifacts.

Gradient-based attacks provide a common framework for evaluating model robustness. FGSM [23] applies a single gradient-sign update, while BIM [26] extends this idea through multiple iterative updates. PGD [24] further strengthens the iterative attack by using random initialization and projection within a bounded perturbation region. CW [25] uses an optimization-based objective that directly manipulates the model logits while balancing classification success and perturbation size. These attacks differ in computational cost, optimization strength, and sensitivity to hyperparameters.

In the audio domain, adversarial perturbations can be applied in different signal domains. Waveform-domain attacks directly modify the audio samples, while frequency-domain attacks modify spectral representations such as STFT magnitude before reconstructing the waveform. This distinction is important because detectors do not all rely on the same signal evidence. Raw-waveform models may be affected by temporal or phase-related perturbations, while cepstral-feature detectors may be more sensitive to spectral magnitude changes.

Black-box transferability [27] is also important for practical evaluation. A perturbation optimized against one source detector may or may not transfer to other detectors with different representations and architectures. If transferability is high, the attack reveals a broader weakness across detector families. If transferability is limited, the vulnerability may be specific to the source model or feature representation. Existing studies often focus on a single attack setting or detector type, leaving the relationship between attack algorithm, perturbation domain, and detector representation insufficiently understood.

This work addresses that gap by comparing FGSM, BIM, PGD, and CW in both waveform and STFT-magnitude domains. All attacks are optimized against AASIST using a targeted fake-to-real objective and then evaluated across multiple vocoder fingerprint detectors. This design allows the study to examine not only source-detector attack success, but also cross-detector and cross-vocoder transferability.

3. Methodology

3.1. Overview

This study compares the robustness of audio deepfake and vocoder fingerprint detectors under multiple adversarial attack settings. Four attacks are evaluated: FGSM, BIM, PGD, and CW. Each attack is implemented in two perturbation domains: the waveform domain and the STFT-magnitude domain.

All attacks are optimized against AASIST using a targeted fake-to-real objective. The learned adversarial perturbations are then applied to synthetic speech generated by multiple vocoders and evaluated on both the source detector and independent target detectors. This design allows us to examine source-detector attack success, black-box transferability, and the effect of perturbation domain on detector robustness.

3.2. Problem Formulation

Let $x \in [-1, 1]^{T}$ denote a clean synthetic speech waveform of length $T$. The source detector is denoted as $f_{s}(\cdot)$, where $f_{s}$ corresponds to AASIST in this study. AASIST is treated as a binary classifier with two classes:

$$0 = \text{fake}, \qquad 1 = \text{real}. \tag{1}$$

For a synthetic input $x$, the true label is $y = 0$, and the target adversarial label is $y_{t} = 1$. The objective is to generate an adversarial waveform $x_{adv}$ such that the source detector predicts the target class while the audio remains close to the original waveform:

$$x_{adv} = A(x; \delta), \tag{2}$$

where $A(\cdot)$ denotes the attack transformation and $\delta$ denotes the adversarial perturbation. The form of $\delta$ depends on the perturbation domain. In the waveform domain, $\delta$ is added directly to the raw audio samples. In the STFT-magnitude domain, $\delta$ is applied to the magnitude spectrum before waveform reconstruction.

3.3. Universal Perturbation Setting

All attacks are implemented as universal perturbations. Instead of optimizing a separate perturbation for each utterance, each attack learns one reusable perturbation from a source vocoder dataset:

$$D_{src} = \{x_{1}, x_{2}, \dots, x_{N}\}. \tag{3}$$

The universal perturbation is optimized as:

$$\delta^{*} = \arg\min_{\delta \in S} \frac{1}{N} \sum_{i=1}^{N} L(f_{s}(A(x_{i}; \delta)), y_{t}), \tag{4}$$

where $\delta^{*}$ is the optimized perturbation, $S$ is the allowed perturbation set, and $L(\cdot)$ is the attack loss. For FGSM, BIM, and PGD, cross-entropy loss toward the target real class is used. For CW, a logit-margin loss is used.

After optimization, the same perturbation $\delta^{*}$ is applied to evaluation audio without re-optimization:

$$x_{adv} = A(x; \delta^{*}). \tag{5}$$

This universal setting creates a stricter transferability evaluation than per-sample optimization because the perturbation must generalize across utterances and vocoders.

3.4. Perturbation Domains

3.4.1. Waveform-Domain Perturbation

In the waveform-domain setting, the perturbation is added directly to the raw waveform:

$$x_{adv} = \mathrm{clip}(x + \delta_{w}, -1, 1), \tag{6}$$

where $\delta_{w}$ is the waveform-domain perturbation. The clipping operation ensures that the adversarial waveform remains within the normalized amplitude range. The perturbation is constrained by an $\ell_{\infty}$ bound:

$$\lVert \delta_{w} \rVert_{\infty} \leq \epsilon. \tag{7}$$

This setting evaluates whether direct sample-level perturbations can disrupt detectors that operate on raw waveforms or on features extracted from waveforms.
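The clipping and bound above can be illustrated with a minimal sketch in plain Python. The helper names `project_linf` and `apply_waveform_perturbation` are hypothetical and are not taken from the paper's implementation; the budget value 0.003 is one of the two budgets listed in Table 1.

```python
def project_linf(delta, eps):
    """Clamp every perturbation sample into [-eps, eps] (the l-infinity ball)."""
    return [max(-eps, min(eps, d)) for d in delta]


def apply_waveform_perturbation(x, delta, eps):
    """Return clip(x + delta, -1, 1) after projecting delta onto the l-infinity ball."""
    delta = project_linf(delta, eps)
    return [max(-1.0, min(1.0, xi + di)) for xi, di in zip(x, delta)]


# Toy four-sample "waveform"; two samples sit close to the amplitude limits.
x = [0.5, -0.999, 0.0, 0.998]
delta = [0.01, -0.01, 0.002, 0.01]   # deliberately larger than the budget
x_adv = apply_waveform_perturbation(x, delta, eps=0.003)
print(x_adv)
```

Because clipping can only pull a sample back toward the valid range, the effective change per sample never exceeds the budget in this sketch.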

3.4.2. STFT-Magnitude-Domain Perturbation

In the STFT-magnitude-domain setting, the perturbation is applied to the magnitude spectrum. Given a waveform $x$, its short-time Fourier transform is:

$$X = \mathrm{STFT}(x). \tag{8}$$

The complex STFT is decomposed into magnitude and phase:

$$M = |X|, \qquad P = \frac{X}{|X| + \eta}, \tag{9}$$

where $\eta$ is a small constant used to avoid division by zero. The adversarial magnitude is computed as:

$$M_{adv} = M + \delta_{M}, \tag{10}$$

where $\delta_{M}$ is the STFT-magnitude perturbation. The adversarial waveform is reconstructed using the original phase:

$$x_{adv} = \mathrm{ISTFT}(M_{adv} \cdot P). \tag{11}$$

Since STFT magnitudes are not bounded within a fixed amplitude range, a relative perturbation budget is used:

$$|\delta_{M}| \leq \epsilon_{rel} \cdot M_{ref}, \tag{12}$$

where $M_{ref}$ is a reference magnitude template computed from the source training set, and $\epsilon_{rel}$ controls the perturbation strength.
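The magnitude and phase handling can be sketched on a handful of complex STFT bins, leaving out the transform itself. This is an illustrative sketch, not the paper's code: `perturb_bins` and the value of `ETA` are hypothetical, the budget is applied element-wise, and no clamping of negative magnitudes is performed because the source does not say whether one is used.

```python
import cmath

ETA = 1e-8  # assumed small constant; stands in for the eta of the phase definition


def perturb_bins(X, delta_M, M_ref, eps_rel):
    """Perturb complex STFT bins in magnitude only, reusing the original phase."""
    out = []
    for Xk, dk, rk in zip(X, delta_M, M_ref):
        M = abs(Xk)                        # magnitude of this bin
        P = Xk / (M + ETA)                 # (near) unit-modulus phase term
        bound = eps_rel * rk               # relative budget for this bin
        dk = max(-bound, min(bound, dk))   # project onto |delta_M| <= eps_rel * M_ref
        out.append((M + dk) * P)           # M_adv * P
    return out


X = [3 + 4j, 0j, -2j]
delta_M = [1.0, 0.5, -0.01]     # the first two entries exceed their budgets
M_ref = [5.0, 1.0, 2.0]
X_adv = perturb_bins(X, delta_M, M_ref, eps_rel=0.05)

for before, after in zip(X, X_adv):
    print(abs(before), abs(after), cmath.phase(before), cmath.phase(after))
```

One consequence of this phase definition is visible in the second bin: a bin with zero energy has $P = 0$, so it stays at zero regardless of the magnitude perturbation.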

3.5. Attack Algorithms

The four attacks differ in how they optimize the universal perturbation.

FGSM performs a single targeted gradient-sign update:

$$\delta^{*} = -\epsilon \cdot \mathrm{sign}(\nabla_{\delta} L_{CE}(f_{s}(A(x; \delta)), y_{t})). \tag{13}$$

BIM extends FGSM by applying iterative projected updates:

$$\delta^{k+1} = \Pi_{S}(\delta^{k} - \alpha \cdot \mathrm{sign}(\nabla_{\delta} L_{CE}(f_{s}(A(x; \delta^{k})), y_{t}))), \tag{14}$$

where $\alpha$ is the step size and $\Pi_{S}(\cdot)$ projects the perturbation back into the allowed set $S$.

PGD follows the same iterative update rule as BIM but starts from a random initialization within the allowed perturbation set:

$$\delta^{0} \sim U(S). \tag{15}$$

CW is implemented as an optimization-based targeted attack using a logit-margin objective. Let $z_{fake}$ and $z_{real}$ denote the fake and real logits produced by AASIST. The CW loss is:

$$L_{CW} = \max(z_{fake} - z_{real} + \kappa, 0), \tag{16}$$

where $\kappa$ is a confidence margin. The total CW objective is:

$$L_{total} = D(\delta) + c \cdot L_{CW}, \tag{17}$$

where $D(\delta)$ penalizes perturbation size and $c$ controls the weight of the classification loss.
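To make the update rules concrete, the following sketch runs universal targeted FGSM, BIM, and PGD against a toy linear detector whose real-versus-fake score is $w \cdot x + b$. Everything here is an assumption made for illustration: the linear detector stands in for AASIST, the gradient is derived analytically instead of by automatic differentiation, waveform clipping is omitted, and the budget of 0.5 is far larger than the budgets in Table 1. The CW margin loss is included as a one-line function.

```python
import math
import random


def grad_ce_to_real(w, b, batch, delta):
    """Mean gradient w.r.t. delta of -log(sigmoid(w.(x + delta) + b)) over the batch."""
    g = [0.0] * len(w)
    for x in batch:
        s = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta)) + b
        coeff = -(1.0 - 1.0 / (1.0 + math.exp(-s)))   # dLoss/dscore, always negative
        for j, wj in enumerate(w):
            g[j] += coeff * wj / len(batch)
    return g


def sign(v):
    return (v > 0) - (v < 0)


def universal_attack(w, b, batch, eps, alpha, steps, random_init=False, seed=0):
    """Projected sign-gradient descent on one perturbation shared by the whole batch."""
    rng = random.Random(seed)
    delta = [rng.uniform(-eps, eps) if random_init else 0.0 for _ in w]
    for _ in range(steps):
        g = grad_ce_to_real(w, b, batch, delta)
        # Targeted step (minus sign), then projection onto the l-infinity ball.
        delta = [max(-eps, min(eps, d - alpha * sign(gj))) for d, gj in zip(delta, g)]
    return delta


def cw_margin_loss(z_fake, z_real, kappa=0.0):
    """Logit-margin loss: zero once the real logit leads by at least kappa."""
    return max(z_fake - z_real + kappa, 0.0)


def score(w, b, x, delta):
    return sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta)) + b


w, b = [1.0, -2.0, 0.5], -1.0
batch = [[0.1, 0.2, -0.1], [0.0, 0.3, 0.2]]   # both score below 0, i.e. "fake"
eps = 0.5

fgsm = universal_attack(w, b, batch, eps, alpha=eps, steps=1)
bim = universal_attack(w, b, batch, eps, alpha=0.05, steps=20)
pgd = universal_attack(w, b, batch, eps, alpha=0.05, steps=40, random_init=True)
print(fgsm, bim, pgd)
```

On a linear detector the gradient sign never changes, so all three attacks end at the same corner of the budget. Differences between single-step, iterative, and randomly initialized attacks only appear on nonlinear detectors.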

3.6. Transferability Evaluation

The optimized adversarial audio is evaluated under two criteria: source-detector evasion and black-box transferability. For AASIST, attack success is defined as fake audio being classified as real:

$$\mathrm{ASR}_{AASIST} = \frac{N_{fake \to real}}{N}, \tag{18}$$

where $N_{fake \to real}$ is the number of fake samples predicted as real, and $N$ is the total number of evaluated samples.

For multi-class vocoder attribution detectors, attack success is defined as incorrect vocoder attribution:

$$\mathrm{ASR}_{multi} = 1 - \frac{N_{correct}}{N}, \tag{19}$$

where $N_{correct}$ is the number of attacked samples still classified as their true vocoder class. This definition treats any incorrect vocoder prediction as a successful disruption of source attribution.
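Both success rates reduce to simple counting. A minimal sketch follows; the function names and the string class labels are hypothetical and chosen only for the example.

```python
def asr_binary(preds, real_label=1):
    """Fraction of fake inputs that the binary detector labels as real."""
    return sum(p == real_label for p in preds) / len(preds)


def asr_multiclass(preds, true_labels):
    """One minus the fraction of attacked samples still attributed correctly."""
    correct = sum(p == t for p, t in zip(preds, true_labels))
    return 1.0 - correct / len(true_labels)


# Four attacked fake samples scored by a binary detector (0 = fake, 1 = real).
binary_asr = asr_binary([1, 0, 1, 1])

# Four attacked HiFi-GAN samples scored by an attribution detector.
multi_asr = asr_multiclass(
    ["hifigan", "pwg", "hifigan", "stylemelgan"],
    ["hifigan", "hifigan", "hifigan", "hifigan"],
)
print(binary_asr, multi_asr)
```

The two numbers measure different failures: the binary rate counts only fake-to-real flips, while the multi-class rate counts any wrong vocoder label.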

3.7. Audio Quality and Intelligibility Metrics

Attack success alone is insufficient because a detector may fail when the audio is heavily degraded. Therefore, adversarial effectiveness is evaluated together with speech intelligibility and signal distortion.

Word error rate (WER) is used to measure transcription-level intelligibility:

$$\mathrm{WER} = \frac{S + D + I}{N_{w}}, \tag{20}$$

where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions, respectively, and $N_{w}$ is the number of words in the reference transcript. The WER increase is defined as:

$$\Delta \mathrm{WER} = \mathrm{WER}_{adv} - \mathrm{WER}_{clean}. \tag{21}$$
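WER is normally obtained from a word-level edit distance, since the minimum number of substitutions, deletions, and insertions is the Levenshtein distance between the two word sequences. The sketch below is a plain-Python illustration with made-up transcripts. It assumes whitespace tokenization with no text normalization and a non-empty reference; the source does not describe its normalization.

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over whitespace-separated words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))       # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]                          # i deletions against an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (r != h),    # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps"
wer_clean = wer(reference, "the quick brown fox jumps")
wer_adv = wer(reference, "the quick brown box")   # one substitution, one deletion
delta_wer = wer_adv - wer_clean
print(wer_clean, wer_adv, delta_wer)
```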

Signal-to-noise ratio (SNR) is used to measure signal-level distortion between clean and adversarial waveforms:

$$\mathrm{SNR} = 10 \log_{10}\left(\frac{\lVert x \rVert_{2}^{2}}{\lVert x - x_{adv} \rVert_{2}^{2} + \eta}\right), \tag{22}$$

where $\eta$ is a small constant used to avoid division by zero. Higher SNR indicates that the adversarial audio remains closer to the clean waveform.
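The SNR definition translates directly into code. In this sketch the function name `snr_db` and the default value of `eta` are assumptions; the toy signal is chosen so that the noise energy is one hundredth of the signal energy, which corresponds to about 20 dB.

```python
import math


def snr_db(x, x_adv, eta=1e-12):
    """SNR in dB between a clean waveform and its adversarial version."""
    signal = sum(v * v for v in x)
    noise = sum((a - b) ** 2 for a, b in zip(x, x_adv))
    return 10.0 * math.log10(signal / (noise + eta))


x = [0.1] * 100                     # constant toy "waveform"
x_adv = [v + 0.01 for v in x]       # perturbation at one tenth of the amplitude
snr = snr_db(x, x_adv)
print(round(snr, 2))
```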

4. Experimental Setup

4.1. Dataset and Vocoders

Experiments are conducted using synthetic speech derived from the LJSpeech corpus [28]. LJSpeech contains approximately 13,100 English utterances from a single female speaker recorded at 22.05 kHz. This provides a controlled single-speaker setting for evaluating vocoder-generated speech and adversarial perturbations.

Four neural vocoders are evaluated: HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN. HiFi-GAN is used as the source vocoder for adversarial optimization, while all four vocoders are used during evaluation. This setup allows the experiment to measure both same-vocoder performance and cross-vocoder transferability. After an attack is optimized on HiFi-GAN-generated speech, the resulting universal perturbation is applied to all vocoder outputs without re-optimization.

All adversarial audio is generated at 22.05 kHz. For detectors requiring a different sampling rate, such as AASIST, audio is resampled internally during evaluation. The attacks are applied after waveform synthesis, and no vocoder model weights are modified.

4.2. Detector Configuration

Four detector configurations are used: AASIST, ResNet+LFCC, LCNN+CQCC, and BiLSTM. These detectors are selected to cover different input representations and model architectures.

AASIST is used as the source detector for attack optimization. It is treated as a binary fake-versus-real classifier, where the target attack objective is to classify synthetic speech as real. The remaining detectors are used to evaluate black-box transferability.

The ResNet+LFCC detector is a multi-class vocoder attribution model referenced from baseline work [29]. Each waveform is converted into linear frequency cepstral coefficient (LFCC) features before classification. The LCNN+CQCC detector follows the same multi-class attribution setting but uses constant-Q cepstral coefficient (CQCC) features with a light convolutional neural network. The BiLSTM detector is included as a temporal sequence-based model to test whether attacks optimized against AASIST transfer to recurrent architectures.

For the multi-class detectors, the true label corresponds to the source vocoder of the evaluated sample. An attack is considered successful if the detector no longer predicts the correct vocoder class. For AASIST, an attack is considered successful if synthetic speech is classified as real.

4.3. Attack Configuration

Four adversarial attacks are evaluated: FGSM, BIM, PGD, and CW. Each attack is implemented in both waveform-domain and STFT-magnitude-domain settings. All attacks are optimized against AASIST using a targeted fake-to-real objective and saved as universal perturbation checkpoints.

Table 1. Attack hyperparameters used for waveform-domain and STFT-magnitude-domain attacks.

Parameter	Waveform domain	STFT-magnitude domain
Sampling rate	22,050 Hz	22,050 Hz
Chunk length	4 s / 88,200 samples	4 s / 88,200 samples
STFT $n_{fft}$	–	1024
STFT hop length	–	256
STFT window length	–	1024
Perturbation budget	$ϵ = 0.003$ / $0.006$	$ϵ_{rel} = 0.03$ / $0.05$
BIM step size	$α = 0.0003$	$α_{rel} = 0.003$
PGD step size	$α = 0.0003$	$α_{rel} = 0.003$
BIM/PGD steps	5000	5000
Batch size	8	8
CW steps	5000	5000
CW learning rate	$0.001$	$0.001$
CW confidence margin	$κ = 0$	$κ = 0$
CW loss weight	$c = 1.0$	$c = 1.0$

For waveform-domain attacks, the perturbation is added directly to the audio waveform. The universal perturbation is optimized over fixed-length chunks of 4 seconds. Since the sampling rate is 22,050 Hz, each waveform perturbation contains

$$T_{\delta} = 4 \times 22{,}050 = 88{,}200 \tag{23}$$

samples. The adversarial waveform is clipped to the valid amplitude range $[-1, 1]$ after perturbation.

For STFT-magnitude-domain attacks, the perturbation is applied to the magnitude spectrum while preserving the original phase for inverse STFT reconstruction. The STFT uses $n_{fft} = 1024$, hop length 256, and window length 1024. The magnitude perturbation is constrained using a relative perturbation budget based on a reference magnitude template computed from the source data.
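These STFT settings fix the size of the magnitude spectrogram for one 4-second chunk. The arithmetic below is a sketch under stated assumptions: the frame count depends on the padding convention, which the source does not specify, so both the centered (padded) and uncentered counts are shown. Whether the universal magnitude perturbation has exactly this shape is also an assumption.

```python
SAMPLE_RATE = 22050
CHUNK_SECONDS = 4
N_FFT = 1024
HOP = 256

chunk_samples = SAMPLE_RATE * CHUNK_SECONDS          # samples per chunk
freq_bins = N_FFT // 2 + 1                           # one-sided spectrum

# Centered framing pads n_fft // 2 samples on each side of the chunk.
frames_centered = 1 + chunk_samples // HOP
# Uncentered framing only uses windows that fit entirely inside the chunk.
frames_uncentered = 1 + (chunk_samples - N_FFT) // HOP

print(chunk_samples, freq_bins, frames_centered, frames_uncentered)
print(freq_bins * frames_centered)                   # entries in a centered spectrogram
```

Under the centered convention a single chunk yields a 513 by 345 magnitude array, roughly twice as many values as the 88,200 waveform samples it was computed from.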

FGSM is implemented as a one-step universal attack. BIM and PGD are iterative projected-gradient attacks, with PGD using random initialization within the allowed perturbation range. CW is implemented as an optimization-based attack using a targeted logit-margin objective and Adam optimization. During iterative attack optimization, random cropping is used so that the universal perturbation is exposed to different temporal regions of the source utterances.
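Random cropping can be sketched as follows. The helper `random_crop` is hypothetical, and right-padding short utterances with zeros is an assumption, since the source does not say how clips shorter than the chunk length are handled.

```python
import random


def random_crop(x, length, rng):
    """Return a random contiguous window of `length` samples from waveform x."""
    if len(x) <= length:
        # Assumption: utterances shorter than the chunk are zero-padded on the right.
        return list(x) + [0.0] * (length - len(x))
    start = rng.randrange(len(x) - length + 1)
    return x[start:start + length]


rng = random.Random(0)
utterance = [float(i) for i in range(10)]   # sample values double as positions
crop = random_crop(utterance, 4, rng)
padded = random_crop([1.0, 2.0], 4, rng)
print(crop, padded)
```

Drawing a fresh window at every optimization step means the fixed-length perturbation is never aligned with one particular region of an utterance.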

The perturbation budgets are domain-specific because waveform-domain and STFT-magnitude-domain perturbations operate on different signal representations and are not directly comparable by their raw $\epsilon$ values. Therefore, cross-domain comparisons are interpreted together with WER and SNR results, which provide a common quality-based reference for the amount of audio degradation introduced by each attack.

4.4. Evaluation Protocol

The evaluation consists of three stages. First, each attack is optimized using HiFi-GAN-generated speech and AASIST as the source detector. Second, the saved universal perturbation checkpoint is applied to HiFi-GAN, Fullband MelGAN, StyleMelGAN, and Parallel WaveGAN audio without additional optimization. Third, the clean and adversarial audio are evaluated on AASIST, ResNet+LFCC, LCNN+CQCC, and BiLSTM.

For each detector and vocoder, clean accuracy is first recorded to verify that the detector performs reliably before attack. Attack success rate (ASR) is then computed using the definitions from Section 3.6. For AASIST, ASR measures fake-to-real misclassification. For multi-class detectors, ASR measures incorrect vocoder attribution.

Audio quality and intelligibility are evaluated using Whisper [30] word error rate (WER) and signal-to-noise ratio (SNR). These metrics are reported alongside ASR to distinguish quality-preserving adversarial examples from attacks that mainly succeed through signal degradation.

5. Results

5.1. Frequency-Domain Attack Results

Table 2 reports the ASR of STFT-magnitude-domain attacks across the four detectors and four vocoders. The strongest result is observed for PGD on ResNet+LFCC, where ASR reaches approximately 100% across all vocoders. Since the attack is optimized against AASIST rather than ResNet+LFCC, this indicates strong black-box transfer from the AASIST source detector to the LFCC-based attribution detector.

However, this transfer does not generalize uniformly across detectors. The same STFT-magnitude PGD perturbation has limited effect on LCNN+CQCC and no measurable effect on BiLSTM. For AASIST, PGD achieves moderate source-detector ASR, ranging from 26% to 53% across vocoders. FGSM, BIM, and CW remain weak in the STFT-magnitude setting, with low ASR on both AASIST and most black-box detectors. Overall, the frequency-domain results show that STFT-magnitude PGD is effective mainly against LFCC-based attribution, while the other attacks are largely ineffective under the tested budget.

5.2. Waveform-Domain Attack Results

Table 3 reports the ASR of waveform-domain attacks. Compared with the STFT-magnitude setting, waveform-domain perturbations produce broader transferability across feature-based detectors. LCNN+CQCC is the most vulnerable target, with all four waveform attacks reaching 100% ASR across all vocoders. ResNet+LFCC is also highly vulnerable, with most attacks producing near-complete attribution failure.

The BiLSTM detector shows a more mixed pattern. It is highly vulnerable for HiFi-GAN, but its response varies substantially for the other vocoders. For example, Fullband MelGAN remains relatively robust to PGD, FGSM, and BIM, while CW reaches high ASR. This indicates that, for BiLSTM, transferability depends strongly on the interaction between attack type and the vocoder that generated the evaluated audio.

For AASIST, waveform-domain BIM and CW are the strongest attacks. BIM reaches 100% ASR across all vocoders, while CW achieves consistently high ASR. PGD produces moderate ASR, whereas FGSM fails completely against AASIST. However, FGSM still transfers strongly to ResNet+LFCC and LCNN+CQCC, showing that source-detector success does not necessarily predict black-box transferability.

5.3. Cross-Domain Comparison

Table 4 summarizes the average ASR across vocoders for both perturbation domains. The results show a clear domain effect. In the STFT-magnitude domain, high ASR is concentrated mainly in PGD against ResNet+LFCC. In the waveform domain, high ASR is distributed more broadly across attacks and detectors, especially for ResNet+LFCC and LCNN+CQCC.

The ranking of attacks also changes across domains. For AASIST, PGD is the strongest attack in the STFT-magnitude domain, whereas BIM and CW are the strongest in the waveform domain. For BiLSTM, STFT-magnitude attacks have almost no effect, whereas waveform-domain attacks produce higher but more variable ASR. These results show that perturbation domain is a major factor in adversarial effectiveness and transferability.

5.4. Black-Box Transferability Across Detectors

Table 5 compares source-detector ASR with black-box transfer ASR averaged across vocoders. Since all attacks are optimized against AASIST, the AASIST column represents source-detector performance, while ResNet+LFCC, LCNN+CQCC, and BiLSTM represent black-box target detectors.

In the STFT-magnitude domain, PGD achieves the strongest source-detector ASR and transfers almost completely to ResNet+LFCC, but it transfers poorly to LCNN+CQCC and BiLSTM. The other STFT-magnitude attacks show weak source and transfer performance. In contrast, waveform-domain BIM and CW achieve both strong source ASR and high average black-box ASR. Waveform-domain FGSM is the main exception: it achieves 0.00% ASR on AASIST but high average black-box ASR, indicating that black-box transferability must be evaluated separately from source-model success.

5.5. Audio Quality and Intelligibility

Table 6 reports Whisper WER, Table 7 reports SNR, and Table 8 summarizes the average WER and SNR across vocoders. The clean vocoded audio has an average WER of 3.96%, providing the baseline for intelligibility comparison.

For STFT-magnitude attacks, FGSM, BIM, and CW preserve audio quality well. Their WER remains close to the clean baseline and their SNR values are high, but their ASR values are low in most detector settings. STFT-magnitude PGD produces stronger transfer to ResNet+LFCC, but it increases average WER to 5.24% and reduces average SNR to 11.20 dB. This indicates a stronger distortion–effectiveness trade-off for PGD in the frequency domain.

For waveform-domain attacks, FGSM and BIM achieve high ASR on several detectors but substantially degrade audio quality. FGSM increases average WER to 17.53% with an average SNR of 3.33 dB, while BIM increases average WER to 12.84% with an average SNR of 6.59 dB. These results suggest that their high ASR should be interpreted cautiously. In contrast, waveform-domain PGD and CW keep WER close to the clean baseline, with average WER values of 4.03% and 4.07%, respectively. Together with their strong ASR results, waveform-domain PGD and CW provide the strongest quality-aware attack performance.
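As a rough illustration of the two quality measures, the sketch below computes SNR in dB and the WER change relative to the clean baseline. It assumes the conventional definition of SNR as the ratio of clean-signal power to perturbation power; the exact formulation used in the paper is given in its methodology section and may differ. The function names `snr_db` and `delta_wer` are hypothetical.

```python
import math


def snr_db(clean, adversarial):
    """SNR in dB, assuming noise = adversarial - clean (conventional definition)."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((a - c) ** 2 for c, a in zip(clean, adversarial))
    if noise_power == 0.0:
        return math.inf  # identical signals: no perturbation at all
    return 10.0 * math.log10(signal_power / noise_power)


def delta_wer(adversarial_wer, clean_wer=3.96):
    """WER change in percentage points relative to the clean average (3.96%)."""
    return adversarial_wer - clean_wer


# Toy check: scaling a sine wave by 1.1 adds a perturbation with 1% of the
# signal power, which corresponds to an SNR of 20 dB.
clean = [math.sin(2.0 * math.pi * k / 16.0) for k in range(64)]
scaled = [1.1 * c for c in clean]
print(f"toy SNR: {snr_db(clean, scaled):.2f} dB")

# WER changes for the waveform-domain attacks, using the averages in Table 6.
for attack, wer in [("FGSM", 17.53), ("BIM", 12.84), ("PGD", 4.03), ("CW", 4.07)]:
    print(f"waveform {attack:4s} delta WER = {delta_wer(wer):+.2f}")
```

Because this recomputes the difference from rounded averages, the result can differ from Table 8 in the last digit (for example, 0.11 here versus 0.12 in the table for waveform-domain CW).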

6. Discussion

The results show that adversarial robustness in vocoder fingerprint detection is not determined by the detector alone. Instead, it depends on the interaction between the attack algorithm, the perturbation domain, and the detector representation. This is important because a detector that appears robust under one attack setting may still be vulnerable under another.

6.1. Perturbation Domain and Detector Representation

The comparison between waveform-domain and STFT-magnitude-domain attacks shows that perturbation domain has a major effect on adversarial behavior. STFT-magnitude attacks mainly expose representation-specific vulnerability. In particular, STFT-magnitude PGD transfers strongly to ResNet+LFCC but has limited impact on LCNN+CQCC and BiLSTM. This suggests that the perturbation disrupts spectral structures that are highly relevant to LFCC-based attribution, but less damaging to CQCC-based or recurrent detectors.

Waveform-domain attacks produce a different pattern. They transfer more broadly to feature-based detectors, especially ResNet+LFCC and LCNN+CQCC. This indicates that direct waveform perturbations can disrupt downstream feature extraction even when the target detector itself does not operate directly on raw waveform input. Therefore, feature-based detectors should not be assumed to be protected from waveform-level adversarial manipulation.

These findings suggest that robustness evaluation should include more than one perturbation domain. A detector may appear robust against STFT-magnitude perturbations while remaining vulnerable to waveform-domain attacks, or vice versa. The results therefore support a domain-aware view of adversarial robustness, where attack effectiveness must be interpreted in relation to the signal representation being perturbed and the representation used by the target detector.

6.2. Source-Model Success Does Not Guarantee Transferability

All attacks in this study are optimized against AASIST, but the strongest transfer effects do not always correspond to the strongest source-model attacks. The clearest example is waveform-domain FGSM. Although it fails to achieve fake-to-real evasion on AASIST, it still causes substantial attribution failure on black-box feature-based detectors. This shows that source-detector ASR is not sufficient for evaluating the broader risk of an attack.

This result has practical implications for black-box adversarial evaluation. If evaluation only considers the source detector, an attack such as waveform-domain FGSM would appear ineffective. However, when evaluated against independent target detectors, the same perturbation produces strong attribution failure. Therefore, black-box transferability should be evaluated explicitly across independent detectors rather than inferred from source-model performance.

The detector-specific transfer patterns also suggest that different detector families rely on different fingerprint cues. LFCC-based, CQCC-based, recurrent, and raw-waveform detectors do not respond identically to the same adversarial perturbation. This means that robustness claims based on a single detector architecture are limited. A more reliable evaluation should include heterogeneous target detectors with different front-end representations and model structures.
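A minimal sketch of such a multi-detector evaluation is given below. It is not the authors' pipeline: the two threshold "detectors" are toy stand-ins for real models, the interface (a callable that maps a waveform to a label) is hypothetical, and ASR is assumed here to be the percentage of perturbed fake samples that are no longer labeled correctly.

```python
from typing import Callable, Dict, List, Sequence

Waveform = Sequence[float]
Detector = Callable[[Waveform], str]  # hypothetical interface: waveform -> label


def apply_universal(sample: Waveform, delta: Waveform) -> List[float]:
    """Add the same (universal) perturbation to a sample, element-wise."""
    return [x + d for x, d in zip(sample, delta)]


def attack_success_rate(detector: Detector, batch: List[Waveform], labels: List[str]) -> float:
    """Percentage of samples whose predicted label differs from the true label."""
    failures = sum(1 for x, y in zip(batch, labels) if detector(x) != y)
    return 100.0 * failures / len(batch)


# Toy stand-ins for detectors with different "representations".
def mean_energy_detector(x: Waveform) -> str:
    return "fake" if sum(abs(v) for v in x) / len(x) > 0.5 else "real"


def peak_detector(x: Waveform) -> str:
    return "fake" if max(abs(v) for v in x) > 0.75 else "real"


samples = [[a, -a, a, -a] for a in (0.8, 0.9, 1.0)]  # all synthetic ("fake")
labels = ["fake"] * len(samples)
delta = [-0.2, 0.2, -0.2, 0.2]  # one perturbation shared by every sample
adversarial = [apply_universal(s, delta) for s in samples]

detectors: Dict[str, Detector] = {"mean": mean_energy_detector, "peak": peak_detector}
asr = {name: attack_success_rate(det, adversarial, labels) for name, det in detectors.items()}
for name, value in asr.items():
    print(f"{name:5s} ASR = {value:6.2f}%")
```

In this toy setting the same perturbation leaves the mean-energy detector unaffected but flips two of three decisions of the peak detector, which mirrors the point that a single detector can give a misleading picture of attack strength.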

6.3. Quality-Aware Interpretation of Attack Strength

The audio quality results show that high ASR alone is not enough to identify a strong adversarial attack. Some attacks achieve high detector disruption only at the cost of substantial signal degradation. Waveform-domain FGSM and BIM are the main examples: they produce strong black-box ASR, but they also substantially increase WER and produce very low SNR. Their effectiveness should therefore be interpreted cautiously, because part of their success may come from destructive distortion rather than subtle adversarial manipulation.

Waveform-domain PGD and CW provide a stronger trade-off. Both maintain WER close to the clean baseline while still achieving strong attack success across several detector settings. This makes them more convincing as quality-preserving adversarial attacks. In contrast, STFT-magnitude FGSM, BIM, and CW preserve audio quality but are weak in terms of ASR. STFT-magnitude PGD is more effective, especially against ResNet+LFCC, but its low SNR indicates stronger spectral distortion.

These results show why adversarial audio evaluation should report both attack effectiveness and audio quality. Without WER and SNR, waveform-domain FGSM and BIM would appear stronger than they actually are. With quality metrics included, waveform-domain PGD and CW emerge as more meaningful attacks because they better balance detector disruption and speech preservation.
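One way to make this quality-aware reading concrete is to filter attacks by their WER increase before ranking them by average black-box ASR. The sketch below uses the averages from Tables 5 and 8; the 1.0 percentage-point threshold `MAX_DELTA_WER` is a hypothetical choice for illustration and is not a criterion proposed in the paper.

```python
from typing import List, NamedTuple


class AttackResult(NamedTuple):
    domain: str
    attack: str
    black_box_asr: float  # Avg. black-box ASR (%), Table 5
    avg_wer: float        # Avg. WER (%), Table 8
    avg_snr_db: float     # Avg. SNR (dB), Table 8


RESULTS = [
    AttackResult("STFT-mag", "FGSM", 4.59, 3.93, 42.07),
    AttackResult("STFT-mag", "BIM", 2.99, 3.95, 46.68),
    AttackResult("STFT-mag", "PGD", 34.58, 5.24, 11.20),
    AttackResult("STFT-mag", "CW", 2.63, 3.95, 49.77),
    AttackResult("Waveform", "FGSM", 83.34, 17.53, 3.33),
    AttackResult("Waveform", "BIM", 91.08, 12.84, 6.59),
    AttackResult("Waveform", "PGD", 77.85, 4.03, 33.33),
    AttackResult("Waveform", "CW", 92.47, 4.07, 24.28),
]

CLEAN_WER = 3.96      # clean average WER from Table 6
MAX_DELTA_WER = 1.0   # hypothetical tolerance in percentage points


def quality_preserving(results: List[AttackResult]) -> List[AttackResult]:
    """Keep attacks whose WER increase stays within the assumed tolerance."""
    return [r for r in results if r.avg_wer - CLEAN_WER <= MAX_DELTA_WER]


def rank_by_transfer(results: List[AttackResult]) -> List[AttackResult]:
    """Sort attacks by average black-box ASR, strongest first."""
    return sorted(results, key=lambda r: r.black_box_asr, reverse=True)


ranked = rank_by_transfer(quality_preserving(RESULTS))
for r in ranked:
    print(f"{r.domain:9s} {r.attack:5s} ASR={r.black_box_asr:6.2f} WER={r.avg_wer:5.2f} SNR={r.avg_snr_db:6.2f}")
```

Under this assumed threshold, waveform-domain FGSM and BIM and STFT-magnitude PGD are excluded, and waveform-domain CW and PGD rank first and second, consistent with the interpretation above.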

7. Conclusion

This paper presented a comparative study of adversarial attacks against audio deepfake and vocoder fingerprint detectors. Four attacks, FGSM, BIM, PGD, and CW, were evaluated in both waveform-domain and STFT-magnitude-domain settings. All attacks were optimized against AASIST, the source detector, using a targeted fake-to-real objective and were then evaluated for black-box transferability to detectors with different input representations: ResNet+LFCC, LCNN+CQCC, and BiLSTM.

The results show that adversarial effectiveness depends strongly on the interaction between attack algorithm, perturbation domain, and detector representation. STFT-magnitude PGD transfers strongly to ResNet+LFCC but has limited effect on LCNN+CQCC and BiLSTM, indicating that frequency-domain vulnerability is representation-specific. In contrast, waveform-domain attacks produce broader transferability across feature-based detectors, particularly ResNet+LFCC and LCNN+CQCC. However, source-detector success does not always predict black-box transferability, as shown by waveform-domain FGSM, which fails against AASIST but transfers strongly to several black-box detectors.

The quality evaluation further shows that high ASR alone is insufficient for assessing adversarial audio. Waveform-domain FGSM and BIM achieve strong detector disruption but substantially degrade WER and SNR, whereas waveform-domain PGD and CW preserve intelligibility while maintaining strong attack performance. These findings highlight the need to evaluate adversarial attacks using both detection metrics and audio quality metrics.

Overall, this study demonstrates that robustness claims in audio deepfake and vocoder fingerprint detection should not be based on a single attack, detector, or perturbation domain. Future work should extend this comparison to additional source detectors, larger and more diverse datasets, additional vocoder families, and defended detection pipelines. Further work should also examine adaptive attacks and defense strategies such as preprocessing-based defenses, adversarial training, detector ensembling, and representation-level regularization.


Table 2. Attack success rate (%) of STFT-magnitude-domain attacks across detectors and vocoders.

Detector	Vocoder	Clean Acc.	PGD	FGSM	BIM	CW
ResNet+LFCC	HiFi-GAN	99.3	99.99	6.00	2.00	6.00
	Fullband MelGAN	96.8	99.99	3.00	4.00	6.00
	StyleMelGAN	98.7	100.00	4.00	2.00	0.00
	Parallel WaveGAN	100.0	99.99	0.00	0.00	0.00
LCNN+CQCC	HiFi-GAN	96.0	0.00	7.00	8.00	6.00
	Fullband MelGAN	95.0	0.00	27.00	15.00	10.00
	StyleMelGAN	100.0	0.00	2.00	2.00	1.00
	Parallel WaveGAN	100.0	15.00	0.00	0.00	0.00
BiLSTM	HiFi-GAN	99.79	0.00	1.91	0.74	1.38
	Fullband MelGAN	99.47	0.00	4.03	1.70	1.17
	StyleMelGAN	100.0	0.00	0.11	0.42	0.00
	Parallel WaveGAN	100.0	0.00	0.00	0.00	0.00
AASIST	HiFi-GAN	94.0	34.00	6.69	6.37	6.37
	Fullband MelGAN	93.0	45.00	8.60	8.07	8.28
	StyleMelGAN	97.0	26.00	3.40	3.08	3.29
	Parallel WaveGAN	94.0	53.00	6.79	6.37	6.69

Table 3. Attack success rate (%) of waveform-domain attacks across detectors and vocoders.

Detector	Vocoder	Clean Acc.	PGD	FGSM	BIM	CW
ResNet+LFCC	HiFi-GAN	99.3	100.0	100.0	100.0	100.0
	Fullband MelGAN	96.8	92.0	100.0	100.0	100.0
	StyleMelGAN	98.7	100.0	1.0	100.0	100.0
	Parallel WaveGAN	100.0	100.0	100.0	95.0	100.0
LCNN+CQCC	HiFi-GAN	96.0	100.0	100.0	100.0	100.0
	Fullband MelGAN	95.0	100.0	100.0	100.0	100.0
	StyleMelGAN	100.0	100.0	100.0	100.0	100.0
	Parallel WaveGAN	100.0	100.0	100.0	100.0	100.0
BiLSTM	HiFi-GAN	99.79	99.79	100.0	100.0	100.0
	Fullband MelGAN	99.47	9.45	0.42	1.06	99.79
	StyleMelGAN	100.0	32.59	98.62	96.92	100.0
	Parallel WaveGAN	100.0	0.32	100.0	100.0	9.87
AASIST	HiFi-GAN	94.0	37.0	0.0	100.0	85.35
	Fullband MelGAN	93.0	39.0	0.0	100.0	89.60
	StyleMelGAN	97.0	24.0	0.0	100.0	71.66
	Parallel WaveGAN	94.0	38.0	0.0	100.0	87.69

Table 4. Average ASR (%) across vocoders for STFT-magnitude and waveform-domain attacks.

Domain	Detector	PGD	FGSM	BIM	CW
STFT-mag	ResNet+LFCC	99.99	3.25	2.00	3.00
	LCNN+CQCC	3.75	9.00	6.25	4.25
	BiLSTM	0.00	1.51	0.72	0.64
	AASIST	39.50	6.37	5.97	6.16
Waveform	ResNet+LFCC	98.00	75.25	98.75	100.00
	LCNN+CQCC	100.00	100.00	100.00	100.00
	BiLSTM	35.54	74.76	74.50	77.42
	AASIST	34.50	0.00	100.00	83.58

Table 5. Source-detector ASR and black-box transfer ASR averaged across vocoders. AASIST is the source detector, while ResNet+LFCC, LCNN+CQCC, and BiLSTM are black-box target detectors.

Domain	Attack	AASIST ASR	ResNet+LFCC	LCNN+CQCC	BiLSTM	Avg. Black-box ASR
STFT-mag	PGD	39.50	99.99	3.75	0.00	34.58
STFT-mag	FGSM	6.37	3.25	9.00	1.51	4.59
STFT-mag	BIM	5.97	2.00	6.25	0.72	2.99
STFT-mag	CW	6.16	3.00	4.25	0.64	2.63
Waveform	PGD	34.50	98.00	100.00	35.54	77.85
Waveform	FGSM	0.00	75.25	100.00	74.76	83.34
Waveform	BIM	100.00	98.75	100.00	74.50	91.08
Waveform	CW	83.58	100.00	100.00	77.42	92.47

Table 6. Whisper WER (%) of clean and adversarial audio across vocoders.

Domain	Attack	HiFi-GAN	Fullband MelGAN	StyleMelGAN	Parallel WaveGAN	Avg. WER
Clean	None	3.74	3.99	4.12	3.97	3.96
STFT-mag	FGSM	3.72	3.97	4.03	3.99	3.93
	BIM	3.72	3.98	4.02	4.06	3.95
	CW	3.74	4.01	4.05	4.01	3.95
	PGD	4.65	5.50	5.51	5.28	5.24
Waveform	FGSM	16.28	17.83	16.05	19.94	17.53
	BIM	11.77	13.92	12.35	13.30	12.84
	CW	3.85	4.18	4.17	4.09	4.07
	PGD	3.84	4.12	4.16	3.98	4.03

Table 7. Signal-to-noise ratio (SNR, dB) of adversarial audio across vocoders. Values are reported as mean ± standard deviation.

Domain	Vocoder	FGSM	BIM	PGD	CW
STFT-mag	HiFi-GAN	41.69 ± 1.24	46.31 ± 1.24	11.00 ± 1.28	49.40 ± 1.29
	Fullband MelGAN	41.80 ± 1.27	46.50 ± 1.27	10.90 ± 1.28	49.50 ± 1.33
	StyleMelGAN	42.80 ± 1.29	47.40 ± 1.28	11.90 ± 1.25	50.48 ± 1.37
	Parallel WaveGAN	42.00 ± 1.28	46.50 ± 1.28	11.00 ± 1.28	49.70 ± 1.34
Waveform	HiFi-GAN	2.95 ± 1.22	6.20 ± 1.23	32.90 ± 1.22	23.90 ± 1.24
	Fullband MelGAN	3.13 ± 1.26	6.39 ± 1.27	33.10 ± 1.26	24.10 ± 1.28
	StyleMelGAN	4.09 ± 1.25	7.35 ± 1.28	34.10 ± 1.21	25.00 ± 1.23
	Parallel WaveGAN	3.16 ± 1.27	6.42 ± 1.28	33.20 ± 1.26	24.10 ± 1.28

Table 8. Average WER (%), change in WER relative to clean audio (Δ WER, percentage points), and average SNR (dB) across vocoders.

Domain	Attack	Avg. WER	Δ WER	Avg. SNR
STFT-mag	FGSM	3.93	-0.03	42.07
STFT-mag	BIM	3.95	-0.01	46.68
STFT-mag	PGD	5.24	1.28	11.20
STFT-mag	CW	3.95	0.00	49.77
Waveform	FGSM	17.53	13.57	3.33
Waveform	BIM	12.84	8.88	6.59
Waveform	PGD	4.03	0.07	33.33
Waveform	CW	4.07	0.12	24.28

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
