Preprint Communication Version 1 Preserved in Portico This version is not peer-reviewed

Pre-training of Multi-order Acoustic Simulation for Replay Voice Spoofing Detection

Version 1 : Received: 26 July 2023 / Approved: 27 July 2023 / Online: 28 July 2023 (10:14:32 CEST)

A peer-reviewed article of this Preprint also exists.

Go, C.; Park, N.I.; Jeon, O.-Y.; Chun, C. A Pre-Training Framework Based on Multi-Order Acoustic Simulation for Replay Voice Spoofing Detection. Sensors 2023, 23, 7280.

Abstract

Voice spoofing attempts to break into an automatic speaker verification (ASV) system by forging the user’s voice, and can be carried out through methods such as text-to-speech (TTS), voice conversion (VC), and replay attacks. Recently, deep learning-based voice spoofing countermeasures have been developed. However, replay attacks pose a particular problem: large datasets are difficult to construct because every sample requires a physical recording process. To overcome this problem, this study proposes a pre-training framework based on multi-order acoustic simulation for replay voice spoofing detection. Multi-order acoustic simulation uses existing clean signal and room impulse response (RIR) datasets to generate audio that simulates the various acoustic configurations of original and replayed audio. The acoustic configuration refers to factors such as the microphone type, reverberation, time delay, and noise that may occur between a speaker and a microphone during the recording process. We hypothesize that a deep learning model trained on audio simulating the various acoustic configurations of original and replayed audio can classify the acoustic configurations of real original and replayed audio well. To test this, we pre-trained a model to classify the audio generated by the multi-order acoustic simulation into three classes: clean signal, audio simulating the acoustic configuration of the original audio, and audio simulating the acoustic configuration of the replayed audio. We then used the weights of the pre-trained model as the initial weights of the replay voice spoofing detection model and fine-tuned it on an existing replay voice spoofing dataset. To validate the effectiveness of the proposed method, we compared the conventional method without pre-training and the proposed method using accuracy as the objective metric. The conventional method achieved 92.94% accuracy, and the proposed method achieved 98.16% accuracy.
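
The idea of multi-order acoustic simulation described above can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so the following is an assumption rather than the authors' implementation: order 0 is taken to be the clean signal, order 1 a single pass through an RIR plus noise (simulating an original recording), and order 2 two successive passes (simulating a replayed recording). The function names `convolve`, `acoustic_pass`, `simulate`, and `toy_rir` are hypothetical, the toy RIR stands in for a measured one from an RIR dataset, and the pure-Python convolution is used only to keep the example dependency-free.

```python
import math
import random

# Assumed mapping from simulation order to pre-training class label.
LABELS = {0: "clean", 1: "original", 2: "replay"}


def convolve(signal, rir):
    """Direct-form linear convolution, truncated to the input length."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(n + 1, len(rir))):
            acc += rir[k] * signal[n - k]
        out[n] = acc
    return out


def acoustic_pass(signal, rir, snr_db, rng):
    """One simulated recording pass: reverberation via an RIR, then additive white noise."""
    wet = convolve(signal, rir)
    power = sum(v * v for v in wet) / len(wet) + 1e-12
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [v + rng.gauss(0.0, noise_std) for v in wet]


def simulate(clean, rirs, order, snr_db=30.0, seed=0):
    """Apply `order` successive acoustic passes, using rirs[0], rirs[1], ... in turn."""
    if order > len(rirs):
        raise ValueError("need one RIR per simulation order")
    rng = random.Random(seed)
    out = list(clean)
    for k in range(order):
        out = acoustic_pass(out, rirs[k], snr_db, rng)
    return out


def toy_rir(length, delay, decay, rng):
    """Hypothetical stand-in for a measured RIR: delayed direct path plus a decaying noisy tail."""
    rir = [0.0] * length
    for i in range(delay, length):
        rir[i] = rng.gauss(0.0, 1.0) * math.exp(-decay * (i - delay))
    rir[delay] = 1.0  # direct path
    return rir


if __name__ == "__main__":
    rng = random.Random(1)
    clean = [rng.gauss(0.0, 1.0) for _ in range(400)]
    rirs = [toy_rir(50, 3, 0.1, rng), toy_rir(50, 8, 0.05, rng)]
    for order, label in LABELS.items():
        audio = simulate(clean, rirs, order)
        print(label, len(audio))
```

Under these assumptions, each generated sample carries its simulation order as the class label for the three-class pre-training task. In practice an FFT-based convolution and measured RIRs would replace the toy components used here.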

Keywords

voice spoofing; acoustic configuration; deep learning

Subject

Computer Science and Mathematics, Signal Processing
