Preprint Article · Version 2 · Preserved in Portico · This version is not peer-reviewed

Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder

Version 1 : Received: 18 October 2023 / Approved: 19 October 2023 / Online: 19 October 2023 (07:02:12 CEST)
Version 2 : Received: 1 November 2023 / Approved: 2 November 2023 / Online: 2 November 2023 (08:13:19 CET)

A peer-reviewed article of this Preprint also exists.

Ezzine, K.; Di Martino, J.; Frikha, M. Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder. Appl. Sci. 2023, 13, 11988.

Abstract

We present an any-to-one voice conversion (VC) system that uses an autoregressive model and the LPCNet vocoder, aimed at enhancing the converted speech in terms of naturalness, intelligibility, and speaker similarity. As the name implies, non-parallel any-to-one voice conversion does not require paired source and target speech and can be employed for arbitrary speech conversion tasks. Recent advances in neural vocoders, such as WaveNet, have improved the efficiency of speech synthesis. However, in practice, we find that the trajectory of some generated waveforms is not consistently smooth, leading to occasional voice errors. To address this issue, we propose to use an autoregressive (AR) conversion model along with the high-fidelity LPCNet vocoder. This combination not only solves the problem of waveform fluidity but also produces more natural and clearer speech, with the added capability of real-time speech generation. To precisely represent the linguistic content of a given utterance, we use speaker-independent phonetic PosteriorGram (SI-PPG) features computed by an automatic speech recognition (ASR) model trained on a multi-speaker corpus. A conversion model then maps the SI-PPG to the acoustic representations used as input features for LPCNet. The proposed autoregressive structure enables our system to predict the acoustic features of the current step from those predicted in the previous step. We evaluate the effectiveness of our system on any-to-one conversion pairs between native English speakers. Experimental results show that the proposed method outperforms state-of-the-art systems, producing higher speech quality and greater speaker similarity.
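To illustrate the autoregressive structure described in the abstract, the following is a minimal sketch of a PPG-to-acoustic conversion model in PyTorch. The layer choices, feature dimensions (PPG_DIM, ACOUSTIC_DIM), and class names are illustrative assumptions, not the authors' implementation; the point is only to show how each prediction step is conditioned on the acoustic frame predicted at the previous step before the frames are passed to an LPCNet-style vocoder.

```python
# Minimal sketch of an autoregressive PPG-to-acoustic conversion model.
# Dimensions and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn

PPG_DIM = 144        # assumed size of the speaker-independent PPG vector per frame
ACOUSTIC_DIM = 20    # assumed size of the acoustic feature vector fed to the vocoder


class ARConversionModel(nn.Module):
    """Maps SI-PPG frames to acoustic features, conditioning each step
    on the acoustic frame predicted at the previous step."""

    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(PPG_DIM, hidden, batch_first=True)
        # Decoder input = encoded PPG frame + previously predicted acoustic frame
        self.decoder_cell = nn.GRUCell(hidden + ACOUSTIC_DIM, hidden)
        self.proj = nn.Linear(hidden, ACOUSTIC_DIM)

    def forward(self, ppg):                        # ppg: (batch, T, PPG_DIM)
        enc, _ = self.encoder(ppg)                 # (batch, T, hidden)
        batch, T, _ = enc.shape
        h = enc.new_zeros(batch, self.decoder_cell.hidden_size)
        prev = enc.new_zeros(batch, ACOUSTIC_DIM)  # previous acoustic frame (zeros at t = 0)
        outputs = []
        for t in range(T):
            h = self.decoder_cell(torch.cat([enc[:, t], prev], dim=-1), h)
            prev = self.proj(h)                    # acoustic frame predicted for step t
            outputs.append(prev)
        return torch.stack(outputs, dim=1)         # (batch, T, ACOUSTIC_DIM)


# Example usage (random input standing in for SI-PPG features of one utterance):
# acoustic = ARConversionModel()(torch.randn(1, 200, PPG_DIM))
# The predicted acoustic frames would then drive an LPCNet vocoder for waveform synthesis.
```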

Keywords

voice conversion; non-parallel data; autoregressive model; LPCNet; phonetic PosteriorGrams

Subject

Engineering, Other

Comments (1)

Comment 1
Received: 2 November 2023
Commenter: Kadria Ezzine
Commenter's Conflict of Interests: Author
Comment: This manuscript is the final proofread version after review by two reviewers and acceptance by Applied Sciences. The changes are as follows:
- we have thoroughly reviewed and addressed the grammar and spelling issues in the paper;
- we have reinforced the innovation of the proposed method in the "Method" section;
- we have redrawn Figures 3, 4, and 5 with a professional tool to improve their quality and placed them correctly relative to their cross-references;
- we have cited six additional recent references;
- we have adjusted the values deduced from the data in the table;
- we have addressed all the comments received from the reviewers.
