Version 1: Received: 18 October 2023 / Approved: 19 October 2023 / Online: 19 October 2023 (07:02:12 CEST)
Version 2: Received: 1 November 2023 / Approved: 2 November 2023 / Online: 2 November 2023 (08:13:19 CET)
Ezzine, K.; Di Martino, J.; Frikha, M. Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder. Appl. Sci. 2023, 13, 11988.
Abstract
We present an any-to-one voice conversion (VC) system using an autoregressive model and the LPCNet vocoder, aimed at enhancing the converted speech in terms of naturalness, intelligibility, and speaker similarity. As the name implies, non-parallel any-to-one voice conversion does not require paired source and target speech and can be employed for arbitrary speech conversion tasks. Recent advancements in neural vocoders, such as WaveNet, have improved the efficiency of speech synthesis. In practice, however, we find that the trajectory of some generated waveforms is not consistently smooth, leading to occasional voice errors. To address this issue, we propose using an autoregressive (AR) conversion model together with the high-fidelity LPCNet vocoder. This combination not only solves the waveform-fluidity problem but also produces more natural and clearer speech, with the added capability of real-time speech generation. To represent the linguistic content of a given utterance precisely, we use speaker-independent phonetic posteriorgram (SI-PPG) features computed from an automatic speech recognition (ASR) model trained on a multi-speaker corpus. A conversion model then maps the SI-PPG features to the acoustic representations used as input to LPCNet. The proposed autoregressive structure enables our system to predict each step's acoustic features from those predicted in the previous step. We evaluate the effectiveness of our system by performing any-to-one conversions between native English speakers. Experimental results show that the proposed method outperforms state-of-the-art systems, producing higher speech quality and greater speaker similarity.
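The autoregressive conversion step described in the abstract (each output frame conditioned on the current SI-PPG frame and the previously predicted acoustic frame) can be sketched as follows. This is a minimal illustration only: the dimensions, the fixed random linear maps standing in for the trained conversion network, and the `convert` helper are all hypothetical, not the paper's actual model.

```python
import numpy as np

# Hypothetical dimensions, for illustration only.
PPG_DIM = 144      # size of one SI-PPG frame (assumed)
ACOUSTIC_DIM = 20  # LPCNet-style acoustic feature size (assumed)

rng = np.random.default_rng(0)
# Stand-in for a trained conversion network: fixed small linear maps.
W_ppg = rng.standard_normal((ACOUSTIC_DIM, PPG_DIM)) * 0.01
W_prev = rng.standard_normal((ACOUSTIC_DIM, ACOUSTIC_DIM)) * 0.01

def convert(ppg_frames):
    """Toy autoregressive conversion: each acoustic frame is predicted
    from the current SI-PPG frame and the previously predicted frame."""
    prev = np.zeros(ACOUSTIC_DIM)
    outputs = []
    for ppg in ppg_frames:
        acoustic = np.tanh(W_ppg @ ppg + W_prev @ prev)
        outputs.append(acoustic)
        prev = acoustic  # feed the prediction back in: the AR loop
    return np.array(outputs)

frames = rng.standard_normal((5, PPG_DIM))  # 5 dummy SI-PPG frames
feats = convert(frames)
print(feats.shape)
```

In the real system these acoustic frames would then be passed to LPCNet for waveform synthesis; the point here is only the feedback of step *t*'s prediction into step *t*+1.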
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received: 2 November 2023
Commenter: Kadria Ezzine
Commenter's Conflict of Interests: Author
Comment:
This manuscript is the final proofread version after review by two reviewers and acceptance by Applied Sciences. Details of all changes are as follows:
- We have thoroughly reviewed and addressed the grammar and spelling issues in the paper.
- We have reinforced the innovation of the proposed method in the "Method" section.
- We have corrected Figures 3, 4, and 5 using a professional tool to improve their quality and placed them correctly relative to their cross-references.
- We have cited six additional recent references.
- We have adjusted the values deduced from the data in the table.
- We have addressed all the comments received from the reviewers.