Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

These authors contributed equally to the work.
Version 1: Received: 2 June 2023 / Approved: 5 June 2023 / Online: 5 June 2023 (02:27:49 CEST)

A peer-reviewed article of this Preprint also exists.

González-Docasal, A.; Álvarez, A. Enhancing Voice Cloning Quality through Data Selection and Alignment-Based Metrics. Appl. Sci. 2023, 13, 8049.

Abstract

Voice cloning, an emerging field in speech processing, aims to generate synthetic utterances that closely resemble the voice of a specific individual. In this study, we investigate the impact of several techniques for improving voice cloning quality, focusing on a low-quality dataset. For comparison, we also use two high-quality corpora. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable audio recordings for training a voice cloning system. Following these measurements, we conduct a series of ablations, removing recordings with lower signal-to-noise ratio (SNR) and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of the synthesised audio for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus starting from a pre-trained model exhibit audio quality comparable to that of models trained from scratch on significantly larger amounts of data.
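
The abstract does not spell out how the fraction of aligned input characters is computed, so the following is only an illustrative sketch of one plausible reading, not the authors' algorithm. It assumes the attention matrix has shape (decoder steps, input characters), and it counts a character as aligned when at least one decoder step places its strongest attention weight on that character with a value of at least `min_weight`. The function name `aligned_char_fraction` and the `min_weight` threshold are hypothetical choices made for this example.

```python
import numpy as np


def aligned_char_fraction(attention, min_weight=0.5):
    """Return the share of input characters that some decoder step attends to.

    ASSUMPTION: `attention` has shape (decoder_steps, input_chars), as is
    typical for Tacotron 2-style attention, and a character is "aligned" if
    it is the argmax of at least one decoder step whose peak weight is at
    least `min_weight`. This is an illustrative definition only.
    """
    attention = np.asarray(attention, dtype=float)
    if attention.ndim != 2 or attention.shape[1] == 0:
        raise ValueError("attention must be 2-D with at least one input character")
    if attention.shape[0] == 0:
        return 0.0  # no decoder steps, so nothing was attended to

    peak_index = attention.argmax(axis=1)  # most-attended character per step
    peak_value = attention.max(axis=1)     # weight of that peak
    confident = peak_value >= min_weight   # ignore diffuse attention rows

    attended = np.unique(peak_index[confident])
    return attended.size / attention.shape[1]
```

Under this definition, a clean diagonal alignment scores 1.0, while a decoder that stays stuck on a single character, or whose attention is spread evenly over all characters, scores close to 0.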

Keywords

Voice Cloning; Speech Synthesis; Speech Quality Evaluation

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
