Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

Ander González Docasal; Aitor Álvarez

doi:10.20944/preprints202306.0223.v1

Submitted: 02 June 2023

Posted: 05 June 2023

Abstract

Voice cloning, an emerging field in the speech processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also use two high-quality corpora for comparative analysis. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable audios for the training of a Voice Cloning system. Following these measurements, we conduct a series of ablations by removing audios with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of synthesised audios for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.

Keywords: Voice Cloning; Speech Synthesis; Speech Quality Evaluation

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Voice Cloning, a rapidly evolving research area, has gained significant attention in recent years. Its main objective is to produce synthetic utterances that closely resemble those of a specific speaker, referred to as the cloned speaker. This technique holds significant potential in various domains, particularly in the media industry. Applications include long-form reading of textual content, such as emails and webpages, audiobook narration, voiceovers, dubbing, and more [2]. The rising demand for voice cloning can be attributed to the significant advancements in Deep Learning techniques, which have led to notable improvements in the quality of these systems [3].

However, ensuring the quality of the input data is imperative to obtain accurate results when employing voice cloning techniques based on deep learning algorithms. It is essential that the input audios for a specific speaker possess optimal acoustic conditions, as the cloning algorithm will replicate the training material, including any noise or audio artefacts present in the signals. This encompasses the need for minimal audio compression and optimal sampling and bit rates. Furthermore, the heterogeneity of linguistic characteristics is closely associated with the quality of a voice cloning corpus. High variability in features such as prosody, pitch, pause duration, or rhythm can have a detrimental impact on the training of a voice cloning system. Addressing this issue may need a larger volume of data and/or manual annotations to ensure the satisfactory quality of the final cloned output.

Another critical challenge in the development of Voice Cloning systems pertains to the measurement of the synthetic voice’s quality [4]. The Mean Opinion Score (MOS) constitutes the most precise metric for evaluating voice quality [5], but it requires human evaluators to listen to the synthesised audios and rate them on a scale ranging from 1 to 5. Due to its nature, this approach primarily serves as a means of evaluating the quality of final audios and is not suitable for assessing the quality of the checkpoints produced during training. In addition, a subjective metric that depends on the perception of human evaluators may produce variable results based on the unique circumstances of each evaluator [6]. Furthermore, the recent trend of MOS scores in Voice Cloning systems approaching levels near those of real human speech reveals the limitations of this metric in terms of comparing different models [7].

As a means of overcoming these two main challenges associated with evaluating Voice Cloning systems – the impact of input data and the lack of objective measurements – we propose an alternative evaluation framework that strives to achieve the following objectives: (1) calculating the variability of a given Voice Cloning dataset in order to filter unwanted input, (2) using objective metrics that measure the quality of generated signals, and (3) conducting these measurements during the training process in order to monitor model improvement or lack thereof.

The proposed evaluation framework was applied to one of the datasets described in [1] as a continuation of our previous work. The difficulty of this particular dataset is due in part to the quality of the audios, as well as to the lack of enough training data. However, we managed to prove that reducing the variability of this dataset by excluding specific subsets from the training partition improves the quality of the final generated audios. Additionally, the same evaluation has been implemented on two distinct high-fidelity Voice Cloning datasets, which were in English and Spanish respectively.

To begin, the public tool Montreal Forced Aligner (MFA) was utilised to perform forced alignment on each corpus. Subsequently, after excluding non-aligned audios, the alignments were used to calculate various quality metrics on the datasets, including Signal-to-Noise Ratio (SNR) and utterance speed. These measurements made it possible to eliminate audios that introduced higher variability. Various sets of models were trained both with and without these collections of more irregular audios.

As our Voice Cloning framework, a Text-to-Speech (TTS) approach using the neural acoustic model Tacotron-2 [7] was adopted, which is a well-established model in the Speech Synthesis community. We trained various models using the aforementioned audio datasets, both with the complete versions and after excluding the subsets that introduced higher variability. The spectrograms generated by the Tacotron-2 model were transformed to waveform using the publicly available Universal model of the Vocoder HiFi-GAN [8].

To gauge the efficacy of the proposed Voice Cloning system, various quality evaluation metrics were employed. For evaluating the quality of the generated audios without a reference signal, two distinct MOS estimators were employed: NISQA [9] and MOSnet [4]. Additionally, a novel algorithm has been introduced in order to determine the percentage of aligned characters in the attention matrix of the model as a final metric, which excluded possibly unaligned audios. The aforementioned measurements were conducted on the distinct checkpoints obtained during the training process of these datasets to monitor the models’ progress over time.

The remainder of this paper is structured as follows: Section 2 explores the related work in the research field of Voice Cloning. Section 3 includes the analysis of the two main corpora utilised in this work, whilst Section 4 describes the voice cloning system and its training framework. Section 5 explains the different training scenarios and metrics used for the evaluation presented in Section 6. Finally, Section 7 draws the main conclusions and the lines of future work.

2. Related Work

The field of Voice Cloning has witnessed significant advancements in recent years, mainly driven by the remarkable progress in deep learning techniques. Numerous studies have explored various approaches and methodologies to tackle the challenges associated with this research field.

In terms of applications of Voice Cloning, a prominent one is observed within the scope of deep faking [10,11]. An illustrative instance is the interactive artwork released by the Salvador Dalí Museum, featuring a deep fake representation of the renowned artist [12]. In terms of audio only, the AhoMyTTS project focuses on generating a collection of synthetic voices to aid individuals who are orally disabled or have lost their own voices [13]. Similarly, the Speech-to-Speech Parrotron [14] model serves the purpose of normalising atypical speech by converting it to the voice of a canonical speaker without speech disorders, thereby enhancing its intelligibility. Another successful voice cloning endeavour was demonstrated in the Euphonia Project [15], wherein the voice of a former American football player diagnosed with ALS was recovered through the utilisation of a database containing his recordings [16]. Finally, the voice of former Spanish dictator Francisco Franco was cloned for the XRey podcast [17], winner of the Best Podcast Ondas Award in the National Radio category, in order to synthesise a letter and an interview, in a process explained by the authors in [1].

This research field can be broadly categorised into two main branches: Voice Conversion and Speech Synthesis or Text-to-Speech (TTS). Voice conversion aims to transform the characteristics of a source speaker’s voice into those of a target speaker, while preserving the linguistic content and speech quality of the original audio. In the past few years, the efficacy of deep learning techniques in Voice Conversion has been well-established. Various architectures including autoencoders have gained popularity for this purpose, such as variational autoencoders [18] and bottleneck-based autoencoders [19,20]. Researchers have also explored the application of GANs in this specific task [21,22,23]. Additionally, deep feature extractors have been leveraged to achieve successful outcomes. For instance, van Niekerk et al. [24] employed an approach based on HuBERT [25] for many-to-one voice conversion.

On the other hand, deep learning techniques have emerged as the leading approach in the field of Text-to-Speech systems. WaveNet [3], which employs dilated CNNs to directly generate waveforms, marked a significant milestone in this domain. Since then, numerous neural architectures have been developed by the scientific community. Following Tacotron-2 [7], an attention-based system combined with a set of Long Short-Term Memory layers, many acoustic models that generate a Mel frequency spectrogram from linguistic input have been developed. Some of these architectures employ a phoneme length predictor instead of an attention matrix [26,27,28]. Additionally, sequence-to-sequence modeling has gained traction in this research field. Mehta et al. [29], for example, propose the use of Neural Hidden Markov Models (HMMs) with normalising flows as an acoustic model. In the context of generating speech signal from mel spectrograms, vocoders based on Generative Adversarial Networks (GANs) [8,30,31,32] have gained popularity due to their efficient inference speed, lightweight networks, and ability to produce high-quality waveforms. Furthermore, end-to-end models like VITS [33] or YourTTS [34] have been developed, enabling direct generation of audio signals from linguistic input without the need for an additional vocoder model. Finally, important advances have been made in terms of zero-shot TTS systems that feature Voice Conversion with the use of a decoder-only architecture. As an example, the system VALL-E is capable of cloning a voice with only 3 seconds of the target speaker [35].

With regard to the corpora, numerous many-to-one Voice Cloning corpora are available in the community, primarily designed for Text-to-Speech (TTS) approaches. These corpora prioritise the measurement of dataset quality across various dimensions. Regarding signal quality, the Signal-to-Noise Ratio (SNR) holds significant importance, both during content filtering [36,37] and data recording stages [38,39,40,41]. Linguistic considerations also come into play, with some researchers emphasising the need for balanced phonemic or supraphonemic units within the dataset [38,39,41,42]. Additionally, text preprocessing techniques are employed to ensure accurate alignment with the uttered speech and reduce variability in pronunciations [36,39,40,41,42,43,44]. Lastly, the quantity of audio data generated by each speaker is a critical aspect in corpus creation, particularly in datasets with a low number of speakers [36,38,39,40,41,42,43,44].

Finally, to evaluate the quality of Voice Cloning systems, various approaches have been devised within the research community. Objective metrics that compare a degraded signal to a reference signal, such as PESQ, PEAQ, or POLQA [45], are widely available for assessing audio fidelity in scenarios involving, e.g., signal transfer or speech enhancement. However, these metrics exhibit limitations when applied to Voice Cloning systems, as the degraded signal may not be directly related to the original one. Consequently, an emerging trend focuses on estimating the Mean Opinion Score (MOS) using the cloned signal alone. MOSnet [4,5] and NISQA [9,46] are examples of systems that employ deep learning models for MOS estimation in this context. In line with this trend, the VoiceMOS Challenge [47] was introduced to address the specific issue of MOS estimation in Voice Cloning systems. The challenge served as a platform for the development of this type of systems, fostering advancements in the field and facilitating the resolution of this particular challenge.

Our study focuses on evaluating the quality of a particularly complicated voice cloning dataset featured in [1] with the aim of identifying and removing audios that contribute to heterogeneity within the overall data. The impact of these reductions was assessed within a TTS framework utilising a Tacotron-2 model as the acoustic model, paired with a HiFi-GAN-based vocoder. Furthermore, the quality of the resulting audios was evaluated using MOSnet and NISQA as MOS estimator systems. Additionally, two distinct open voice cloning corpora in English and Spanish were also processed for contrasting purposes.

3. Audio Datasets

The training and evaluation framework for Voice Cloning systems was put to the test using the database XRey, primarily described in our previous work [1]. It mainly consists of audios of Spanish dictator Francisco Franco, whose voice was cloned for the podcast XRey [17]. The podcast won an Ondas award in 2020, which recognises the best Spanish professionals in the fields of radio, television, cinema and the music industry. It is available on Spotify with an added special track in which the generation of the cloned voice is explained by the authors in detail.

Additionally, a set of two freely available High Quality (HQ) datasets was chosen, each in a different language: English and Spanish. It was ensured that both corpora were publicly available and comprised a sufficient number of audio hours for training multiple Tacotron-2 TTS models on different training subsets.

3.1. XRey

The corpus for XRey is mainly composed of audios from Christmas speeches from the years 1955 to 1969, divided into three acoustically similar groups of roughly 1 h each, for a total of 3:13 hours of speech in 1 075 audio files. Even though the audios that compose this corpus are publicly available, the postprocessing and manual annotation make it a private dataset that, due to the considerations on this particular personality, will not be released to the public.

3.2. Hi-Fi TTS

The corpus chosen for the English language is the dataset Hi-Fi TTS [36]. It is composed of 10 different speakers, 6 female and 4 male, where each speaker has at least 17 hours of speech and is categorised as belonging to the subsets clean or other according to the values of SNR. From this multi-speaker corpus, the female speaker with ID 92, with a total of 27:18 hours on 35 296 audio files of clean speech was chosen as the English voice for this work.

3.3. Tux

The audio dataset chosen for the Spanish language is composed of around 100 h of a single speaker from LibriVox, the user Tux. It is divided into two different subsets: valid and other. The valid subset has been reviewed with an Automatic Speech Recognition system based on DeepSpeech and is composed of the audios whose automatic transcriptions and original text match, constituting a total of 53:47 hours of speech in 52 408 audio files. This corpus has been processed and released to the public by GitHub user carlfm01.

3.4. Data Analysis

During this stage of pre-training corpora evaluation, the acquired data must be analysed for a posterior cleaning thereof. To this end, the first required step is the forced alignment of the audio and text files. The Montreal Forced Aligner (MFA) was chosen for this purpose since both the tool and the models are publicly available. The model spanish_mfa v2.0.0a [48] was employed for aligning the corpora XRey and Tux, whereas the English dataset Hi-Fi TTS 92 was aligned using the english_mfa v2.0.0a [49] model. These two models implement a GMM-HMM architecture that uses MFCCs and pitch as acoustic features, and phones as text input. The spanish_mfa model was trained on 1 769.56 hours of audio content, while the english_mfa model used 3 686.98 hours for training. The alignment configuration had a beam width of 1 and a retry beam of 2. The pipeline discarded any audios that were not appropriately aligned during this process, resulting in a total of 51 397 audios (98.07 %) for Tux and 35 007 audios (99.18 %) for Hi-Fi TTS. This process did not discard any audios from XRey since they had already been force-aligned in the dataset preparation stage, as described in our previous work [1].

As the next step on the evaluation pipeline, using the information gathered from the forced alignments – mainly time marks of individual phones and words – different measurements were performed on the audio datasets, more specifically: phonetic frequency, SNR and uttering speed.

3.4.1. Phonetic frequency

One of the key aspects in developing a corpus for Voice Cloning applications, especially in a TTS setup, is that it should be phonetically balanced in order to contain a representative number of samples of each phonetic unit [38,41,42]. Following this approach, we measured the phonetic content of both corpora in terms of frequency of phones and diphones.

Accounting for 26 Spanish and 38 English phones, a total of 386 diphone combinations for XRey, 446 for Tux, and 1 255 for Hi-Fi TTS 92 were found. It should be noted that not all of the $26^{2} = 676$ and $38^{2} = 1\,444$ diphone combinations are phonotactically possible in Spanish and English, respectively. The distribution of diphones in both datasets is illustrated in Figure 1.

The obtained diphone distribution curves are similar to those computed in our previous work [1]. These findings confirm that, in sufficiently large text corpora, the frequencies of phones and diphones conform to the same numerical distributions.
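The phone and diphone frequencies described in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `phone_and_diphone_counts` and the toy sequences are hypothetical, and it assumes that a diphone is any pair of consecutive phones within one utterance, which the text does not state explicitly.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def phone_and_diphone_counts(
    utterances: Iterable[Sequence[str]],
) -> Tuple[Counter, Counter]:
    """Count phones and within-utterance diphones (adjacent phone pairs)."""
    phones: Counter = Counter()
    diphones: Counter = Counter()
    for seq in utterances:
        phones.update(seq)
        # zip(seq, seq[1:]) yields each pair of consecutive phones.
        diphones.update(zip(seq, seq[1:]))
    return phones, diphones


# Toy phone sequences (hypothetical, not taken from any of the corpora).
toy = [["o", "l", "a"], ["l", "a", "s"]]
phone_counts, diphone_counts = phone_and_diphone_counts(toy)

# Number of distinct diphones observed, comparable to the totals per corpus.
distinct_diphones = len(diphone_counts)
```

Sorting `diphone_counts.most_common()` by frequency would give the kind of rank-frequency curve that a diphone distribution plot displays.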

3.4.2. SNR

The next metric to be evaluated for these Voice Cloning datasets is the Signal-to-Noise Ratio (SNR). For the Hi-Fi TTS dataset, Bakhturina et al. highlighted the importance of audio quality in terms of SNR [36]. After estimating the bandwidth of the speech signal, they calculated SNR by comparing the noise power in both speech and non-speech segments using a Voice Activity Detection (VAD) module. The clean subset was composed of audios with a minimum SNR value of 40 dB, while the audios with a minimum SNR value of 32 dB fell into the other subset. For speaker XRey, however, the audio quality is relatively lower because the recordings were made in the third quarter of the 20th century. In our previous work [1], the SNR values were measured using a WADA estimation algorithm.

In this study, the forced audio alignments generated in the previous step were utilised to obtain speech and non-speech segments required for SNR calculation, instead of relying on an external VAD module or WADA based algorithms. Therefore, the SNR values obtained through this method may differ from those reported by the other works. The results can be seen in Figure 2 and Table 1.

Based on the data, it can be noted that the SNR values in the HQ corpora are generally high, indicating that these two datasets are suitable for Voice Cloning applications with respect to signal quality. The same cannot be said, however, for speaker XRey. Nevertheless, many of the audios have an SNR value higher than 20 dB, which can be considered as sufficient quality in some speech applications.
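As an illustration of how the time marks of a forced alignment can take the place of a VAD module, the following Python sketch estimates the SNR of a single utterance. The function name `snr_from_alignment`, the interval format, and the formula itself (noise power measured on non-speech samples and subtracted from the power of the speech segments) are assumptions made for this example; the exact computation used in this study is not specified in the text.

```python
import math
from typing import List, Sequence, Tuple


def mean_power(samples: Sequence[float]) -> float:
    return sum(s * s for s in samples) / len(samples)


def snr_from_alignment(
    samples: Sequence[float],
    sample_rate: int,
    speech_intervals: List[Tuple[float, float]],
) -> float:
    """Estimate SNR in dB, treating aligned (start, end) times in seconds as speech."""
    is_speech = [False] * len(samples)
    for start, end in speech_intervals:
        lo = max(0, int(round(start * sample_rate)))
        hi = min(len(samples), int(round(end * sample_rate)))
        for k in range(lo, hi):
            is_speech[k] = True
    speech = [s for s, m in zip(samples, is_speech) if m]
    noise = [s for s, m in zip(samples, is_speech) if not m]
    if not speech or not noise:
        raise ValueError("need both speech and non-speech samples")
    # Small floors avoid log(0) and division by zero on digital silence.
    p_noise = max(mean_power(noise), 1e-12)
    # Assumption: speech segments contain signal plus noise, so the noise
    # power is subtracted before taking the ratio.
    p_signal = max(mean_power(speech) - p_noise, 1e-12)
    return 10.0 * math.log10(p_signal / p_noise)


# Toy signal: one second of low-level noise followed by one second of "speech".
toy_samples = [0.1] * 10 + [1.0] * 10
toy_snr = snr_from_alignment(toy_samples, sample_rate=10, speech_intervals=[(1.0, 2.0)])
```

Because the speech and non-speech boundaries come from the aligner rather than from an energy-based detector, values obtained this way are not directly comparable with VAD-based or WADA-based estimates.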

3.4.3. Uttering speed

In an effort to measure the variability of the audios that compose the different corpora, the uttering speed has also been computed. Using the information obtained from the forced alignment, more specifically the duration of individual phones, the speed S of each utterance can be obtained by dividing the number of uttered phones by the sum of their durations, as shown in Equation 1:

$$S = \frac{n}{\sum_{i=1}^{n} \mathrm{dur}(p_{i})} \tag{1}$$

where $\mathrm{dur}(p_{i})$ is the duration of the individual phone $p_{i}$ and n is the total number of phones. S is therefore measured in phones per second. Note that the duration of silences and pauses is not included.
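The speed computation of Equation 1 can be sketched as follows. The tuple format `(label, start, end)` and the set of labels treated as silence are assumptions made for this example, since the labels depend on the aligner output and are not listed in the text.

```python
from typing import Iterable, Tuple

# Assumed labels for non-speech intervals in the alignment output.
SILENCE_LABELS = {"", "sil", "sp", "spn"}


def uttering_speed(intervals: Iterable[Tuple[str, float, float]]) -> float:
    """Phones per second: n / sum(dur(p_i)), ignoring silences and pauses."""
    durations = [end - start for label, start, end in intervals
                 if label not in SILENCE_LABELS]
    if not durations:
        raise ValueError("no phones in utterance")
    return len(durations) / sum(durations)


# Toy alignment (hypothetical): three phones and one internal pause.
toy = [("o", 0.00, 0.08), ("l", 0.08, 0.14), ("", 0.14, 0.50), ("a", 0.50, 0.60)]
speed = uttering_speed(toy)
```

In the toy example the three phones last 0.24 s in total, so the pause of 0.36 s does not lower the resulting speed of 12.5 phones per second.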

Using this metric, the variability of each corpus in terms of speed can be measured. The obtained results can be found in Figure 3 and in Table 2.

Based on the data, it can be seen that the utterance speed conforms to a normal distribution in the three corpora.

4. Voice Cloning System

As discussed in Section 2, the Voice Cloning community has introduced various systems within the Text-to-Speech and Voice Conversion frameworks. In this study, the primary system employed for training is Tacotron-2. Although there exist newer and faster architectures for Voice Cloning, Tacotron-2 is a well-established system in the TTS field, acknowledged by the community. Given that the focus of this study is not to compare different architectures but rather to evaluate the training process of a specific one, and considering that this work is an extension of [1] where Tacotron-2 was utilised, it was decided to maintain the use of this architecture in the experiments conducted for this study.

Two different approaches have been used for training the multiple models used in this work. For the corpus XRey and the subsets derived from the random partition of 3 h of the HQ datasets (refer to Section 5), a fine-tuning approach has been used. Each model started using the weights of a publicly available model trained on the LJ Speech corpus [50]. These models were trained on a single GPU with a batch size of 32 and a learning rate of $10^{-4}$ for a total of 50 000 iterations.

In the case of the whole datasets, the training setup corresponds with the original Tacotron-2 recipe [7]: they were trained from scratch on a single GPU with a batch size of 64, using a learning rate of $10^{-3}$ decaying to $10^{-5}$ after 50 000 iterations, for a total of 150 000 training steps. In both approaches, the Adam optimiser [51] was used with the following parameters: $β_{1} = 0.9$, $β_{2} = 0.999$ and $ε = 10^{-6}$.

Audios were sampled to 22 050 Hz prior to training, and due to the characteristics of the writing systems of these two languages, the text input of the Spanish corpora was left as characters but phonemes were used on the English dataset.

Regarding the vocoder, the HiFi-GAN [8] architecture was chosen, which is a GAN that employs a generator based on a feed-forward WaveNet [52] supplemented with a 12-layer 1D convolutional postnet. In this study, the UNIVERSAL_V1 model was selected as the primary vocoder. This model was trained on the LibriSpeech [43], VCTK [53] and LJ Speech [50] datasets, and was publicly released by the authors.

5. Experimental framework

In this section, the final datasets used for training the TTS models and the quality evaluation procedure are presented in detail.

5.1. Postprocessed Datasets

As stated above, one of the key aspects in training a Voice Cloning system based on a monolingual TTS approach is the homogeneity of the data. In order to reduce the variability of the gathered corpora, three main decisions were made within the data selection phase:

Removing the sentences whose SNR value is lower than 20 dB in order to ensure a sufficient signal quality. A value of 25 dB was chosen for the HQ datasets since the quality of these audios is notably higher (refer to Figure 2 and Table 1).
Removing the sentences whose utterance speed value is inside the first or last deciles. More specifically, maintaining audios with $14.17 < S < 16.88$ for XRey, $13.16 < S < 16.75$ for Tux, and $10.27 < S < 14.06$ for Hi-Fi TTS 92 (refer to Figure 3 and Table 2).
Keeping only audios with durations between 1 and 10 seconds in order to reduce variability and increase batch size during training.
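The three selection criteria above can be combined into a single filter, sketched below in Python. The `Utterance` record and the function `select_utterances` are hypothetical names. The sketch also assumes that the speed deciles are computed over the whole corpus before any other filter is applied and that the SNR threshold is inclusive; neither detail is stated in the text.

```python
import statistics
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    name: str
    snr_db: float
    speed: float      # phones per second
    duration: float   # seconds


def select_utterances(utts: List[Utterance], min_snr_db: float = 20.0,
                      min_dur: float = 1.0, max_dur: float = 10.0) -> List[Utterance]:
    """Keep utterances that pass the SNR, speed-decile and duration criteria."""
    # statistics.quantiles with n=10 returns nine cut points; the first and
    # last ones bound the central 80 % of the speed distribution.
    deciles = statistics.quantiles([u.speed for u in utts], n=10)
    low, high = deciles[0], deciles[-1]
    return [u for u in utts
            if u.snr_db >= min_snr_db
            and low < u.speed < high
            and min_dur <= u.duration <= max_dur]


# Toy corpus (hypothetical values): speeds 10..20, one noisy and one long audio.
toy = [Utterance(f"u{i}", 30.0, 10.0 + i, 5.0) for i in range(11)]
toy[5].snr_db = 10.0
toy[2].duration = 12.0
kept = select_utterances(toy)
```

For the HQ datasets the same function would be called with `min_snr_db=25.0`, following the threshold given in the first criterion.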

Due to the lack of data for corpus XRey and the particularly long utterances of its speaker, the original audios were split at pauses found inside sentences instead of discarding whole audios, in order to ensure that a significant part of the dataset was not lost.

With regard to the HQ datasets, in order to reproduce the difficult conditions in terms of quantity of data found for XRey, a random partition of 3 hours of audios was chosen from each HQ dataset, composed of 1 h of audios from the ranges of 1 to 3 seconds, 3 to 7 seconds, and 7 to 10 seconds. These datasets have also been reduced in terms of SNR and utterance speed. This process resulted in a total of 20 different collections of audios shown in Table 3.
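The construction of the random 3 h partition can be sketched as below. The function name `sample_partition`, the fixed seed, and the treatment of the range boundaries (half-open intervals, with 10 s included in the last one) are assumptions made for this example and are not taken from the text.

```python
import random
from typing import Dict, List

BUCKETS = [(1.0, 3.0), (3.0, 7.0), (7.0, 10.0)]  # duration ranges in seconds


def sample_partition(durations: Dict[str, float],
                     seconds_per_bucket: float = 3600.0,
                     seed: int = 0) -> List[str]:
    """Randomly pick audios until each duration range holds about one hour."""
    rng = random.Random(seed)
    chosen: List[str] = []
    for lo, hi in BUCKETS:
        # Assumption: ranges are half-open, except that the last one includes 10 s.
        names = sorted(n for n, d in durations.items()
                       if lo <= d < hi or (hi == 10.0 and d == hi))
        rng.shuffle(names)
        total = 0.0
        for name in names:
            if total >= seconds_per_bucket:
                break
            chosen.append(name)
            total += durations[name]
    return chosen


# Toy corpus (hypothetical): ten audios in each duration range.
toy = {f"a{i}": 2.0 for i in range(10)}
toy.update({f"b{i}": 5.0 for i in range(10)})
toy.update({f"c{i}": 8.0 for i in range(10)})
subset = sample_partition(toy, seconds_per_bucket=6.0)
```

Each range may overshoot its target by at most one audio, which is negligible when the target is one hour and the audios are at most 10 s long.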

The impact of these ablations has been measured with multiple trainings of Voice Cloning systems as explained in Section 6.

5.2. Quality measurement

The final step of the Voice Cloning evaluation framework proposed in this work is to test the quality of each of the checkpoints that have been generated during the training of the 20 models by using objective metrics. For that purpose, a set of various metrics has been gathered. These metrics can be classified into two different categories: MOS estimators and alignment metrics.

5.2.1. MOS estimators

MOS estimators typically use deep learning algorithms in order to predict the MOS score of an individual cloned signal. As an advantage, they do not require the existence of a ground truth audio. In this work, NISQA [9] and MOSnet [4] were chosen as contrasting MOS estimator models.

5.2.2. Alignment metrics

These metrics aim to obtain the number of correctly generated cloned utterances by matching the input text sequence with the resulting waveform. As an example, an estimation of the number of correct sentences can be computed by means of an automatic forced aligner such as MFA, trying to match the generated audios with the input texts with the lowest beam possible. However, a successful forced alignment does not ensure that a particular sentence has been correctly generated.

In this context, we propose a complementary alignment metric that takes advantage of the characteristics of the Voice Cloning system Tacotron-2, and specifically, its attention matrix, for computing the number of input characters that have been correctly synthesised on each sentence. Given that the weights of the attention matrix for a successfully generated audio should present a diagonal line, its presence can be easily checked by using a series of sliding rectangular windows:

Let $A = (a_{ij})$ be an attention matrix of dimensions $E \times D$, where E is the length of the input sequence and D the length of the output spectrogram. Given a width w and a height h, and starting at $x = 0$, $y = 0$, every element $a_{ij}$ such that $y < i \leq y + h$ and $x - \frac{1}{3} w < j \leq x + \frac{2}{3} w$ is checked to be higher than a given threshold value $θ$. If there exists a value $a_{ij} > θ$ inside this rectangle, then the i-th input character is considered to be correctly aligned. This process is then repeated sliding the rectangle to the position of the last correctly aligned character until its uppermost point exceeds E, its rightmost point exceeds D, or no correctly aligned character is found inside the region. The algorithmic implementation is shown in Algorithm 1 and an example of this procedure can be found in Figure 4.

Algorithm 1 Character alignment algorithm using the attention matrix of Tacotron-2

function Aligned characters( $A, w, h, θ$ )
    $x, y \leftarrow 0, 0$
    $n \leftarrow 0$
    while $y + h < E$ and $x + \frac{2}{3} w < D$ do
        $S \leftarrow \{a_{i j} : a_{i j} > θ \land y < i \leq y + h \land x - \frac{1}{3} w < j \leq x + \frac{2}{3} w\}$
        $c \leftarrow |\{i : a_{i j} \in S\}|$
        if $c = 0$ then
            break
        end if
        $n \leftarrow n + c$
        $y \leftarrow max \{i : a_{i j} \in S\}$
        $x \leftarrow max \{j : a_{i j} \in S\}$
    end while
    return n
end function

In this work, all the aligned characters are calculated with the following values: $w = 150$ spectrogram windows, $h = 8$ characters and $θ = 0.7$, as they performed the best in our previous experiments.
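For readers who prefer executable code, a direct Python transcription of Algorithm 1 is sketched below. It is an illustrative reading of the pseudocode, not the implementation used in the experiments: the function name `aligned_characters` and the list-of-lists matrix representation are assumptions, and the indices i and j are kept 1-based to mirror the description above.

```python
from typing import Sequence


def aligned_characters(A: Sequence[Sequence[float]], w: int = 150, h: int = 8,
                       theta: float = 0.7) -> int:
    """Count input characters whose attention weights follow the diagonal.

    A has E rows (input characters) and D columns (spectrogram frames).
    """
    E = len(A)
    D = len(A[0]) if E else 0
    x = y = n = 0
    while y + h < E and x + 2 * w / 3 < D:
        # Cells above the threshold inside the current rectangle (1-based i, j).
        hits = [(i, j)
                for i in range(y + 1, y + h + 1)
                for j in range(1, D + 1)
                if x - w / 3 < j <= x + 2 * w / 3 and A[i - 1][j - 1] > theta]
        rows = {i for i, _ in hits}
        if not rows:
            break
        n += len(rows)
        # Slide the rectangle to the last aligned character and frame.
        y = max(rows)
        x = max(j for _, j in hits)
    return n


# Toy attention matrix (hypothetical): 6 characters, 2 frames per character.
E_toy, D_toy = 6, 12
diag = [[1.0 if j // 2 == i else 0.0 for j in range(D_toy)] for i in range(E_toy)]
n_diag = aligned_characters(diag, w=6, h=2, theta=0.7)

# Same matrix with the alignment lost after the second character.
broken = [row if i < 2 else [0.0] * D_toy for i, row in enumerate(diag)]
n_broken = aligned_characters(broken, w=6, h=2, theta=0.7)
```

Dividing the returned count by E gives the fraction of aligned characters for one sentence. Since the loop condition is copied literally from the pseudocode, the window stops before covering the last h rows, so even a perfectly diagonal matrix does not reach a count of E in this sketch.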

6. Evaluation results and discussion

Using the postprocessed datasets explained in Section 5.1, a total of 20 voice cloning models have been trained, covering three different speakers: XRey, Tux and Hi-Fi TTS 92. For the HQ corpora, models were trained on either a random 3 h partition or the whole dataset, and every configuration was trained both with and without the audios considered more variable in terms of SNR and utterance speed.

As an evaluation set, a series of sentences not seen during training has been gathered for each language: 207 sentences in Spanish, taken from the subset other of the speaker Tux; and 221 in English, taken from substrings of the texts corresponding to audios with duration longer than 10 s of speaker Hi-Fi TTS 92. These collections of texts were synthesised every 5 000 training steps in order to compute the aforementioned evaluation metrics on the generated audios.

6.1. Evaluation of XRey

This subsection presents the results of the evaluation performed on the models trained on the corpus XRey. For that purpose, the following metrics have been used: fraction of correctly aligned sentences, fraction of correctly aligned characters, and MOS estimation by MOSnet and NISQA.

The results of these metrics can be found in Figure 5.

The results presented in Figure 5 provide clear evidence of substantial improvement across all four measured metrics following the exclusion of audios with high variability from the training set. It is worth mentioning that even in the case of the most drastic ablation, where 50 % of the training data was removed, the resulting metrics displayed noteworthy enhancement.

Regarding the fraction of aligned sentences, it can be seen that it reaches values near 100 % in early stages of training, showing that this metric is not really suitable for comparing the quality of a given Voice Cloning system. The aligner is, however, performing in the way it was conceived, that is, trying to correctly align the highest possible number of sentences.

The fraction of aligned characters, in contrast, does indeed show a rising trend while the training progresses. This metric, as it is shown in the measurements for the other training configurations, rarely reaches values near 100 %, since not every input character should have a direct effect on a particular step of the output spectrogram, particularly in the case of spaces or punctuation marks.

Finally, regarding the two selected MOS estimator models, it can be observed that MOSnet is significantly more generous than NISQA in this environment. Nevertheless, these MOS values indicate a considerable quality, especially for MOSnet, taking into account that the training audios were recorded in the third quarter of the 20th century.

6.2. Evaluation of HQ speakers trained on 3 h

This subsection presents the results of the evaluation performed on the models trained with a random 3 h partition and their corresponding ablations as described in Subsection 5.1.

The metrics regarding the two MOS estimators NISQA and MOSnet on the 8 different trainings corresponding to these subsets are displayed in Figure 6.

As can be observed from the results displayed in Figure 6, there is no significant change in estimated MOS values when comparing the models that use more data with those for which some audios have been removed. In addition, the speaker Hi-Fi TTS 92 tends to have a higher estimated MOS value than speaker Tux. Both speakers have an estimated MOS value between 3 and 3.5 for NISQA and between 2.3 and 3.2 for MOSnet, which indicates that NISQA is more generous in terms of MOS estimation than MOSnet for these two speakers, the opposite of what happened with speaker XRey. In any case, the main conclusion of this graph is that having more variable audios in the training dataset does not affect the quality of the models significantly, and therefore adding a set of more variable audios to the training set does not necessarily guarantee a better final result.

In terms of sentence and character alignments, the results of the models trained on the random 3 h partitions are portrayed in Figure 7.

Similarly to the speaker XRey, the percentage of aligned sentences using MFA approaches 100 % in the early stages of training, which confirms that this metric is not the most suitable for this particular task. Regarding character alignments, however, an improving evolution can be observed as the training progresses. Even though removing audios with lower SNR has a noticeable impact on the number of characters aligned for Tux, having fewer audios with more variability in terms of utterance speed does not influence the number of aligned characters significantly.

It is quite noticeable, nonetheless, that for both speakers removing audios with lower SNR and with more variable utterance speed has a better impact on the training than removing only audios with lower SNR. This could be due to the fact that these two datasets have relatively low levels of noise, and therefore the audios whose SNR value is lower than 25 dB do not necessarily increase the variability in the data, in contrast with the utterance speed, so that removing them has an impact similar to simply having a lower quantity of audios.

6.3. Evaluation of HQ speakers trained on the whole corpora

Subsequently, the evaluation results derived from the 8 models that were trained using both the complete corpora and their respective ablations will be presented, as detailed in Subsection 5.1.

The metrics regarding the two MOS estimators NISQA and MOSnet on the 8 different trainings corresponding to the corpora derived from the whole datasets are displayed in Figure 8.

One of the key findings derived from the MOS estimations obtained through NISQA and MOSnet is that the trends and values observed in the advanced stages of training are remarkably similar between models trained with random 3-hour partitions and those trained with the entire corpus. Notably, speaker Tux achieves values close to 3.0 for NISQA and 2.5 for MOSnet, while Hi-Fi TTS 92 attains higher values of 3.3 for NISQA and 3.1 for MOSnet. These experiments suggest that training on a 3-hour corpus from a pre-trained model can achieve quality similar to training from scratch on a more voluminous corpus, at least based on these MOS estimators.

Finally, the information regarding the fraction of aligned characters and sentences is presented in Figure 9.

Similarly to the previous examples, once the fraction of aligned sentences approaches the maximum value of 1 in later stages of training, this particular metric provides no substantial information regarding the quality of the synthesised audios.

Concerning the fraction of aligned characters, however, two contrary trends can be observed for the two speakers. In the case of Tux, the performed ablations have had a positive impact on the fraction of aligned input characters, even in earlier stages of training. For the speaker Hi-Fi TTS 92, however, the opposite holds true, since the removal of data that was considered of higher variability has negatively affected this particular metric. This discrepancy may arise from the notable disparity in the training data between the two corpora, since speaker Tux possesses a considerably larger number of audio hours compared to Hi-Fi TTS 92. Consequently, it is more feasible for Tux than for Hi-Fi TTS 92 to exclude more challenging audios from the training process without compromising the quality of the final model, at least as far as this metric is concerned.

7. Conclusions

In this work, we extended the quality evaluation on a voice cloning task from a difficult corpus from our previous project [1], and compared the results on a set of two higher quality voice cloning corpora in both English and Spanish.

We first evaluated the quality of each corpus in terms of phonetic coverage, SNR and utterance speed. We performed an exhaustive evaluation of these features in order to detect which partitions of audios could introduce higher variability and therefore be detrimental to the quality of a Voice Cloning model.

Using that data, a set of ablations was performed on the original datasets by removing audios that were considered of lower quality (in terms of low SNR) or higher variability (concerning utterance speed) from the training partitions. Audios with an SNR lower than 20 dB for the more difficult speaker and 25 dB for the higher quality datasets were removed. Similarly, audios whose utterance speed was in the first and last deciles were also withdrawn from the data. In addition, since the sizes of the three corpora are not fully comparable, these same ablations were applied to a randomly chosen 3 h subset of the two corpora featuring higher quality. In this regard, a total of 20 models were trained in order to check the impact of said audio removals.
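
The removal criteria summarised above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the experiments: the function name `ablation_mask`, the inclusive decile boundaries, and the choice of computing deciles on the full, unfiltered set are all hypothetical.

```python
from statistics import quantiles
from typing import List, Optional, Sequence


def ablation_mask(
    snr_db: Sequence[float],
    speed: Sequence[float],
    snr_threshold_db: Optional[float] = None,
    trim_speed_deciles: bool = False,
) -> List[bool]:
    """Return True for every utterance kept after the selected removals."""
    if len(snr_db) != len(speed):
        raise ValueError("snr_db and speed must have the same length")
    keep = [True] * len(snr_db)
    if snr_threshold_db is not None:
        # Drop utterances below the SNR threshold (20 dB or 25 dB in the text).
        keep = [k and s >= snr_threshold_db for k, s in zip(keep, snr_db)]
    if trim_speed_deciles:
        # Drop utterances in the first and last deciles of utterance speed.
        # Assumption: deciles are computed on the full, unfiltered set.
        cuts = quantiles(speed, n=10, method="inclusive")
        low, high = cuts[0], cuts[-1]
        keep = [k and low <= v <= high for k, v in zip(keep, speed)]
    return keep


# Toy data: 11 utterances, one of them noisy, with evenly spread speeds.
toy_snr = [30.0] * 11
toy_snr[5] = 15.0
toy_speed = [float(i) for i in range(11)]
kept = ablation_mask(toy_snr, toy_speed, snr_threshold_db=20.0,
                     trim_speed_deciles=True)
print(sum(kept))  # 8
```

Passing only `snr_threshold_db`, only `trim_speed_deciles`, or both corresponds to the three ablation variants applied to each corpus.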

In order to automatically check the quality of the trained models, we gathered a set of 4 different measurements. First, two different MOS estimators based on deep learning techniques were employed: NISQA [9] and MOSnet [4]. Additionally, we introduced a novel algorithm in order to complement the forced alignment of sentences. This approach takes advantage of the diagonal present in a successful synthesis in the attention matrix from the Voice Cloning system Tacotron-2 in order to count the number of correctly aligned input characters.

With the aid of these measurements, we showed that removing audios that were considered noisier or that featured a more variable utterance speed from the more difficult dataset improved the overall quality of the final models when starting from a pre-trained model, even though half of the audios were withdrawn from training in the harshest ablation. Moreover, the estimated MOS increased around 0.2 points for both algorithms. Regarding the two datasets that featured a higher quality, however, this trend does not apply. Since the level of noise present in these two corpora is comparatively low, removing only audios with lower SNR values has not had a positive impact on the training of the models. Nevertheless, in terms of the fraction of aligned characters, the quality of the models with less data but more homogeneous utterance speed can be considered equal to that of models trained with a larger amount of audio when using a pre-training approach. Finally, when training from scratch using the whole datasets, the impact of removing more variable audios is seen to be negative for the speaker with fewer hours in terms of the fraction of aligned characters. Regarding the estimated MOS, however, the ablations did not show any significant improvement or deterioration of the MOS values estimated by NISQA and MOSnet, neither in the pre-trained framework with 3 h of audio nor when training from scratch with the whole dataset.

In our future research, we will complement these automatic evaluations with a subjective evaluation based on real MOS using human evaluators. Moreover, we will focus on the acquisition of novel metrics that leverage prosodic information to automatically identify audios exhibiting higher variability within a specific Voice Cloning dataset, accompanied by corresponding ablations for assessing the impact of their heterogeneity in the final quality. Finally, these evaluations will be performed on different Voice Conversion architectures in order to check whether these results can be extrapolated to multiple TTS frameworks.

Author Contributions

Conceptualisation, A.GD. and A.A.; methodology, A.GD. and A.A.; software, A.GD.; validation, A.GD. and A.A.; formal analysis, A.GD. and A.A.; investigation, A.GD. and A.A.; resources, A.GD. and A.A.; data curation, A.GD.; writing—original draft preparation, A.GD.; writing—review and editing, A.GD. and A.A.; visualisation, A.GD. and A.A.; supervision, A.A.; project administration, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The two public corpora presented in this study are openly available at https://github.com/carlfm01/my-speech-datasets and https://www.openslr.org/109/ [36]. Due to the particularities of the personality featured in the corpus XRey, its data will not be shared.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CNN	Convolutional Neural Network
GAN	Generative Adversarial Network
GMM	Gaussian Mixture Model
HMM	Hidden Markov Model
HQ	High Quality, refers to speakers Tux and Hi-Fi TTS
MFCC	Mel Frequency Cepstral Coefficient
MOS	Mean Opinion Score
MFA	Montreal Forced Aligner
NISQA	Non-Intrusive Speech Quality Assessment
SNR	Signal-to-Noise Ratio
TTS	Text-to-Speech
VAD	Voice Activity Detection

References

1. González-Docasal, A.; Álvarez, A.; Arzelus, H. Exploring the limits of neural voice cloning: A case study on two well-known personalities. In Proceedings of IberSPEECH 2022, 2022; pp. 11–15.
2. Dale, R. The voice synthesis business: 2022 update. Natural Language Engineering 2022, 28, 401–408.
3. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In Proceedings of the 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016; p. 125.
4. Lo, C.C.; Fu, S.W.; Huang, W.C.; Wang, X.; Yamagishi, J.; Tsao, Y.; Wang, H.M. MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion. In Proceedings of Interspeech 2019, 2019; pp. 1541–1545.
5. Cooper, E.; Huang, W.C.; Toda, T.; Yamagishi, J. Generalization Ability of MOS Prediction Networks, 2022. arXiv:2110.02635 [eess].
6. Cooper, E.; Yamagishi, J. How do Voices from Past Speech Synthesis Challenges Compare Today? In Proceedings of the 11th ISCA Speech Synthesis Workshop (SSW 11), 2021; pp. 183–188.
7. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018; pp. 4779–4783.
8. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Proceedings of the Advances in Neural Information Processing Systems. Curran Associates, Inc., 2020, Vol. 33; pp. 17022–17033.
9. Mittag, G.; Naderi, B.; Chehadi, A.; Möller, S. NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. In Proceedings of the Interspeech 2021. ISCA, 2021; pp. 2127–2131.
10. Veerasamy, N.; Pieterse, H. Rising Above Misinformation and Deepfakes. In Proceedings of the International Conference on Cyber Warfare and Security, 2022, Vol. 17; pp. 340–348.
11. Pataranutaporn, P.; Danry, V.; Leong, J.; Punpongsanon, P.; Novy, D.; Maes, P.; Sra, M. AI-generated characters for supporting personalized learning and well-being. Nature Machine Intelligence 2021, 3, 1013–1022.
12. Salvador Dalí Museum. Dalí Lives (via Artificial Intelligence). https://thedali.org/press-room/dali-lives-museum-brings-artists-back-to-life-with-ai/. Accessed: 2023-05-16.
13. Aholab, University of the Basque Country. AhoMyTTS. https://aholab.ehu.eus/ahomytts/. Accessed: 2023-05-16.
14. Doshi, R.; Chen, Y.; Jiang, L.; Zhang, X.; Biadsy, F.; Ramabhadran, B.; Chu, F.; Rosenberg, A.; Moreno, P.J. Extending Parrotron: An end-to-end, speech conversion and speech recognition model for atypical speech. In Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021; pp. 6988–6992.
15. Google Research. Project Euphonia. https://sites.research.google/euphonia/about/. Accessed: 2023-05-16.
16. Jia, Y.; Cattiau, J. Recreating Natural Voices for People with Speech Impairments. https://ai.googleblog.com/2021/08/recreating-natural-voices-for-people.html/, 2021. Accessed: 2023-05-16.
17. The Story Lab. XRey. https://open.spotify.com/show/43tAQjl2IVMzGoX3TcmQyL/. Accessed: 2023-05-16.
18. Chou, J.C.; Lee, H.Y. One-Shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. In Proceedings of the Interspeech 2019. ISCA, 2019; pp. 664–668.
19. Qian, K.; Zhang, Y.; Chang, S.; Yang, X.; Hasegawa-Johnson, M. AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019; pp. 5210–5219, ISSN 2640-3498.
20. Qian, K.; Zhang, Y.; Chang, S.; Xiong, J.; Gan, C.; Cox, D.; Hasegawa-Johnson, M. Global Prosody Style Transfer Without Text Transcriptions. In Proceedings of the 38th International Conference on Machine Learning; Meila, M.; Zhang, T., Eds. PMLR, 2021, Vol. 139, Proceedings of Machine Learning Research, pp. 8650–8660.
21. Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO); IEEE: Rome, 2018; pp. 2100–2104.
22. Zhou, K.; Sisman, B.; Li, H. Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data. In Proceedings of Odyssey 2020 The Speaker and Language Recognition Workshop, 2020; pp. 230–237.
23. Zhou, K.; Sisman, B.; Li, H. Vaw-Gan For Disentanglement And Recomposition Of Emotional Elements In Speech. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), 2021; pp. 415–422.
24. van Niekerk, B.; Carbonneau, M.A.; Zaïdi, J.; Baas, M.; Seuté, H.; Kamper, H. A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022; pp. 6562–6566.
25. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech and Language Processing 2021, 29, 3451–3460.
26. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.Y. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, 2022, [arXiv:eess.AS/2006.04558].
27. Kim, J.; Kim, S.; Kong, J.; Yoon, S. Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc., 2020; Vol. 33, pp. 8067–8077.
28. Casanova, E.; Shulby, C.; Gölge, E.; Müller, N.M.; de Oliveira, F.S.; Candido Jr., A.; da Silva Soares, A.; Aluisio, S.M.; Ponti, M.A. SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. In Proceedings of Interspeech 2021, 2021; pp. 3645–3649.
29. Mehta, S.; Kirkland, A.; Lameris, H.; Beskow, J.; Székely, É.; Henter, G.E. OverFlow: Putting flows on top of neural transducers for better TTS, 2022, [arXiv:eess.AS/2211.06892].
30. Kumar, K.; Kumar, R.; de Boissiere, T.; Gestin, L.; Teoh, W.Z.; Sotelo, J.; de Brébisson, A.; Bengio, Y.; Courville, A.C. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.d., Fox, E., Garnett, R., Eds.; Curran Associates, Inc., 2019; Vol. 32.
31. Lee, S.g.; Ping, W.; Ginsburg, B.; Catanzaro, B.; Yoon, S. BigVGAN: A Universal Neural Vocoder with Large-Scale Training, 2023, [arXiv:cs.SD/2206.04658].
32. Bak, T.; Lee, J.; Bae, H.; Yang, J.; Bae, J.S.; Joo, Y.S. Avocodo: Generative Adversarial Network for Artifact-free Vocoder, 2023, [arXiv:eess.AS/2206.13404].
33. Kim, J.; Kong, J.; Son, J. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech, 2021, [arXiv:cs.SD/2106.06103].
34. Casanova, E.; Weber, J.; Shulby, C.; Junior, A.C.; Gölge, E.; Ponti, M.A. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone, 2023, [arXiv:cs.SD/2112.02418].
35. Wang, C.; Chen, S.; Wu, Y.; Zhang, Z.; Zhou, L.; Liu, S.; Chen, Z.; Liu, Y.; Wang, H.; Li, J.; et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, 2023. arXiv:2301.02111 [cs, eess].
36. Bakhturina, E.; Lavrukhin, V.; Ginsburg, B.; Zhang, Y. Hi-Fi Multi-Speaker English TTS Dataset, 2021.
37. Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R.J.; Jia, Y.; Chen, Z.; Wu, Y. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proceedings of the Interspeech 2019. ISCA, 2019; pp. 1526–1530.
38. Torres, H.M.; Gurlekian, J.A.; Evin, D.A.; Cossio Mercado, C.G. Emilia: a speech corpus for Argentine Spanish text to speech synthesis. Language Resources and Evaluation 2019, 53, 419–447.
39. Gabdrakhmanov, L.; Garaev, R.; Razinkov, E. RUSLAN: Russian Spoken Language Corpus for Speech Synthesis. In Proceedings of the Speech and Computer; Salah, A.A.; Karpov, A.; Potapova, R., Eds.; Springer International Publishing: Cham, 2019. Lecture Notes in Computer Science. pp. 113–121.
40. Srivastava, N.; Mukhopadhyay, R.; K R, P.; Jawahar, C.V. IndicSpeech: Text-to-Speech Corpus for Indian Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference; European Language Resources Association: Marseille, France, 2020; pp. 6417–6422.
41. Ahmad, A.; Selim, M.R.; Iqbal, M.Z.; Rahman, M.S. SUST TTS Corpus: A phonetically-balanced corpus for Bangla text-to-speech synthesis. Acoustical Science and Technology 2021, 42, 326–332.
42. Casanova, E.; Junior, A.C.; Shulby, C.; Oliveira, F.S.d.; Teixeira, J.P.; Ponti, M.A.; Aluísio, S. TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. Language Resources and Evaluation 2022, 56, 1043–1055.
43. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015; pp. 5206–5210.
44. Zandie, R.; Mahoor, M.H.; Madsen, J.; Emamian, E.S. RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis. In Proceedings of the Interspeech 2021. ISCA, 2021; pp. 2751–2755.
45. Torcoli, M.; Kastner, T.; Herre, J. Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of Their Application Domain Dependence. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021, 29, 1530–1541.
46. Mittag, G.; Möller, S. Deep Learning Based Assessment of Synthetic Speech Naturalness. In Proceedings of the Interspeech 2020. ISCA, 2020; pp. 1748–1752.
47. Huang, W.C.; Cooper, E.; Tsao, Y.; Wang, H.M.; Toda, T.; Yamagishi, J. The VoiceMOS Challenge 2022. In Proceedings of the Interspeech 2022. ISCA, 2022; pp. 4536–4540.
48. McAuliffe, M.; Sonderegger, M. Spanish MFA acoustic model v2.0.0a. Technical report, https://mfa-models.readthedocs.io/acoustic/Spanish/SpanishMFAacousticmodelv200a.html, 2022.
49. McAuliffe, M.; Sonderegger, M. English MFA acoustic model v2.0.0a. Technical report, https://mfa-models.readthedocs.io/acoustic/English/EnglishMFAacousticmodelv200a.html, 2022.
50. Ito, K.; Johnson, L. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
51. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the ICLR (Poster); Bengio, Y.; LeCun, Y., Eds., 2015.
52. Rethage, D.; Pons, J.; Serra, X. A Wavenet for Speech Denoising. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018; pp. 5069–5073.
53. Yamagishi, J. English multi-speaker corpus for CSTR voice cloning toolkit. https://datashare.ed.ac.uk/handle/10283/3443/, 2012.

1	https://montreal-forced-aligner.readthedocs.io
2	https://open.spotify.com/episode/0Vkoa3ysS998PXkKNuh9m2
3	https://librivox.org/reader/3946
4	https://github.com/carlfm01/my-speech-datasets
5	https://github.com/NVIDIA/tacotron2
6	https://github.com/jik876/hifi-gan
7	https://github.com/gabrielmittag/NISQA
8	https://github.com/lochenchou/MOSNet

Figure 1. Absolute frequencies of diphones for the speaker XRey (green), the valid subset of the Tux corpus (orange), and the speaker 92 of Hi-Fi TTS (blue). x axis is normalised to the maximum number of diphones in each language (386, 446 and 1255 respectively).

Figure 2. Density of SNR values calculated by using the information of the forced alignments of XRey (green), subset valid of Tux dataset (blue), and the speaker 92 from Hi-Fi TTS (orange).

Figure 3. Density of Utterance Speed values of the speaker XRey (green), the subset valid of Tux (orange), and speaker 92 from Hi-Fi TTS (blue).

Figure 4. An example of an attention matrix of a decoding in Tacotron-2 (left) and once highlighting the regions processed by the character alignment algorithm (right).

Figure 5. Fraction of aligned characters (above left) and sentences (above right), and estimated MOS values obtained from NISQA (below left) and MOSnet (below right) for generated audios on each iteration for the speaker XRey.

Figure 6. Estimated MOS values obtained from NISQA (above) and MOSnet (below) for generated audios on each iteration for the HQ speakers Tux (left) and Hi-Fi TTS 92 (right) on the random 3 h partitions.

Figure 7. Fraction of aligned sentences (above) and characters (below) for generated audios on each iteration for the HQ speakers Tux (left) and Hi-Fi TTS 92 (right) on the random 3 h partitions.

Figure 8. Estimated MOS values obtained from NISQA (above) and MOSnet (below) for generated audios on each iteration for the HQ speakers Tux (left) and Hi-Fi TTS 92 (right) trained on the whole corpora.

Figure 9. Fraction of aligned sentences (above) and characters (below) for generated audios on each iteration for the HQ speakers Tux (left) and Hi-Fi TTS 92 (right) trained on the whole corpora.

Table 1. Minimum, maximum, average, median and standard deviation values of SNR (in dB) calculated on XRey, on the valid subset of Tux corpus, and on speaker 92 of Hi-Fi TTS corpus.

Speaker	Min.	Max.	Mean	Median	Stdev
XRey	0.72	32.81	21.15	21.27	3.84
Tux valid	-20.34	97.90	36.48	37.36	8.75
Hi-Fi TTS 92	-17.97	73.86	40.57	39.48	9.62
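
For readers unfamiliar with alignment-based SNR estimation, the sketch below shows one common way to obtain a value in dB once the forced alignment has marked which samples are speech: the mean power of the speech samples is divided by the mean power of the remaining samples. It is only an illustrative approximation and not necessarily the procedure behind Table 1; the function name `snr_db`, the sample-index interval format, and the choice not to subtract the noise power from the speech power are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def snr_db(samples: Sequence[float], speech: List[Tuple[int, int]]) -> float:
    """SNR in dB given [start, end) sample intervals labelled as speech."""
    is_speech = [False] * len(samples)
    for start, end in speech:
        for i in range(max(start, 0), min(end, len(samples))):
            is_speech[i] = True
    speech_sq = [x * x for x, s in zip(samples, is_speech) if s]
    noise_sq = [x * x for x, s in zip(samples, is_speech) if not s]
    if not speech_sq or not noise_sq:
        raise ValueError("need both speech and non-speech samples")
    # Mean power inside and outside the aligned speech regions.
    p_speech = sum(speech_sq) / len(speech_sq)
    p_noise = sum(noise_sq) / len(noise_sq)
    if p_noise == 0.0:
        return math.inf
    if p_speech == 0.0:
        return -math.inf
    return 10.0 * math.log10(p_speech / p_noise)


# Toy signal: four low-level "noise" samples followed by four "speech" samples.
toy_signal = [0.1] * 4 + [1.0] * 4
print(round(snr_db(toy_signal, [(4, 8)]), 2))  # 20.0
```

Under this definition an utterance whose aligned speech is quieter than its pauses yields a negative value, which is consistent with the negative minima reported for the two HQ corpora.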

Table 2. Minimum, maximum, average, median and standard deviation values of Utterance Speed calculated on XRey, on the valid subset of Tux corpus, and on speaker 92 of Hi-Fi TTS corpus.

Speaker	Min.	Max.	Mean	Median	Stdev
XRey	11.68	20.11	15.46	15.42	1.09
Tux valid	6.25	22.22	12.20	12.23	1.49
Hi-Fi TTS 92	7.46	26.32	15.03	15.14	1.46

Table 3. Comparison of the number of files and hours of XRey, Tux and speaker 92 of Hi-Fi TTS after removing audios considered to increase variability. Superscripts ¹, ² and ³ correspond to the changes proposed in Subsection 5.1. Note that the files of speaker XRey were divided in order to obtain shorter audios.

Speaker	All		¹ High SNR		² Utt. speed		¹,² SNR & speed
	Files	Hours	Files	Hours	Files	Hours	Files	Hours
XRey	1 075	3:13	588	1:49	792	2:25	493	1:33
³ $1 < t < 10$	2 398	2:46	1 451	1:35	1 978	2:08	1 249	1:23
Tux valid	52 398	53:46	44 549	47:21	38 395	44:16	33 889	40:33
³ $1 < t < 10$	46 846	45:08	40 111	39:43	35 503	36:46	31 345	33:31
3 h partition	3 092	3:00	2 649	2:39	2 326	2:27	2 061	2:15
Hi-Fi TTS 92	35 296	27:18	31 634	25:02	25 975	21:40	23 996	20:16
³ $1 < t < 10$	33 589	26:34	30 374	24:27	25 838	21:19	23 889	19:58
3 h partition	3 131	3:00	2 835	2:45	2 486	2:31	2 301	2:21
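
As a quick check of how much audio each ablation leaves, the durations in Table 3 can be converted to minutes and compared. The short sketch below does this for the XRey row; the helper names `to_minutes` and `retained_fraction` are hypothetical, and the durations are read as hours:minutes as in the table.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'H:MM' duration from Table 3 into minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def retained_fraction(full: str, ablated: str) -> float:
    """Share of the original duration that remains after an ablation."""
    return to_minutes(ablated) / to_minutes(full)


# XRey: all audio (3:13) versus the combined SNR & speed ablation (1:33).
print(f"{retained_fraction('3:13', '1:33'):.1%}")  # 48.2%
```

The combined ablation therefore keeps slightly less than half of the XRey audio, in line with the statement in Subsection 6.1 that about 50 % of the training data was removed in the most drastic case.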

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.