Speech emotion recognition using data augmentation method by cycle-generative adversarial networks

One of the obstacles in developing speech emotion recognition (SER) systems is the data scarcity problem, i.e., the lack of labeled data for training these systems. Data augmentation is an effective method for increasing the amount of training data. In this paper, we propose a cycle-generative adversarial network (cycle-GAN) for data augmentation in SER systems. For each of the five emotions considered, an adversarial network is designed to generate data whose distribution is similar to that of the real data in its class but different from those of the other classes. These networks are trained adversarially to produce feature vectors similar to those in the training set, which are then added to the original training sets. Instead of the common cross-entropy loss, we train the cycle-GANs with the Wasserstein divergence to mitigate the gradient vanishing problem and to generate high-quality samples. The proposed network was applied to SER on the EMO-DB dataset, and the quality of the generated data was evaluated with two classifiers based on a support vector machine and a deep neural network. The recognition performance reached an unweighted average recall of about 83.33%, better than that of the compared baseline methods.


Introduction
The data scarcity problem is one of the critical challenges in developing speech emotion recognition (SER) systems. This problem can be examined from three aspects. The first aspect is related to dramatized emotions (generated by actors), used to avoid legal and ethical issues [1]. The second aspect is the mislabeling of the speakers' emotions, and the third is the imbalance in the number of samples available for each class. Training an emotion classifier therefore requires more labeled data than is typically available. Some standard data augmentation techniques used for images, such as translation and rotation [2], may not be applicable to text or speech. Synonymous substitution [3], which is mainly used for text processing, is not appropriate for emotion classification and recognition from speech. Similarly, traditional data augmentation methods for speech, such as changes in voice and speed [4], are inappropriate for images or text. In contrast, data augmentation based on generative adversarial networks (GANs) [5] focuses on learning and simulating the real data distribution and is independent of the task. Therefore, GAN-based data augmentation is our focus in this paper.
Recently, end-to-end methods have been used for speech emotion recognition [6,7], where the inputs to the system are feature vectors and the outputs are class labels. In [8], the features are extracted by convolution filters. With the development of DNNs in SER, various data augmentation methods have been proposed [9,10]. Transfer learning can also be used to address the data sparsity problem [11], e.g., in image and speech processing [12]. Deng et al. proposed a transfer learning method that transfers knowledge learned from source-domain data to the target domain [9].
One of the effective methods to augment data is GANs, introduced by Goodfellow et al. [5], which were shown to improve image recognition performance [13]. Zhang et al. used GANs to produce high-dimensional data and showed that GAN-based data augmentation performs better than typical augmentation techniques such as time warping, frequency masking, and time masking [14]. Hybrid augmentation policies include four different combinations: LibriSpeech basic (LB), LibriSpeech double (LD), Switchboard mild (SM), and Switchboard strong (SS) [15].
Cycle adversarial data augmentation networks use the Jensen-Shannon divergence as their divergence criterion. According to [16], if two data distributions overlap little or not at all, the Jensen-Shannon divergence tends to a constant, which can lead to a gradient vanishing problem. The method proposed in this study addresses this problem. In training, the source and target data distributions overlap significantly, which makes it difficult for the discriminator to distinguish between the two groups of vectors. As a result, the discriminator's cross-entropy error increases, and the generator then receives an uninformative gradient. Moreover, with adversarial data augmentation networks, other divergences such as the Wasserstein divergence can easily be used for gradient descent. Compared with the Jensen-Shannon divergence, the Wasserstein divergence can measure the distance between two data distributions even if they do not overlap. The hidden space generated by adversarial data augmentation networks also makes emotional information easy to learn, due to the low dimensionality of the vectors in the training data. In addition, empirical studies [17,18] have shown that models with the Wasserstein divergence outperform those with other divergences, such as the Jensen-Shannon divergence and maximum mean discrepancy. Therefore, Wasserstein divergence-based adversarial data augmentation may offer improved performance in emotion recognition.
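To make the contrast with the Jensen-Shannon divergence concrete, the following illustrative sketch (ours, not part of the original method) computes the Wasserstein-1 distance between two 1-D empirical distributions with equal sample counts; in this case optimal transport matches order statistics, so the distance reduces to the mean absolute difference of the sorted samples.

```python
import numpy as np

def wasserstein1_1d(x, y):
    # For 1-D empirical distributions with the same number of samples,
    # optimal transport pairs order statistics, so W1 is the mean
    # absolute difference between the sorted samples.
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    return float(np.mean(np.abs(x - y)))
```

Unlike the Jensen-Shannon divergence, which saturates at log 2 for disjoint supports, this distance keeps growing with the gap between the two clusters, so the gradient it induces stays informative.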
In this paper, we present a cycle-GAN for data augmentation and test it on SER with two classifier networks. The cycle-GAN generates samples similar to the actual data, thereby augmenting the dataset with additional samples for emotion classification. In addition, we study the effectiveness of the GANs and replace the standard cross-entropy error with the Wasserstein divergence when training the GAN, to improve classification performance. We evaluate the method on the EMO-DB database. The results show that the proposed data augmentation technique improves SER performance on EMO-DB and that the cycle-GAN with the Wasserstein divergence outperforms the cycle-GAN with the conventional cross-entropy loss. We show that the synthetic samples generated by an ordinary cycle-GAN cover only part of the actual data, while the clusters created by the cycle-GAN with the Wasserstein distance (artificial samples generated by our method) completely cover the feature space of each of the five emotion classes. Section 2 reviews existing methods for the data scarcity problem. Section 3 describes the proposed network design and provides theoretical analysis. Section 4 presents experimental details, including the dataset, features, experimental setup, and evaluation protocols. Section 5 analyzes the experimental results. Finally, Sect. 6 concludes the paper.

Related work
To address the data scarcity problem, we can use data augmentation methods that expand the dataset by generating new samples, using techniques such as adding noise to the data [19], pitch shifting and time-stretching of the audio signal, varying the loudness of the speech signal, applying random frequency filters, and interpolating between samples in input space. However, these methods alter the data and may introduce artefacts, e.g., through rotation, added noise, speech echoing, and signal clipping [20]. Advanced data augmentation methods are based on GANs and their variants, such as conditional-GANs and cycle-GANs. Hu et al. used a deep convolutional neural network to produce extra features for training acoustic models and showed that data augmentation can improve the performance of speech recognition systems [21]. Sahu et al. synthesized feature vectors with adversarial autoencoders, using Gaussian mixture noise in the generator network [22]. Sahu et al. also developed a conditional-GAN-based model to generate artificial feature vectors [10]. Several techniques were used to train the conditional-GANs, including initializing the generator with the detector weights and using an adversarial autoencoder.
One fundamental issue in training GANs is that the generator and the discriminator are trained in parallel. Dynamic alternating training [14] can be used so that the numbers of training steps for the generator and the discriminator do not have to match; the ultimate goal is not the number of training epochs but the amount of training each network receives.
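A schematic sketch of such an alternating schedule (illustrative only; the 2:1 generator-to-discriminator ratio shown as the default is the one used later in our experiments):

```python
def alternating_schedule(epochs, g_steps=2, d_steps=1):
    # Dynamic alternating training sketch: per epoch the generator is
    # updated g_steps times and the discriminator d_steps times; the
    # two counts need not match.
    schedule = []
    for _ in range(epochs):
        schedule += ["G"] * g_steps + ["D"] * d_steps
    return schedule
```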

Generative adversarial networks
As mentioned before, GANs consist of two deep neural networks. The generator network produces synthetic data, and the discriminator network distinguishes the real data from the synthetic data. The loss function of GANs can be expressed as follows [23]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],$$

where $D$ is the discriminator, $G$ is the generator, $z$ is noise, $p_{data}(x)$ is the original data distribution, and $p_z(z)$ is the input noise distribution. In practice, according to [24], we train $G$ to maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$. This objective mitigates the vanishing gradient problem without changing the equilibrium point of $G$ and $D$. Figure 1 shows the network architecture designed. The entire process of training a GAN is shown in Algorithm 1.
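The minibatch estimates of these objectives can be sketched as follows (an illustrative NumPy sketch, not the actual training code; `d_real` and `d_fake` stand for the discriminator's outputs on real and generated samples):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # Minibatch estimate of -[log D(x) + log(1 - D(G(z)))];
    # the discriminator minimizes this (i.e., maximizes V).
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

def generator_loss_saturating(d_fake):
    # Original objective: minimize log(1 - D(G(z))); its gradient
    # vanishes when D confidently rejects the fakes (d_fake near 0).
    return float(np.mean(np.log(1.0 - d_fake)))

def generator_loss_nonsaturating(d_fake):
    # Practical alternative: maximize log D(G(z)), i.e., minimize
    # -log D(G(z)); the gradient stays informative when d_fake is small.
    return float(-np.mean(np.log(d_fake)))
```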

Algorithm 1 Training a GAN with the non-saturating gradient method

for each training epoch do
    while the stopping criterion is not met do
        for k steps do
            Sample a minibatch of m noise vectors z^(1), z^(2), ..., z^(m) from p_z(z).
            Sample a minibatch of m data points x^(1), x^(2), ..., x^(m) from p_data(x).
            Calculate the loss of the discriminator network and update the discriminator weights.
        end for
        Sample a minibatch of m noise vectors from p_z(z).
        Update the generator weights with the gradient descent method.
    end while
end for

Cycle-generative adversarial networks
Cycle-GANs have been used for image generation with non-paired data [25]. Figure 2 shows the architecture of the cycle-GAN for data augmentation [26]. This network includes two transfer functions, F and G, where G learns to map samples from the source domain S to the target domain T and F is the inverse of G. Both G and F may be regarded as generators producing target and source data, respectively. Moreover, there are two adversarial discriminator networks, D_T and D_S: D_T discriminates real targets from synthetic targets, while D_S discriminates real sources from synthetic sources. The network is trained so that F(G(S)) ≈ S and G(F(T)) ≈ T; hence it is called a cycle-GAN [27].
The loss used in cycle-GANs includes the adversarial loss and the cycle consistency loss. The adversarial loss decomposes into a part for target data generation and a part for source data generation. The loss function for target data generation is [27]

$$\mathcal{L}_{GAN}(G, D_T, S, T) = \mathbb{E}_{t \sim p(T)}[\log D_T(t)] + \mathbb{E}_{s \sim p(S)}[\log(1 - D_T(G(s)))],$$

and the loss for source data generation, $\mathcal{L}_{GAN}(F, D_S, T, S)$, is defined symmetrically. The cycle consistency loss is

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{s \sim p(S)}[\|F(G(s)) - s\|_1] + \mathbb{E}_{t \sim p(T)}[\|G(F(t)) - t\|_1],$$

where the $L_1$ norm may be substituted with other criteria. The total loss for cycle-GANs is

$$\mathcal{L}(G, F, D_S, D_T) = \mathcal{L}_{GAN}(G, D_T, S, T) + \mathcal{L}_{GAN}(F, D_S, T, S) + \lambda\, \mathcal{L}_{cyc}(G, F),$$

where $\lambda$ controls the relative importance of the two losses, and the generation objective is $\min_{G,F} \max_{D_S,D_T} \mathcal{L}(G, F, D_S, D_T)$ [27].
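A minimal numerical sketch of these loss estimates on minibatches of feature vectors (illustrative only; the adversarial terms are passed in precomputed, and the reconstruction arrays stand for F(G(s)) and G(F(t))):

```python
import numpy as np

def cycle_consistency_loss(s, f_g_s, t, g_f_t):
    # L_cyc(G, F) = E[||F(G(s)) - s||_1] + E[||G(F(t)) - t||_1],
    # estimated on minibatches of source/target feature vectors.
    return float(np.mean(np.abs(f_g_s - s)) + np.mean(np.abs(g_f_t - t)))

def cyclegan_total_loss(adv_g, adv_f, s, f_g_s, t, g_f_t, lam=1.0):
    # L = L_GAN(G, D_T) + L_GAN(F, D_S) + lambda * L_cyc(G, F)
    return adv_g + adv_f + lam * cycle_consistency_loss(s, f_g_s, t, g_f_t)
```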
We conducted an ablation study of the proposed regularization term L_cyc by varying its weight λ on the EMO-DB dataset and observed that increasing λ improves both the quality and the diversity of the generated samples. However, once λ exceeds a threshold of about 1.0, training becomes unstable, which lowers the quality and even the diversity of the synthesized samples. We therefore empirically set λ = 1.0 in all experiments.

The proposed method
For a labeled dataset $X$ with $N$ emotional classes, artificial samples for each emotion are generated by a separate cycle-GAN. As shown in Fig. 3, the cycle-GAN transfers between the source domain $S$ and the target domain $T_i$, where $S$ is an unlabeled dataset and $T_i$ contains the samples of emotion $i$ in the labeled dataset $X$. Discriminator networks $D_{T_i}$ and $D_{S_i}$ are used so that the artificial targets cannot be distinguished from real samples; the generator and discriminator are trained with the adversarial losses $\mathcal{L}_{GAN}$ introduced above. The cycle loss function also affects the number of training epochs needed by the cycle-GANs. We therefore translate the synthetic target $G_i(S)$ back to the source domain and compute the mean squared error (MSE) between the real source $S$ and the reconstructed data $F_i(G_i(S))$; the same is done between $T_i$ and the reconstructed target data $G_i(F_i(T_i))$. The total loss function in each cycle is then

$$\mathcal{L}_i = \mathcal{L}_{GAN}(G_i, D_{T_i}, S, T_i) + \mathcal{L}_{GAN}(F_i, D_{S_i}, T_i, S) + \lambda\,[\mathrm{MSE}(S, F_i(G_i(S))) + \mathrm{MSE}(T_i, G_i(F_i(T_i)))].$$

Overcoming gradient vanishing problem in training cycle-GANs
To overcome the gradient vanishing problem in cycle-GANs, we suggest using the Wasserstein distance. In the extreme case, gradient descent may stall during weight updates while training the generators and discriminators. For two probability distributions $P_r$ and $P_g$, the Wasserstein distance is defined as

$$W_1(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)],$$

where $\|f\|_L \le 1$ means that $f$ satisfies the 1-Lipschitz constraint [28]. It is worth mentioning that $W_1$ is invariant up to a positive scalar $K$ if the Lipschitz bound is changed to $K$. $W_1$ is believed to be better suited to data distributed on low-dimensional manifolds. In weight clipping, weights that exceed the predefined bounds are clipped to the minimum or maximum value. In the gradient penalty method, the Lipschitz condition is enforced through a penalty, based on the fact that a function whose gradient norm is at most 1 everywhere is 1-Lipschitz; the squared deviation of the gradient norm from one is added as a penalty. According to [29], weight clipping may lead to a non-optimal solution, and the gradient penalty was applied to overcome its limitations [17]. However, under data sparsity, satisfying the $K$-Lipschitz condition over the whole data set is difficult. Accordingly, Wu et al. [29] suggested the Wasserstein divergence, in which the Wasserstein distance is approximated without imposing the Lipschitz condition:

$$\mathcal{L}_D = \mathbb{E}_{x \sim P_g}[f(x)] - \mathbb{E}_{x \sim P_r}[f(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim P_u}[\|\nabla_{\hat{x}} f(\hat{x})\|^p],$$

where $\lambda > 0$ controls the effect of the gradient term on the objective, $P_u$ is a Radon probability measure, and $p > 1$ relates to the $L^p$ space of the function $f$ [29]. The discriminator minimizes $\mathcal{L}_D$, while the generator minimizes $-\mathbb{E}_{x \sim P_g}[f(x)]$. Figure 4 shows that transferring data with the cycle-GAN makes the distributions of the real and artificial data similar. A classification loss function is used to ensure that the synthetic data can be correctly assigned to their target emotion class, which is defined here as the cross-entropy error:

Recognizing samples generated by cycle-GAN augmentation network
The classification loss is the cross-entropy error

$$\mathcal{L}_{cls} = -\sum_i y_i \log \hat{y}_i,$$

where $y_i$ is the label of the target emotions and $\hat{y}_i$ is the corresponding classifier output. The total loss is defined as

$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{cls}\,\mathcal{L}_{cls},$$

where $\lambda_{cyc}$ and $\lambda_{cls}$ are the weights corresponding to the cycle-GAN (cycle-consistency) loss and the classification loss.
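The classification and total losses can be sketched numerically as follows (illustrative; `y_pred` is assumed to hold per-class probabilities, one row per sample):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross-entropy over one-hot labels y_true and predicted class
    # probabilities y_pred; eps avoids log(0).
    return float(-np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1)))

def total_loss(l_gan, l_cyc, l_cls, lam_cyc=1.0, lam_cls=1.0):
    # L = L_GAN + lam_cyc * L_cyc + lam_cls * L_cls
    return l_gan + lam_cyc * l_cyc + lam_cls * l_cls
```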

Dataset
We performed experiments on the EMO-DB dataset [30], a small dataset of 535 clips covering seven emotional classes, recorded by ten professional actors in German. We used five of the seven emotions: anger (127 samples), fear (69 samples), happiness (71 samples), sadness (62 samples), and neutral (79 samples); we did not use disgust (81 samples) and surprise (46 samples). The data were recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz. The average length of each audio clip is 2.8 seconds.
Other datasets often used in speech emotion recognition include the Danish Emotional Speech database (DES) with 200 samples, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) with 2496 samples, the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) with 1150 samples, and the Vera am Mittag database (VAM) with 1018 samples. There are also audio-visual datasets such as SEWA [31] and MuSe-CAR [32], which are not discussed in this article because we focus only on emotion recognition from speech. All of these databases suffer from data scarcity and are too small for training deep neural networks. As a remedy, we suggest creating synthetic data using GANs trained on the available datasets. We chose EMO-DB because it contains fewer training samples than the other datasets and because it allows comparing our proposed data augmentation method with other augmentation methods.

Experimental setup
It is challenging to train the generators with high-dimensional feature vectors. To address this issue, we pre-trained both generators G_i and F_i using the reconstruction error between S and F_i(G_i(S)) and the reconstruction error between T_i and G_i(F_i(T_i)).
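This pre-training criterion is plain mean squared reconstruction error between a batch and its round-trip reconstruction; a minimal sketch (function names are ours, with the reconstructed arrays passed in precomputed):

```python
import numpy as np

def mse(a, b):
    # Mean squared reconstruction error between a batch and its
    # round-trip reconstruction, e.g., S vs F_i(G_i(S)).
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def pretrain_reconstruction_loss(S, FiGiS, Ti, GiFiTi):
    # Combined pre-training objective for the two generators.
    return mse(S, FiGiS) + mse(Ti, GiFiTi)
```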
We used the OpenSMILE software to extract the features, and then used the proposed method to generate new feature vectors to increase the number of training samples and to balance the number of samples in the dataset. The dimension of the feature vector is 2185 for each training sample.
A DNN with two hidden layers of 800 hidden neurons was used in the proposed cycle-GANs. We used a ResNet for the generator network and a PatchGAN for the discriminator network. In addition, DNN and SVM classifiers were used for evaluation, with leaky ReLU applied to all layers of the DNN; the SVM used a linear kernel. We used Xavier initialization [33] and the Adam optimizer [34] with a learning rate of 0.0002, which was reduced every 50 epochs. The DNNs were implemented with TensorFlow (v2.1) in Python, while the SVMs used the scikit-learn package.
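The learning-rate schedule can be sketched as a step decay (the halving factor here is an assumption for illustration; the setup above states only that the rate of 0.0002 is reduced every 50 epochs):

```python
def step_decay_lr(epoch, base_lr=2e-4, drop=0.5, step=50):
    # Step decay: multiply the base rate by `drop` once per `step`
    # epochs; base_lr matches the 0.0002 used in the experiments.
    return base_lr * (drop ** (epoch // step))
```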
To balance the training of G and D, the generator weights were updated twice per epoch and the discriminator weights once per epoch. Moreover, one-sided label smoothing [35] was used to reduce the discriminator's vulnerability to overconfident predictions.
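One-sided label smoothing can be sketched as follows (the 0.9 target value is a common choice, an assumption not stated above):

```python
import numpy as np

def one_sided_smooth_labels(n_real, n_fake, smooth=0.9):
    # One-sided label smoothing: real targets become `smooth` (< 1.0)
    # while fake targets stay at 0, discouraging the discriminator
    # from producing overconfident outputs.
    return np.concatenate([np.full(n_real, smooth), np.zeros(n_fake)])
```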

Results
The augmented data were gradually and randomly added to the original data, and DNN and SVM classifiers were used for SER. L2 regularization was used to train the deep neural networks; each experiment was repeated three times, and the mean result was reported as the performance measure. Figure 5 shows the UAR results of the SVM and DNN on the EMO-DB dataset.
We compared the performance of the proposed method with standard data augmentation techniques, such as sample duplication, adding random noise to feature vectors, and synthetic minority oversampling (SMOTE) [36]. The performance of augmentation via added noise depends on the amount of noise, and the results may not be stable, as shown in Fig. 6.
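The noise-based baseline can be sketched as follows (illustrative; `sigma` is the noise level whose choice drives the instability noted above):

```python
import numpy as np

def augment_with_noise(features, sigma=0.01, copies=1, seed=0):
    # Baseline augmentation: append Gaussian-jittered copies of the
    # feature matrix (one row per sample) to the original data.
    rng = np.random.default_rng(seed)
    noisy = [features + rng.normal(0.0, sigma, features.shape)
             for _ in range(copies)]
    return np.concatenate([features] + noisy, axis=0)
```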
Generating synthetic data similar to the original samples helps deep neural networks learn the data distribution better, whereas duplicated samples do not lead to better network training. SMOTE is designed to augment samples of one class rather than all classes, but its performance is relatively stable [36]. Figure 7 shows the results of this method.
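A minimal SMOTE-style interpolation sketch for a single class (illustrative only, not the exact algorithm of [36]): each synthetic point is a random interpolation between a class sample and one of its k nearest neighbours in feature space.

```python
import numpy as np

def smote_like_samples(X, n_new, k=3, seed=0):
    # X: (n_samples, n_features) matrix for ONE class.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        # Interpolate a fraction of the way towards the neighbour.
        out.append(X[i] + rng.random() * (X[j] - X[i]))
    return np.array(out)
```

Because each synthetic point lies on a segment between two real points, it stays inside the per-dimension range of the class.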
The cycle-GAN-based data augmentation also improved SVM performance. Figure 8 shows the performance of the two classifiers when combining real and augmented data generated by a cycle-GAN. The results show that the augmented data help the SVM separate the classes in feature space and classify them with better performance.
According to Fig. 9, the Wasserstein-distance-based augmentation introduced in Sect. 3 improves performance further. The unweighted average recall gradually increases as artificial samples are added to the training set. These results show that cycle-GAN-based data augmentation can generate new and meaningful emotional feature vectors that help improve the performance of the emotion classifier. Figure 10 shows the clusters created by the cycle-GAN with the Wasserstein distance for the five emotional classes. In Fig. 10b, the artificial samples generated by the proposed method completely cover the feature space of each emotion class, while the samples generated by an ordinary cycle-GAN in Fig. 10a cover only part of the feature space of the actual data.
We compared our method with the methods in [19,36,37] in Table 1. The table shows that the classifier is better trained with the proposed method, which outperforms [38] with handcrafted features. Our method also outperforms Chen et al. [37], who used 3D-ACRNNs to extract features.

Conclusion
We have presented a method for generating synthetic samples based on cycle-GANs to mitigate data scarcity and to improve speech emotion classification. The generated artificial data cover the entire feature space of each emotional class. We showed that using the Wasserstein divergence can overcome the vanishing gradient problem during training. The results show that adding synthetic samples to the real samples can improve emotion recognition performance to as high as 83.33% UAR on the EMO-DB dataset. In this work, the features were extracted with the OpenSMILE software; the approach can be extended by providing raw data to the network.