Preprint Article, Version 1 (preserved in Portico). This version is not peer-reviewed.

Speech Emotion Recognition using Data Augmentation Method by Cycle-Generative Adversarial Networks

Version 1: Received: 22 April 2021 / Approved: 26 April 2021 / Online: 26 April 2021 (10:49:55 CEST)

A peer-reviewed article of this Preprint also exists.

Shilandari, A., Marvi, H., Khosravi, H. et al. Speech emotion recognition using data augmentation method by cycle-generative adversarial networks. SIViP 16, 1955–1962 (2022). https://doi.org/10.1007/s11760-022-02156-9

Abstract

Nowadays, with the increasing mechanization of daily life, speech processing has become crucial for interaction between humans and machines. Deep neural networks require a database with enough data for training: the more features are extracted from the speech signal, the more samples are needed to train these networks. Adequate training can be ensured only when sufficient and varied data are available for each class. When data are scarce, data augmentation methods can be used to obtain a database with enough samples. One of the obstacles to developing speech emotion recognition systems is the data sparsity problem within each class for neural network training. The current study focuses on building a cycle-generative adversarial network for data augmentation in a speech emotion recognition system. For each of the five emotions employed, a generative adversarial network is designed to generate data that is very similar to the real data in that class while remaining distinguishable from the emotions of the other classes. These networks are trained adversarially to produce feature vectors resembling each class in the original feature space, which are then added to the existing training sets in the database to train the classifier network. Instead of the common cross-entropy loss for training generative adversarial networks, Wasserstein divergence is used to produce high-quality artificial samples and to avoid the vanishing gradient problem. The proposed network is tested for speech emotion recognition using EMODB as the training, testing, and evaluation set, and the quality of the artificial data is evaluated using two classifiers: a Support Vector Machine (SVM) and a Deep Neural Network (DNN). Moreover, the results reveal that, by extracting and reproducing high-level features from acoustic features, speech emotion recognition separating five primary emotions is achieved with acceptable accuracy.
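To illustrate the feature-level augmentation idea summarized above, the following is a minimal sketch (not the authors' code) of training one generator per emotion class against a Wasserstein-style critic with a gradient penalty, assuming PyTorch and pre-extracted acoustic feature vectors. All names (FEAT_DIM, NOISE_DIM, Generator, Critic, augment_class) and the penalty constants are hypothetical, and the cycle-consistency part of the authors' architecture is not reproduced here.

# Minimal sketch of per-class feature augmentation with a Wasserstein-style critic.
# Assumes pre-extracted acoustic feature vectors (float tensor, shape [N, FEAT_DIM]).
import torch
import torch.nn as nn

FEAT_DIM = 384      # assumed size of the acoustic feature vector
NOISE_DIM = 100     # assumed latent dimension

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )
    def forward(self, z):
        return self.net(z)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),          # scalar score, no sigmoid (Wasserstein setup)
        )
    def forward(self, x):
        return self.net(x)

def gradient_penalty(critic, real, fake, k=2.0, p=6.0):
    """Gradient penalty on interpolated samples: k * E[||grad D(x_hat)||^p]."""
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)[0]
    return k * grads.norm(2, dim=1).pow(p).mean()

def augment_class(real_features, steps=1000, batch=64):
    """Train one generator per emotion class and return synthetic feature vectors."""
    G, D = Generator(), Critic()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    for _ in range(steps):
        idx = torch.randint(0, real_features.size(0), (batch,))
        real = real_features[idx]
        fake = G(torch.randn(batch, NOISE_DIM))
        # Critic step: push D(real) up and D(fake) down, regularized by the penalty
        d_loss = D(fake.detach()).mean() - D(real).mean() \
                 + gradient_penalty(D, real, fake.detach())
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: make the critic score synthetic vectors like real ones
        g_loss = -D(G(torch.randn(batch, NOISE_DIM))).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    with torch.no_grad():
        return G(torch.randn(real_features.size(0), NOISE_DIM))  # synthetic samples

In such a setup, the synthetic vectors returned by augment_class would simply be appended to the corresponding class's training set before fitting the SVM or DNN classifier, which matches the augmentation role described in the abstract.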

Keywords

speech processing, data augmentation, speech emotion recognition, generative adversarial networks

Subject

Engineering, Electrical and Electronic Engineering
