Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning

We propose a novel transfer learning method for speech emotion recognition that yields promising results when only a small amount of training data is available. With as few as 125 examples per emotion class, we reach a higher accuracy than a strong baseline trained on 8 times more data. Our method leverages the knowledge contained in pre-trained speech representations extracted from models trained on a more general self-supervised task that does not require human annotations, such as the wav2vec model. We provide detailed insights into the benefits of our approach by varying the training data size, which can help labeling teams work more efficiently. We compare performance with other popular methods on the IEMOCAP dataset, a well-benchmarked dataset in the Speech Emotion Recognition (SER) research community. Furthermore, we demonstrate that results can be greatly improved by combining acoustic and linguistic knowledge from transfer learning. We align acoustic pre-trained representations with semantic representations from the BERT model through an attention-based recurrent neural network. Performance improves significantly when combining both modalities and scales with the amount of data. When trained on the full IEMOCAP dataset, we reach a new state-of-the-art of 73.9% unweighted accuracy (UA).


Introduction
Emotion recognition has been gaining traction over the last decade, driven by the interest in providing conversational agents with high emotional intelligence when interacting with users. Such systems have applications in healthcare, non-invasive mental health diagnostics and screening, automotive, and education. Recognizing emotions remains a challenging task in the speech domain, mainly due to the lack of labeled datasets large enough to successfully apply proven deep learning techniques, as has been done in better-resourced tasks such as Automatic Speech Recognition (ASR) [7,3]. This is even more problematic for non-English languages with fewer resources. It is thus important to find alternative ways to train accurate classifiers in situations where data is scarce. Moreover, emotion recognition is a data problem in its own right: a finer-grained emotion taxonomy leads to datasets with uneven class sizes or samples carrying multiple labels.
There have been various lines of work attempting to train accurate emotion classifiers with small amounts of labeled data. Some works have proposed imposing stronger constraints on convolutional layers to better fit raw speech data and prevent overfitting on small datasets [14]. Multi-task learning has also been proposed as a way to mitigate overfitting to a small dataset by learning several tasks simultaneously [21,24]. However, these approaches still require enough data to learn efficient filters from scratch or to train all auxiliary tasks.
Transfer learning is a growing area of research in deep learning and has the potential to help alleviate this problem of label scarcity. In particular, re-using pre-trained models or representations trained on more general tasks has been successfully applied in domains other than speech [5]. In computer vision, it is now standard practice to train a deep convolutional model first on the large ImageNet classification dataset and then re-use those weights for fine-tuning on a different vision task where less labeled data is available [9]. In natural language processing, the pre-training of large language models has been widely adopted and has led to greatly superior results on many tasks [13,4].
Despite the clear benefits of pre-training in computer vision and natural language processing, this approach has not yet taken off in the speech domain, mainly because it is still unclear which task is the most suitable for pre-training. Yet recent works have started to show promising results, notably the wav2vec model [17], which helped reach a new state-of-the-art for speech recognition by learning rich representations from unlabeled data through a contrastive loss. More specifically, for the task of emotion recognition, [10] obtained strong results by re-using representations from a pre-trained ASR model. However, this approach still requires a large amount of labeled data to first train a good speech transcription model, which makes it impractical for languages or domains where data is scarce.
In this work, we focus on unsupervised pre-training, which does not require labeled data for learning pre-trained representations, and as such we choose to re-use the wav2vec representations for emotion recognition. We also experiment with combining two kinds of pre-trained representations, for speech and for text: the addition of BERT representations allows us to improve our accuracy by almost 10%.

Speech Emotion Recognition
The current standard approach to classify emotions from speech is to first extract low-level features, often called low-level descriptors (LLDs), from short frames of speech of duration ranging from 20 to 50 milliseconds, denoted x_{1:T} = (x_1, x_2, ..., x_T), where x_i ∈ R^n, n being the number of features. A high-level aggregation transformation is then applied to convert these frame-level features to an utterance-level representation x_utterance ∈ R^n. Finally, a softmax layer is applied on top of this new global representation to classify the utterance along the possible emotion classes.
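This pipeline can be sketched as follows; the sizes (120 frames, 34 descriptors, 4 classes) and the random weights are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames of n low-level descriptors, C emotion classes.
T, n, C = 120, 34, 4

x = rng.normal(size=(T, n))          # frame-level features x_1 .. x_T
W = rng.normal(size=(n, C)) * 0.1    # softmax-layer weights (untrained)
b = np.zeros(C)

# Aggregate frame-level features into one utterance-level vector.
x_utt = x.mean(axis=0)               # x_utterance, shape (n,)

# Softmax classification over the emotion classes.
logits = x_utt @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # class probabilities, shape (C,)
```

In a trained model, W and b would be fit to labeled utterances; the mean is just one choice of aggregation function, as discussed next.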
Different methods have been used to transform the variable-length sequence of low-level descriptors to a fixed-length representation at the utterance level. Traditionally, statistical aggregation functions were applied to each of the LLDs over the duration of the utterance such as mean, max, variance, etc. These were concatenated into a long feature-vector to represent a single utterance. With the emergence of neural networks as the preferred approach, there have been experiments with various strategies such as recurrent neural networks or attention-based models [11].

Acoustic features
Traditionally, most works use hand-crafted features that have been proven to work well for other speech-related tasks. Commonly used features include Mel-frequency cepstrum coefficients (MFCCs), zero-crossing rate, energy, pitch, voicing probability, etc. Another line of work has focused on using raw spectrogram magnitudes, log-Mel spectrograms, or even the raw waveform [18,15]. Although this works well for tasks with sufficient data, like Automatic Speech Recognition, these methods are still difficult to apply to emotion recognition due to the lack of labeled data.
We argue that pre-trained features learned from raw waveforms or raw spectrograms could help alleviate this lack of data, and we propose using wav2vec features, which were trained on large amounts of speech in a self-supervised fashion.

Pre-trained Wav2vec Representations
The wav2vec [17] model is made of two simple convolutional neural networks: the encoder network and the context network. The encoder network f: X → Z takes raw audio samples x_i ∈ X as input and outputs low-frequency feature representations (z_1, z_2, ..., z_T), each encoding about 30 ms of 16 kHz audio every 10 ms. The context network g: Z → C transforms these low-frequency representations into a higher-level contextual representation c_i = g(z_i, ..., z_{i−v}) for a receptive field v. The total receptive field after passing through both networks is 210 ms for the base version and 810 ms for the large version.
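As a rough sanity check, the number of encoder representations produced for a clip follows from the 30 ms window and 10 ms stride described above; this sliding-window arithmetic is a simplification that ignores edge effects of the actual convolutions:

```python
# Approximate number of wav2vec encoder outputs for a clip of a given
# duration, assuming a 30 ms window advanced by 10 ms steps.
def num_frames(clip_ms, window_ms=30, stride_ms=10):
    if clip_ms < window_ms:
        return 0
    return 1 + (clip_ms - window_ms) // stride_ms

# A 5-second utterance yields roughly 498 representations.
print(num_frames(5000))
```

The same arithmetic applies to the context network, whose receptive field v then determines how many of these representations each c_i summarizes.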
The model was trained on about 1,000 hours of unlabeled English speech with a noise-contrastive binary classification task, where the objective was to distinguish true future samples from distractors. The motivation behind wav2vec was to learn effective pre-trained representations for Automatic Speech Recognition (ASR). Training a model on the resulting representations allowed the team to outperform Deep Speech 2 [1], the best reported character-based system at the time, while using two orders of magnitude less data.
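The contrastive objective can be illustrated with random vectors standing in for the context vector, the true future encoding, and the distractors (the actual model uses learned projections and multiple prediction steps):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Random stand-ins: a context vector c_i, a correlated "true future"
# encoding, and 10 unrelated distractor encodings.
c = rng.normal(size=d)
z_true = c + 0.1 * rng.normal(size=d)
z_neg = rng.normal(size=(10, d))

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Binary classification: the true pair should score high,
# the distractor pairs low.
loss = -np.log(sigmoid(c @ z_true)) - np.log(sigmoid(-(z_neg @ c))).sum()
```

Minimizing this loss over many positions pushes c_i to be predictive of its true future encodings, which is what makes the representations useful downstream.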
Our assumption is that these representations not only contain relevant information for speech recognition but also para-acoustic information that could be useful for detecting emotions in speech.

Model Architectures
We compare our approach against a baseline of commonly-used hand-engineered features to measure how much more emotional information the pre-trained representations contain. To better understand the differences between the two kinds of features, we experiment with the model architectures described below and in Figure 1. Because we experiment with very small amounts of data, we select simpler models rather than more complex recurrent neural networks to avoid overfitting. However, we do compare our solution to more advanced baselines trained on more data in Section 4.
(a) Mean pooling: this approach simply averages each feature along the time dimension. We make the simplifying assumption that the emotional information is sufficiently constant over time to meaningfully represent the overall utterance-level emotion, and we use audio segments short enough for this to hold.
(b) Mean-max pooling: similar to mean pooling, except that we leverage both the mean and the max of features across the time dimension. We then concatenate both vectors before feeding the result to the softmax layer. This approach effectively doubles the number of parameters of our model compared to plain mean pooling.
(c) Attention pooling: the model learns a weighted average whose weights are produced by a simple attention mechanism based on logistic regression. This allows the model to learn the most efficient pooling function for the task, which includes the mean pooling approach above as a special case.
(d) MLP with pooling: this last approach allows the model to learn more complex frame-level features before the pooling operation. We build a Multi-Layer Perceptron (MLP) on top of individual frame-level features x 1 , x 2 , . . . , x T before mean pooling across the time dimension. The MLP parameters are shared across time.
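The three learned variants can be sketched compactly; the dimensions, random features, and random weights below are illustrative stand-ins for untrained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 100, 512                     # frames x feature dimension (illustrative)
x = rng.normal(size=(T, n))         # stand-in for frame-level features

# (b) Mean-max pooling: concatenation doubles the utterance vector size.
mean_max = np.concatenate([x.mean(axis=0), x.max(axis=0)])   # shape (2n,)

# (c) Attention pooling: logistic-regression scores give frame weights.
w = rng.normal(size=n) * 0.01       # attention parameters (random stand-ins)
scores = x @ w                      # one scalar score per frame
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                # softmax over the T frames
attn = alpha @ x                    # weighted average, shape (n,)

# (d) MLP with pooling: a shared frame-level transform, then mean pooling.
W1 = rng.normal(size=(n, 128)) * 0.01
h = np.maximum(x @ W1, 0.0)         # ReLU applied identically to every frame
mlp_pool = h.mean(axis=0)           # shape (128,)
```

Note that with uniform attention scores (w = 0), variant (c) reduces exactly to mean pooling (a), which is why the attention model can only do as well or better in principle.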
We scale the number of layers and units to match model capacity across the two types of features, compensating for the smaller dimension of LLD feature vectors compared to wav2vec representations.

Bimodal Emotion Recognition
Several works have explored combining information from linguistic and acoustic features. Yoon et al. [23] and Heusser et al. [8] combined utterance-level information from audio and textual embeddings before classifying through a final softmax layer. Lu et al. [10] used an approach similar to ours, except that their pre-trained representations come from an ASR model and thus already contain semantic information.
In this work, we use the same model as in [22], aligning audio and textual pre-trained representations through an attention mechanism on top of a bidirectional recurrent neural network. The only differences are the replacement of hand-engineered features by wav2vec embeddings and of textual GloVe embeddings [12] by BERT embeddings. BERT [4] is a Transformer [19] encoder-only model which has been shown to capture meaningful information for text classification tasks; in particular, it encodes word embeddings that take context into account.
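A minimal sketch of the alignment step, with random arrays standing in for wav2vec frames and BERT token embeddings, and a plain dot-product attention in place of the learned mechanism on top of the bidirectional RNN:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: T_a audio frames, T_t text tokens, shared size d.
T_a, T_t, d = 200, 12, 64
audio = rng.normal(size=(T_a, d))   # stand-in for wav2vec frame representations
text = rng.normal(size=(T_t, d))    # stand-in for BERT token embeddings

# Each text token attends over all audio frames.
scores = text @ audio.T / np.sqrt(d)                    # (T_t, T_a)
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)               # softmax over frames

aligned_audio = alpha @ audio       # (T_t, d): one audio summary per token

# Fuse modalities per token, then pool over tokens for classification.
fused = np.concatenate([text, aligned_audio], axis=1).mean(axis=0)  # (2d,)
```

The key idea is that the attention weights align the two sequences in time, so each token's representation is paired with the acoustic context in which it was spoken.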

Experimental Setup
To compare emotion recognition performance between traditional features and embeddings from self-supervised pre-trained models, we use the IEMOCAP dataset [2], which contains 12 hours of audiovisual data, including video, speech, facial motion capture, and text transcriptions. We use scripted and improvised dialogs from all 5 sessions, and only the audio (plus transcriptions for the bi-modal experiment) for our evaluation. As in previous works on IEMOCAP, we keep only the 4 emotion classes (neutral, happy, sad, and angry) on which at least 2 annotators agree. We also merge excitement with happiness, resulting in a total of 5,531 utterances for the full dataset. We evaluate our models using speaker-independent 5-fold cross-validation: for each split, 4 sessions are used for training and the remaining one for testing. We report accuracy averaged over all folds, consistent with previous works.
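This protocol can be sketched as a leave-one-session-out loop, where `evaluate` is a placeholder for training on four sessions and testing on the held-out one:

```python
# Speaker-independent 5-fold cross-validation over IEMOCAP's 5 sessions.
sessions = [1, 2, 3, 4, 5]

def cross_validate(evaluate):
    """Average the test accuracy over all leave-one-session-out splits.

    `evaluate(train_sessions, test_session)` is a placeholder that should
    train a model on the given sessions and return test accuracy.
    """
    accs = []
    for held_out in sessions:
        train = [s for s in sessions if s != held_out]
        accs.append(evaluate(train, held_out))
    return sum(accs) / len(accs)
```

Because each session features a distinct pair of speakers, holding out a whole session guarantees that test speakers are never seen during training.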
We use the pyAudioAnalysis library [6] to extract a set of 34 commonly-used features, as in [22]. For wav2vec representations, we use the large version of the model, since its wider receptive field is more likely to capture emotion-salient features. For both audio representations, we crop all utterances to a maximum duration of 5 seconds. For BERT embeddings, we use the base model and extract embeddings from the provided IEMOCAP transcription files with the HuggingFace library [20]. We take embeddings from the second-to-last layer, since these contain more general information and are less tied to BERT's particular training objective.
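A sketch of the layer-selection step: when called with `output_hidden_states=True`, HuggingFace models return one hidden-state tensor per layer plus the embedding output, 13 in total for BERT-base. Here random arrays stand in for those tensors so the indexing is explicit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake hidden states: 13 entries (embedding output + 12 layers) for a
# BERT-base model, each of shape (sequence length, hidden size).
num_states, seq_len, hidden = 13, 10, 768
hidden_states = [rng.normal(size=(seq_len, hidden)) for _ in range(num_states)]

# Take the second-to-last layer as the token embeddings, as described above.
token_embeddings = hidden_states[-2]
```

With the real library, `hidden_states` would come from the model output rather than random arrays; the `[-2]` indexing is the same.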
We train our models with a small batch size of 16 and use the Adam optimizer with a learning rate of 10^-4. We do not use any form of early stopping. We apply dropout regularization after each layer in all models.

Training with Limited Data
We first compare generalization performance in a setting where training data is very scarce, using only 500 examples for training (125 per class). With a simple mean pooling and a softmax layer, we reach 56.7% unweighted accuracy (UA) thanks to the wav2vec representations, 5.2% higher than the same model trained on standard hand-engineered features. Table 1 reports our results across different models and compares them with hand-engineered features. Our best model reaches an unweighted accuracy of 58.5%, higher than the same model trained on 8 times more data using hand-engineered features.
Interestingly, our results almost match those of a more advanced Bi-RNN model [11] trained on the entire dataset, which allows each frame to attend to the entire utterance (see Table 2). In comparison, wav2vec representations only have a receptive field of about 810 ms, less than one sixth of the maximum utterance length of 5 seconds.

Performance Scaling
In a second phase, we progressively increase the amount of training data for our best model (MLP with pooling). Figure 2 shows the evolution of unweighted accuracy as we increase the amount of data: the UA grows roughly logarithmically with the number of training examples.
Starting from 2,000 examples, our approach strongly outperforms the other acoustic-only baselines we benchmark against (see Table 2). The models we compare with were not only trained on more than twice the data, but also use much more complex architectures than our simple MLP with pooling.

Bi-Modal Transfer Learning
Finally, we experiment with combining pre-trained embeddings for both audio and text. We align wav2vec representations and subword embeddings from BERT in time through an attention-based recurrent neural network, similar to [22]. The resulting model is much larger than the previous ones, so to avoid overfitting we only train it on the full dataset. We report the unweighted accuracy in Table 2, where our approach outperforms other models and reaches a new state-of-the-art unweighted accuracy of 73.9%.

Table 2: Performance of our acoustic-only and bi-modal models when scaled on the full IEMOCAP dataset. We report results from previous state-of-the-art and often-cited works for comparison.


Conclusion
In this paper, we compare emotion recognition performance when using pre-trained embeddings learned in a self-supervised setting. We demonstrate the superior performance and sample-efficiency of our technique compared to identical models using commonly-used hand-engineered features. Our model reaches a higher accuracy with 8 times less data than if it were trained from scratch in a supervised setting.
We report performance as we scale up the training data, and we build a final model on two modalities, audio and text, both using pre-trained features from self-supervised models. This bi-modal model reaches a new state-of-the-art unweighted accuracy of 73.9%.