This section discusses the gap in using zero-shot classification
for complex emotion detection. First, it reviews studies on complex emotions.
Second, it reviews zero-shot classification. The gap between the two
sets the scope of this study and motivates the research hypothesis
tested in later sections.
2.1. Complex Emotions
Current literature agrees that basic emotions are innate
and universal, automatic and fast, and trigger behaviour with a high survival value
(Cowen & Keltner, 2017). During the 1970s, psychologist
Paul Ekman identified six basic emotions that he suggested were universally experienced
in all human cultures. The emotions he identified were happiness, sadness,
disgust, fear, surprise, and anger (Ekman, 1993).
Plutchik later extended this to a set of eight primary emotions:
joy, trust, fear, surprise, sadness, anticipation, anger and disgust (Plutchik,
2001). The emotions identified by Ekman and Plutchik can be observed
in facial expressions as well as in written messages, social network posts and tweets
(Truong, 2022). Text-based emotion recognition
is a sub-branch of emotion detection that focuses on extracting fine-grained emotions
from written texts. Researchers have worked on
different datasets, including simple sentences, tweets, and
dialogues, to detect emotions (Kamath et al., 2022a).
Recent psychological discoveries have introduced novel
conceptual and methodological ways to capture the more intricate "semantic
space" of emotion by analyzing the distribution of emotional reactions to various
stimuli using computational tools. Alan S. Cowen and Dacher Keltner from the University
of California, Berkeley, identified 27 distinct categories of emotions in a study
(Cowen & Keltner, 2017). They used 2,185 short videos to elicit specific
emotions and then analyzed participants' responses. The list of 27 emotions
is not exhaustive, as an emotional state can blend several categories in different proportions.
The researchers created an interactive map to display the categories and their impact
on reactions. Some emotions, such as anger, may be ostensible reactions that obscure
the true feelings. For example, anger may be a manifestation of fear, while hate
and resentment can be traced back to other emotions (Cowen & Keltner, 2017).
The fine-grained categories adopted in subsequent text-based work include admiration,
amusement, anger, annoyance, approval, caring,
confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment,
excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization,
relief, remorse, sadness and surprise, together with a neutral label (Demszky et al., 2020).
Building on these psychological findings, researchers from Stanford University
and Google Research published GoEmotions in 2020 to create a granular taxonomy
for text-based emotion recognition and investigate the dimensionality of language-based
emotion space (Demszky et al., 2020). GoEmotions is the largest manually annotated
emotion dataset available: it consists of 58k Reddit comments extracted
from popular English-language subreddits and labelled with 27 emotion categories or neutral.
The authors demonstrate the high quality of the annotations via Principal Preserved
Component Analysis and conduct transfer learning experiments with existing emotion
benchmarks to show that the dataset generalizes well to other domains and different
emotion taxonomies (Kamath et al., 2022a).
GoEmotions is designed to provide a strong baseline
for modelling fine-grained emotion classification. The dataset is built manually,
making it the largest human-annotated dataset, with multiple annotations per example
for quality assurance (Yusifov & Sineva, 2022). Previous datasets come from
the domain of Twitter, given its informal language and expressive content, such
as emojis and hashtags (Truong, 2022; Truong et al., 2020). Other datasets annotate
news headlines, dialogues, fairytales, movie subtitles, sentences based on FrameNet,
or self-reported experiences. The authors build on existing methods and findings
to devise a granular taxonomy for text-based emotion recognition and study the dimensionality
of language-based emotion space (Cowen & Keltner, 2017). They also use feature-based
and neural models to build automatic emotion classification models, demonstrating
the potential for further advancement in understanding emotion expression in language.
By fine-tuning a BERT-base model, the authors achieve an average F1-score of .46
over the full taxonomy, .64 over an Ekman-style grouping into six coarse categories,
and .69 over a sentiment grouping. These results leave much room for improvement,
showcasing that this task is not yet fully addressed by current state-of-the-art
natural language processing models (Demszky et al., 2020).
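To make the reported metric concrete, the following minimal sketch computes a macro-averaged
F1-score over multi-label predictions. The function name macro_f1 and the toy gold and
predicted label sets are illustrative assumptions; they are not drawn from GoEmotions or
from the authors' evaluation code.

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 for multi-label data.

    gold, pred: lists of sets of label strings, one set per example.
    labels: the label inventory to average over.
    """
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # A label that never occurs in gold or pred is scored 0 by convention here.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


# Toy data, invented for illustration only.
labels = ["joy", "anger", "fear"]
gold = [{"joy"}, {"anger"}, {"fear"}, {"joy", "anger"}]
pred = [{"joy"}, {"joy"}, {"fear"}, {"anger"}]
score = macro_f1(gold, pred, labels)
print(round(score, 3))  # 0.722
```

Because every label contributes equally to the average regardless of its frequency, rare
emotions weigh as much as frequent ones, which may partly explain why a fine-grained
taxonomy yields a lower average than a coarse grouping.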
Since its introduction, the GoEmotions dataset has been
used for various natural language processing tasks such as building empathetic chatbots
and detecting harmful online behavior. It also drew some attention from the research
community. For example, Alvarez-Gonzalez et al. (2021) analyze
the limits of text-based emotion detection on the two largest corpora currently available:
GoEmotions (58k Reddit comments tagged with possibly multiple labels out of 28 emotions,
annotated by third-person readers) and Vent (33M messages tagged with one out of
705 emotions by their original first-person writers) (Alvarez-Gonzalez et al., 2021).
These properties make the datasets suitable for studying textual emotion detection at scale from
different perspectives. The authors focus on categorical approaches with recent
emotional taxonomies covering a rich spectrum of emotions from the perspectives
of senders and receivers. Emotion detection text corpora are used to build and evaluate
emotion detection systems, with early works like SentiStrength and ANEW using lexical
associations for sentiment analysis (Hutto & Gilbert, 2014). More sophisticated rule-based
models like VADER rely on human-annotated word signals, LIWC, and EmoLex.
The authors also discuss the limitations of text-based emotion detection systems
and the NLP approaches that may be used to implement them. The results suggest that
emotions expressed by writers are harder to identify than emotions that readers
perceive (Alvarez-Gonzalez et al., 2021).
Yusifov and Sineva (2022) aim to use classical machine learning
algorithms to train a model capable of recognizing emotions in text with accuracy
as close as possible to that of transformers. Their study
simplifies the classification step used by the Google researchers by relying on classical
machine learning methods. The dataset was created using the results of experiments
with 82 participants. About 1% of all annotations were marked as unclear. Consistency
among the evaluators was analyzed, and in 92% of the examples
two or more evaluators agreed on at least one emotion label. The log odds ratio of
the i-th word being in the j-th set of emotion words was calculated, yielding a
table that shows the degree to which each word belongs to a particular set of emotion
words. The study focuses on emotion classification using a dataset of over 200,000
annotations, and the most prominent words for each emotion category are reported
(Yusifov & Sineva, 2022). This work was a further contribution to analysing the
dataset, but it did not provide an alternative that increases model accuracy.
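The log odds ratio described above can be sketched as follows. This is a simple
additively smoothed variant, chosen here as an assumption for illustration; it is not
necessarily the exact formulation used in the study, and the function name
log_odds_ratios and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def log_odds_ratios(docs_by_emotion, smoothing=1.0):
    """Return {emotion: {word: log odds ratio}}.

    A positive value means the word is more typical of texts labelled with
    that emotion than of texts labelled with the other emotions.
    """
    counts = {
        emotion: Counter(w for doc in docs for w in doc.lower().split())
        for emotion, docs in docs_by_emotion.items()
    }
    total = Counter()
    for c in counts.values():
        total.update(c)
    vocab_size = len(total)
    n_total = sum(total.values())

    result = {}
    for emotion, c in counts.items():
        n_in = sum(c.values())
        n_out = n_total - n_in
        scores = {}
        for word in total:
            # Smoothed counts of the word inside and outside this emotion.
            a = c[word] + smoothing
            b = total[word] - c[word] + smoothing
            odds_in = a / (n_in + smoothing * vocab_size - a)
            odds_out = b / (n_out + smoothing * vocab_size - b)
            scores[word] = math.log(odds_in) - math.log(odds_out)
        result[emotion] = scores
    return result


# Toy corpus, invented for illustration only.
toy_docs = {
    "joy": ["so happy today", "happy and glad"],
    "anger": ["so angry today", "angry and furious"],
}
ratios = log_odds_ratios(toy_docs)
top_joy_word = max(ratios["joy"], key=ratios["joy"].get)
print(top_joy_word)  # happy
```

Sorting each emotion's words by this score gives the kind of per-emotion word table the
study describes.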
Zanwar’s study aims to improve the generalizability
of text-based emotion detection by leveraging transformer models with psycholinguistic
features (Zanwar et al., 2022). The authors propose approaches for text-based emotion
detection that leverage transformer models (BERT and RoBERTa) in combination with
Bidirectional Long Short-Term Memory (BiLSTM) networks trained on a comprehensive
set of psycholinguistic features (Truong et al., 2019). The proposed hybrid models
improve the ability to generalize to out-of-distribution data compared to a standard
transformer-based approach. The authors evaluate the performance of their models
within-domain on two benchmark datasets, GoEmotion and ISEAR, and conduct transfer
learning experiments on six datasets from the Unified Emotion Dataset. Their study
demonstrates that the proposed hybrid models outperform pre-trained transformer
models and improve the generalizability of emotion classification across domains
and emotion taxonomies (Zanwar et al., 2022). The study contributes to the advancement
of emotion detection models in real-world sentiment and emotion applications by
constructing a unified, aggregated emotion detection dataset that encompasses different
domains and annotation schemes (Zanwar et al., 2022). However, it still relies on two
BERT-based models and provides no significant improvement in accuracy.
Similarly, Kamath et al. (2022a) presented an enhanced
context-based emotion detection model: a RoBERTa model fine-tuned
on the GoEmotions dataset (Kamath et al., 2022b). Their paper reviews previous attempts to
create emotion datasets and models, focusing on the state of the art, and
aims to improve the performance
of emotion detection models in various NLP tasks, such as semantic and propaganda
analysis (Kamath et al., 2022a). The model was tested on three different emotion
taxonomies and yielded a higher macro-F1 score than the
model originally used; at 0.56, however, it still leaves considerable room for improvement
(Kamath et al., 2022a).
Papers with Code is a community-driven platform for
learning about state-of-the-art research papers on machine learning. It provides
a complete ecosystem for open-source contributors, machine learning engineers, data
scientists, researchers, and students to make it easy to share ideas and boost machine
learning development. The latest version of Papers with Code has added 950+ unique
machine learning tasks, 500+ state-of-the-art result leaderboards and 8500+ papers
with code. Papers with Code keeps track of models fine-tuned on GoEmotions
at
https://paperswithcode.com/sota/text-classification-on-go-emotions. As of August
2023, five models were listed there, with a highest accuracy of 0.589, which
leaves considerable room for improvement.
Among those five models on Papers with Code, two used
bert-base-uncased, one used distilbert, one used roberta, and one used
electricidad. Regarding technique, all of them used standard text classification;
none used BART or zero-shot classification. This motivates a review of
what BART and zero-shot classification can offer.
2.2. Sequence Classification
Text classification is a common NLP task used to solve
business problems in various fields. It categorizes or predicts unseen text documents
using supervised machine learning, similar to tabular dataset classification algorithms.
The main difference is that the input is free text rather than tabular features. Text classification
has traditionally used supervised machine learning for tasks such as ticket routing: automatically
tagging incoming messages by topic, language, sentiment, and intent, and directing
them to the customer support team with the relevant expertise (Xian et al., 2016).
There are two types of classification: supervised and unsupervised. Supervised classification
allows users more control by selecting training data and assigning them to correct
classes, while unsupervised classification is automated and requires no user input
(Xian et al., 2016).
Unsupervised text classification approaches aim to categorize
text without using annotated data during training, potentially reducing annotation
costs (Xian et al., 2016). There are two main categories: similarity-based approaches,
which generate semantic embeddings of texts and label descriptions, and zero-shot
learning, which uses labelled training instances to predict unseen classes. These
techniques use labelled data for training but do not require fine-tuning on labelled
data from target classes (V. N. X. Truong, 2016). Token classification assigns
a class to each token in a sequence, for example to each
word in a sentence. Sequence classification assigns a class to the whole sequence,
for example to a sentence. Because they need no labelled data from the target classes,
pretrained zero-shot text classification
models are considered unsupervised text classification strategies
(Lewis et al., 2019).
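The similarity-based approach can be illustrated with a minimal sketch. A bag-of-words
vector stands in here for a learned semantic embedding, and the label descriptions are
invented for illustration; both are assumptions rather than the method of any cited study.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned semantic embedding."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_by_similarity(text, label_descriptions):
    """Pick the label whose description is most similar to the text.

    No annotated training examples are used: only the text and a short
    description of each label are embedded and compared.
    """
    text_vec = embed(text)
    scores = {
        label: cosine(text_vec, embed(description))
        for label, description in label_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical label descriptions, invented for illustration only.
descriptions = {
    "gratitude": "thank you thanks grateful appreciate",
    "anger": "angry furious hate mad",
}
label, scores = classify_by_similarity(
    "thank you so much i really appreciate it", descriptions
)
print(label)  # gratitude
```

With real sentence embeddings in place of the word counts, texts and label descriptions
could match on meaning rather than on shared words.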
Previous studies have used supervised classification
techniques in their fine-tuning with the GoEmotions dataset, but achieved quite
low accuracy (Kamath et al., 2022b). Unsupervised classification is a technique
that identifies important sections of the text and generates them verbatim producing
a subset of the sentences from the original text. Conventional text classification
methods work by taking the text, ranking all the sentences according to the understanding
and relevance of the text, and presenting you with the most relevant classification.
This method does not create new words or phrases; it just takes the already existing
words and phrases and presents only that (Devlin et al., 2018).
Many models perform classification using
transformers. BERT and RoBERTa are two examples used for supervised classification.
BERT is a transformer-based model that was introduced by Devlin et al. (2018). It
was designed to pre-train deep bidirectional representations from the unlabeled
text by joint conditioning on both the left and right context in all layers. BERT
has since become one of the most popular transformer-based models for natural language
processing tasks (Yusifov & Sineva, 2022). EmoRoBERTa is an enhanced emotion
detection model using RoBERTa. It is an attempt
to build a more robust emotion detection model that can be implemented in various
NLP tasks such as semantic and propaganda analysis that involve the heavy usage
of emotions (Kamath et al., 2022a).
On the other hand, unsupervised classification is a
natural language technique that generates a more “human” friendly classification
by interpreting and understanding the important aspects of a text. It creates new
sentences and guesses the meaning of the whole text, making it more complex and
computationally expensive (Gera et al., 2022).
As one example, Zero-Shot Classification is a transfer
learning method that uses a pre-trained language model to predict a class that was
not seen during training (Fu et al., 2018). This method is useful when little
labelled data is available. The model is provided with a prompt and a sequence of
text to describe the desired task in natural language. Zero-Shot Classification
excludes any examples of the desired task being completed, unlike single or few-shot
classification, which includes only a few examples. This feature is emergent in
large language models, with effectiveness scaling with model size. Larger models
with more trainable parameters or layers generally perform better at zero, single,
or few-shot tasks (Xian et al., 2016).
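One common way to realise zero-shot classification, assumed here for illustration rather
than taken from the studies cited above, is to turn each candidate label into a
natural-language hypothesis and score it against the input text with an entailment model.
In the sketch below, toy_entailment_score is a hypothetical word-overlap stand-in for such
a model, and the hypothesis template is likewise an assumption.

```python
import math


def zero_shot_classify(text, candidate_labels, entailment_score,
                       template="This text expresses {}."):
    """Rank candidate labels for `text` without any task-specific training.

    entailment_score(premise, hypothesis) -> float is supplied by the caller;
    higher means the premise more strongly supports the hypothesis.
    """
    hypotheses = [template.format(label) for label in candidate_labels]
    logits = [entailment_score(text, h) for h in hypotheses]
    # Softmax over the candidate labels (shifted by the max for stability).
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    z = sum(exps)
    ranked = zip(candidate_labels, (e / z for e in exps))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def toy_entailment_score(premise, hypothesis):
    """Hypothetical stand-in for an entailment model: counts shared words."""
    p = set(premise.lower().replace(".", "").split())
    h = set(hypothesis.lower().replace(".", "").split())
    return float(len(p & h))


ranking = zero_shot_classify(
    "I feel so much gratitude for your help",
    ["gratitude", "anger", "fear"],
    toy_entailment_score,
)
print(ranking[0][0])  # gratitude
```

The label set is passed in at prediction time, so new emotion categories can be added
without retraining; only the scoring function would need to be replaced by a real model.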
Wang et al. (2018) present a novel approach
to zero-shot recognition, focusing on learning a visual classifier for a category
with no training examples using word embeddings and its relationship to other categories.
The approach builds upon the Graph Convolutional Network (GCN) and uses semantic
embeddings and categorical relationships to predict classifiers. Given a learned knowledge
graph (KG), the approach takes semantic embeddings for each node as input, and a series of graph
convolutions predicts the visual classifier for each category. During training, visual
classifiers for a few categories are given to learn the GCN parameters, and at test
time, these filters are used to predict the classifiers of unseen categories. The approach is robust
to noise in the KG and significantly improves performance compared to the then state-of-the-art
results, by margins ranging from 2% on some metrics to 20% on a few (Wang et al., 2018).
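A single graph convolution step of the kind described above can be sketched in plain
Python. The symmetric normalisation with self-loops is a commonly used GCN propagation
rule and is assumed here; the three-node graph, the feature values, and the identity
weight matrix are invented for illustration and do not reproduce the authors' architecture.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, features, weights):
    """One layer: mix each node with its neighbours, project, apply ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    projected = matmul(propagated, weights)
    return [[max(0.0, x) for x in row] for row in projected]


# Toy undirected graph with edges 0-1 and 1-2 (invented for illustration).
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
node_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per category
identity_weights = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer(adjacency, node_embeddings, identity_weights)
print(out[0])
```

Stacking several such layers lets information from related categories reach a node that
has no training examples of its own.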
Deep convolutional neural networks have made significant
progress in supervised recognition tasks, but scaling recognition to large classes
with limited training samples remains a challenge. One approach is zero-shot recognition,
which involves developing models that recognize unseen categories without training
instances. Fu et al. (2018) review existing zero-shot recognition techniques,
including representations, datasets, and evaluation settings. They also discuss
related recognition tasks like one-shot and open-set recognition, which can be used
as extensions of zero-shot recognition when limited class samples become available
or when implemented in real-world settings (V. Truong, 2016). The article highlights
the limitations of existing approaches and suggests future research directions in
this area, including more generalized and realistic settings, combining
zero-shot with few-shot learning, going beyond object categories, and curriculum learning
(Fu et al., 2018).
Puri and Catanzaro (2019) explore the use
of natural language for zero-shot model adaptation to new tasks. Their approach uses text and
metadata from social commenting platforms as a pretraining task and trains the language
model with natural language descriptions of classification tasks. This allows the
model to generalize to new tasks without multiple multitask classification heads.
The zero-shot performance of these generative language models, trained with weak
supervision, shows a 45% absolute improvement in classification accuracy over random
or majority class baselines. This suggests that natural language can serve as a
powerful descriptor for task adaptation, potentially leading to new meta-learning
strategies for text problems (Puri & Catanzaro, 2019).
Tesfagergish et al. (2022) present a novel
sentiment analysis method for the English language, addressing the binary and three-class
sentiment analysis problems. The method proceeds in two stages, with
the first stage determining emotions and the second stage determining sentiments.
The core of the first stage is a zero-shot transformer model, which does not require
training and extracts probabilities of emotions for the given text. The second stage
converts the zero-shot classification results into a one-hot encoding vector and
trains a supervised machine-learning classifier. The researchers investigated various
machine learning methods, including traditional, deep learning, single-model, and
ensemble methods. The best accuracy was achieved with a set of 10 and 6 emotions,
respectively (Tesfagergish et al., 2022).
The best zero-shot model is bart-large-mnli, and the
best classifier is ensemble learning. The proposed method achieves a 44% improvement
compared to previous research, making it stable even with small training datasets.
The method reduces the effort of training vectorizers and the need for a large training
dataset. The simplified structure of the method can benefit under-researched languages.
The research validates the application of emotion detection in detecting sentiment
in given texts. Future research will focus on testing all possible emotions and
domain-dependent ones, as different emotions in different contexts and domains may
lead to different sentiments (Tesfagergish et al., 2022).
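The two-stage pipeline described above can be sketched as follows. The stage-one
probabilities are invented placeholders for the output of a zero-shot model, and
MajorityVoteClassifier is a hypothetical, deliberately simple stand-in for the supervised
classifiers the authors investigated.

```python
from collections import Counter, defaultdict


def to_one_hot(probabilities):
    """One-hot encode a probability vector at its most probable emotion."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return tuple(1 if i == best else 0 for i in range(len(probabilities)))


class MajorityVoteClassifier:
    """Stand-in second stage: maps each one-hot vector to the sentiment
    most often seen with it in the training data."""

    def fit(self, vectors, sentiments):
        votes = defaultdict(Counter)
        for vector, sentiment in zip(vectors, sentiments):
            votes[vector][sentiment] += 1
        self.table_ = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        # Fallback for one-hot vectors never seen during training.
        self.default_ = Counter(sentiments).most_common(1)[0][0]
        return self

    def predict(self, vectors):
        return [self.table_.get(v, self.default_) for v in vectors]


# Stage one (placeholder): emotion probabilities over ["joy", "anger", "sadness"].
train_probs = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6], [0.9, 0.05, 0.05]]
train_sentiments = ["positive", "negative", "negative", "positive"]

# Stage two: train on the one-hot encoded stage-one output.
clf = MajorityVoteClassifier().fit(
    [to_one_hot(p) for p in train_probs], train_sentiments
)
predictions = clf.predict([to_one_hot([0.6, 0.3, 0.1]), to_one_hot([0.1, 0.2, 0.7])])
print(predictions)  # ['positive', 'negative']
```

Only the second stage is trained, on very low-dimensional inputs, which is consistent with
the reported stability on small training datasets.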
A simple self-training approach is proposed to bridge
the gap in text classification using class names and an unlabeled dataset. Fine-tuning
the zero-shot classifier on its most confident predictions leads to significant
performance gains across various tasks, as self-training adapts the model to the
task, as shown in previous studies. The studies reviewed above report that zero-shot
classification outperforms conventional approaches in fields such as news and
documentation. At the same time, studies on emotions have used only conventional
approaches and achieved relatively low results. There is thus a gap in understanding the effectiveness
of zero-shot classification in complex emotion detection. This study, therefore,
hypothesizes that:
Hypothesis: Using zero-shot classification in complex emotion detection is significantly more effective than conventional text classification.