Preprint
Article

This version is not peer-reviewed.

Explainable Transformer Models for Human Emotion Recognition: A Multi-Method Explainability Study in the Context of Mental Health

Submitted:

17 April 2026

Posted:

21 April 2026

You are already at the latest version

Abstract
Recognizing emotions from written text is a very important part of Natural Language Processing (NLP) and is commonly used for feeling or sentiment analysis or keeping track of someone’s mental health status. This study uses a readable emotion-detecting framework with a RoBERTa-base model that has been modified and trained specifically for the Emotions for NLP dataset and provides an accuracy of 0.924% and f1 score of 0.925%. The main contributions of this study are the use of four different techniques that will help understand how the model works: SHAP (SHapley Additive exPlanations) provides global token credit attribution; LIME (Linear Interpretable Model-Agnostic Explanation) provides instance-level explanations; multi-head Attention Visualization provides structural interpretability; and Integrated Gradients via Captum provides gradient-based attribution using integration. The combination of these four techniques works together to improve transparency, help identify bias in the models, and support the responsible use of this model. Finally, the developers of this model performed many experiments that demonstrated the consistency with which the model could identify important emotional tokens (words or phrases) as predictive indicators of emotion.
Keywords: 
;  ;  ;  ;  ;  ;  ;  ;  

1. Introduction

Natural Language Processing (NLP) helps identify emotional states such as joy, sadness, anger, fear, love, and surprise from text. The ability to automatically determine these emotions is becoming increasingly important in many real-world situations, especially those that include mental health support systems, customer sentiment analysis, social media monitoring, dialogue systems, and opinion mining [8,15]. With the enormous increase in user-generated text produced via platforms such as Twitter, Reddit, and online forums, the ability to accurately recognize emotions in text has become extremely important.
Handcrafted features (such as lexicons, n-gram models, and support vector machines [SVMs]) dominated emotion classification techniques; however, pretrained transformer language models, particularly Bidirectional Encoder Representations from Transformers (BERT) and its variant, Robustly Optimized BERT Pretraining Approach (RoBERTa), have reshaped NLP. By first leveraging large-scale unsupervised pre-training, followed by task-specific fine-tuning, these models learn deep contextual relationships between words that traditional methods are unable to discover. There are several differences between BERT and RoBERTa; in particular, RoBERTa outperforms BERT on many downstream tasks because of its enhanced training protocol (e.g., dynamic masking of text, longer training periods, and omission of the Next Sentence Prediction purpose).
The transformer models proved to be powerful with regard to making predictions; however, their internal decision-making process is very much a “black-box.” The black-box problem is a deterrent to their implementation in high-stakes settings (e.g., clinical decision support, legal sentiment analysis) because these applications require transparency and accountability of use [11]. The field of Explainable Artificial Intelligence (XAI) has arisen in response to the need for transparency and therefore provides post-hoc and intrinsic ways to provide insight into the workings or logic behind the decisions of complex models.
Most existing research on Explainable Emotion Recognition uses only one technique to provide an explanation for the model’s behavior; therefore, these techniques provide only partial views of the model’s behavior. For example, using the SHAP Method [4], researchers can provide global feature importance scores for the model based on Shapley values from game theory, whereas LIME [5] provides local, instance-level explanations for the model through surrogate linear models. The use of attention visualization [7] demonstrates the token pairs in the model’s input that the model focused on using the internal multi-head self-attention weights of the transformer. Through the Captum Library [10], Integrated Gradients (IG) [3] establish axiom-satisfying gradient-based attributions that can be considered more meaningful than the original gradient-based methods.
This study proposes and implements a new and comprehensive pipeline for explainable emotion classification by integrating all four complementary XAI methods into one framework. The pipeline uses a RoB-ERTa-base pre-trained model that was fine-tuned using the Emotions for Natural Language Processing (NLP) dataset (source: Kaggle) and validated using a held-out test dataset. The major contributions of this study are as follows:
(1)
A novel RoBERTa-based model for emotion classification achieved state-of-the-art classification performance on a public benchmark dataset with the following performance metrics: (i) 0.924% accuracy, (ii) 0.925% weighted F1-score, and (iii) 0.997% ROC-AUC across six emotion categories.
(2)
The first systematic multi-XAI comparative analysis that combines SHAP, LIME, Attention Visualization, and Integrated Gradients in a unified transformer emotion classification framework.
(3)
A before-and-after methodology was implemented using a rigorous analysis of each method’s scope of responsibility, theoretical basis, and additional contributions to the overall model interpretability.
(4)
All experiments, figures, and model weights will be publicly available for reproduction in a single Google Colab environment to facilitate transparency and reproducibility.
The rest of this paper is organized as follows—section two reviews the existing literature pertaining to emotion recognition and explainable artificial intelligence. In section three the materials utilized within this project; the data set utilized in this project and the proposed methodology. Section four presents the experimental results, while section five discusses the findings in relation to previous research reports. Finally, section six concludes the project with suggestions for additional research in this area.

3. Materials and Methods

This section describes the dataset, model architecture, training strategy, mathematical formulations, and XAI implementation details of the proposed framework for explainable emotion recognition.

3.1. Dataset Description

All experiments were performed using the Emotions for NLP dataset. Short English textual sentences comprise the dataset, which are annotated with one of the six mutually exclusive labels, that is, emotions: anger, fear, joy, love, sadness, and surprise. This representation allows the datasets to provide a balance between the positive and negative affect categories, demonstrating that the dataset is an appropriate basis for any multi-class emotion classification research. In addition, each sample follows a semicolon-delimited format, that is, text; label, thus allowing for unambiguous parsing when loading data. The dataset provides predefined training, validation, and test sets, allowing for complete reproducibility of the experimental evaluation of any independent study conducted on these dataset(s).
There were included 16,000 samples allocated to the training set, 2,000 to the validation set, and 2,000 to the test set for a total of 20,000 annotated instances in the complete Emotions-NLP dataset. The entire distribution of class labels for each sample in all three sets is presented in Table 1. It should be noted that the class distribution was highly imbalanced: joy was the largest class at slightly less than 30%, while the class was surprise at only 1.2%. The class distribution presented motivated the use of a weighted F1-score as the primary metric of evaluation, while additionally calculating class-based precision, recall, and F1 values, to ensure fair comparisons across all emotion classes.

3.2. Exploratory Data Analysis

Before training the model, exploratory data analysis (EDA) was performed to determine the statistical characteristics of the dataset and assist in decisions regarding how to preprocess it for use with deep learning. Measures of word length were computed for each sample as they fell in the corpus, including the mean (19.2 words), standard deviation (10.99), minimum (2 words), and maximum (66 words). Thus, the most frequently occurring sentence length was approximately 17 words, while the 75th percentile (or 75% of sentences would have less than or equal to) was approximately 25 words. Therefore, this confirms that almost all the samples in this dataset will fall within the maximum length of 128 tokens set forth to be used in this study. The above measured statistics demonstrate that if any words were truncated after tokenization once defined, this would only have a minor-to-inexistent effect on the semantic meaning for the majority of the training data.
The visualization panel summarized in the EDA format (Figure 2) is from the numerous experimental runs used to create the Colab notebook. The panel is made up of three individual plots: (a) The first plot illustrates the distribution of class across the six emotion categories, completely validating the finding of joy having many more observations than the remaining five categories (with surprise being overwhelmingly limited); (b) The second plot shows the distribution of text lengths as shown by word counts–the mean text length was 19.2 words, with most of the text being less than the average and being consistent with the majority of the data being social media length textual forms; and (c) The last plot was the size of the training, validation, and testing datasets, which were determined to have been split into 16,000, 2,000, and 2,000, respectively. Preprints 209019 i001Preprints 209019 i002
Table 2. Summary of regularization and overfitting prevention.
Table 2. Summary of regularization and overfitting prevention.
Regularization Technique Configuration Purpose
Dropout p = 0.1 on classification head Prevents co-adaptation of neurons
Weight Decay λ = 0.01 (AdamW) Penalizes large parameter weights
Gradient Clipping max_norm = 1.0 Prevents exploding gradients
Early Stopping Patience = 3 epochs on val F1 Halts training at optimal checkpoint
Linear LR Warmup 10% of total training steps Stabilizes early training dynamics
Best Checkpoint Saving Based on highest validation F1 Ensures optimal model is evaluated

3.3. Model Architecture

The system relies on ROBERTA-base [1] as its primary component. The ROBERTA-base model has 12 transformer encoder layers, each with 12 heads for multi-head self-attention, with a hidden dimension size of 768, totalling approximately 125 million trainable parameters. The additional amount of data and diversity used during pre-training created a richer, more transferable contextual representation than the original ENGLISH language BERT model, especially the implementation of Dynamic Masking (a process where a new token is drawn at every training epoch) and removing the next sentence prediction (NSP) objective, which Liu et al. showed to be a negative factor on the subsequent performance of downstream tasks, added to this improvement over the base BERT model. In summary, the refinements above have allowed for more robust and transferable contextual representation(s) than previous BERT models.
The modified pre-trained RoBERTa-base model was extended with a linear classification head for the task of fine-tuning on six-class emotion classification. This classification head has two components: a dropout layer where the dropout probability is set as p = 0.1, and a fully connected linear layer projecting the 768-dimensional encoding of the special [CLS] token (which provides the global sentence-level context) into a 6-dimensional output logit vector (1 logit/class of emotion). The output logit vectors were transformed using the softmax function to create the final normalized class probability distribution over the six emotion categories. The entire model (both the pre-trained RoBERTa encoder and classification head) was jointly fine-tuned end-to-end on the training data.
Figure 3 illustrates the high-level design of the end-to-end X-ER system connecting raw text input to the tokenizer, ROBERTA encoder, classification head, and four independent XAI analysis modules.

3.4. Training Configuration and Optimization

The model was optimized using Adaptive Moment Estimation with Weight Decay (AdamW). This optimizer extends the standard Adam optimizer by adding a decoupled weight decay regularization term (that is, a regularization term acting directly on the parameters rather than on the gradient update), allowing for more stable and appropriately regularized fine-tuning of large transformer (pretrained) language models. A learning rate of 2 × 10⁻⁵ was used during training. This is an appropriate value for many transformer fine-tuning tasks based on the existing literature. A weight decay factor of 0.01 was used for all parameters other than the bias and layer norm parameters to avoid overfitting.
A linear learning schedule using a linear warming period of 10% of the total training steps will be used here. Learning rates in the warming phase will linearly increase from 0.0 to a 2 * 10-5 learning rate, and then will again linearly reorder back to 0.0 over the remaining training steps. This approach provides improved stabilization of the training dynamics during the early days of the adjusted transformer models. Gradient clipping of an absolute value of 1.0 occurs at each training iteration to prevent exploding gradients from creating instability during training of deep transformer networks. The number of epochs for model training was set to a maximum of 10 and was stopped if the validation weighted F1-score did not improve for three epochs in a row. When stopping, the best performing checkpoint will automatically be saved to perform the final evaluation on the test set. All experiments were run using the same random seed (42) to enable full reproducibility. All hyperparameters used in the training are summarized in Table 3.

3.5. Training Algorithm

describes the complete training method for the proposed X-Emotion model. The algorithm captures the entire process from data loading and tokenization to training with early stopping after each epoch, including forward, loss, back-propagation, gradient clipping, parameter updates, and model checkpoints of the best models. Preprints 209019 i003

3.6. Mathematical Formulations

This subsection presents the five core mathematical equations that underpin the classification objective and the XAI attribution components of the proposed framework. Each equation is accompanied by a precise definition of all constituent terms.
Equation (1) — Softmax Classification Output:
The raw output logits z = [z₁, z₂, ..., z₆] produced by the linear classification head are normalized into a valid class probability distribution via the softmax function:
P ( y = k x ) = e x p ( z k ) j = 1 6 e x p ( z j ) , k { 1,2 , 3,4 , 5,6 } ( 1 )
where z k is the k-th logit output corresponding to emotion class k, and P ( y = k x ) denotes the predicted probability of class k given input text x. The predicted emotion label is obtained as ŷ = argmax . k   P y = k x . Equation (2) — Categorical Cross-Entropy Loss:
The model is trained by minimizing the categorical cross-entropy loss function over the training set:
L = 1 N i = 1 N k = 1 6 y i k l o g P ^ ( y i k ) ( 2 )
where N is the total number of training samples, y i k is the binary one-hot encoded ground-truth indicator for sample i and class k ( y i k = 1 if the true label of sample i is k, and 0 otherwise), and P ^ ( y i k ) is the model’s predicted probability for class k of sample i as computed by Equation (1).
Equation (3) — Integrated Gradients Attribution:
Integrated Gradients [3] attributes the model’s prediction to each input token embedding dimension d by integrating the gradient of the model output F(x) along a straight-line interpolation path from a zero-vector baseline embedding x′ to the actual input embedding x:
I G d ( x ) = ( x d x d ' ) α = 0 1 F ( x ' + α ( x x ' ) ) x d d α ( 3 )
where x d is the d-th dimension of the input token embedding, x d ' is the corresponding baseline dimension, α ∈ [0,1] is a scalar interpolation parameter that linearly traverses the path from baseline to input, and F (·) is the scalar model output (class logit) being attributed. In practice, the integral is approximated using the trapezoidal rule with m = 50 uniformly spaced interpolation steps, providing a numerically stable and computationally tractable estimate.
Equation (4) — SHAP Shapley Value:
The SHAP attribution value for input feature i is derived from the Shapley value formulation in cooperative game theory [4]:
ϕ i ( f , x ) = S F { i } S ! ( F S 1 ) ! F ! [ f ( S { i } ) f ( S ) ] ( 4 )
where F represents the entire collection of features that can be passed to any given model (i.e., the complete collection of tokens); S is any collection of features that does not contain feature i; f(S) gives us the predicted output of the model when only the features contained in S are available and all other features are excluded; and the combination of coalitions includes the number of ways that the coalition could have been created (i.e., how many different permutations of features exist for this group). Therefore, the SHAP value for feature i, ϕ i ( f , x ), measures the average contribution of feature i to the model predictions over all feature combinations and provides an acceptable attribution for feature i.
Equation (5) — Weighted F1-Score:
Given the pronounced class imbalance in the dataset, the weighted F1-score is adopted as the primary performance metric:
F 1 weighted = k = 1 6 n k N 2 Precision k Recall k Precision k + Recall k ( 5 )
where n k is the number of ground-truth samples belonging to class k in the test set, N is the total number of test samples.

3.7. Theoretical Guarantee: Completeness of Integrated Gradients

Integrated Gradients (IG) are underpinned by the Completeness Axiom of the Theory of Mathematical Guarantees, which is one of two Fundamental Axioms of Integrated gradients in attribution methods. The Completeness Axiom provides a mathematical proof of correctness that is absent from simpler alternative attribution methods based on gradients.
Theorem 1 (Completeness of Integrated Gradients [3]): Let F: ℝᵈ → ℝ be a differentiable model output function, x ∈ ℝᵈ be the input, and x′ ∈ ℝᵈ be the baseline. Then the Integrated Gradients attributions I G d (x) as defined in Equation (3) satisfy the Completeness Axiom:
d = 1 D I G d ( x ) = F ( x ) F ( x ' ) ( 6 )
That is, the sum of all token-level IG attributions exactly equals the difference between the model’s output at the actual input x and its output at the baseline x′.
Significance: The Completeness Axiom is an important assurance of authenticity and accuracy in regards to an IG attribution: An IG attribution calculation is not a heuristic estimate; it is mathematically limited and can only have an aggregated change in output equal to the change from baseline. The properties that are violated when using standard gradient methods as well as gradient x input, yield IG attributions to be more credible and dependable when it comes to performing a detailed model audit supporting regulatory compliance. In our case, the zero-filled vector acts as a baseline x’ (F(x′) ≈ 0) so that IG attributions computed across all token embedding dimensions account/address for the total prediction made by the model which ensures that the explainability analysis considers both all elements in the explanation and all.

3.8. XAI Methods Implementation

After the Fine-tuning of the RoBERTa model was completed, four XAI methods were implemented on the trained RoBERTa model. The four complementary methods were adopted iteratively, with each method covering a distinct scope of coverage and attribution paradigm to provide a holistic view of how the model arrived at its prediction or decision through multiple analytical lenses.
Using SHAP was very important for using shapExplainer, and this created a display that combined the HuggingFace pipeline wrapper, a fine-tuned RoBERTa model, and a unified prediction function to evaluate SHAP values for each instance from a stratified random sample of 100 test instances sampled proportionally to the six emotion classes, which showed a fair distribution of global attribution analysis on each of the emotion classes in the global attribution analysis of the SHAP values. The SHAP values for all test instances were aggregated to create a global bar chart that displays the mean absolute SHAP value ranked from highest to lowest according to the word token identifier of every word that was present in the entire test corpus. I also developed a SHAP waterfall plot relative to the total number of instance contributions towards the final classification of just one test instance to provide a visual of how instance-level token contributions cumulatively affected the instance’s final classification.
LIME was set up using a LimeTextExplainer to produce explanations based on 10 perturbed features per explanation and 500 perturbed samples for each instance, thereby balancing explanation fidelity with computational expense. LIME works by taking the input sentence and producing perturbed versions of the input by randomly masking a single word on each sample and then querying the RoBERTa model for predictions on each perturbed sample, which allows for the creation of a sparse weighted linear regression model that fits the local decision boundary. The instance-level LIME weight distributions were visualized for one emotion class from the six possible emotion classes, creating a panel of six bar charts that illustrated the contribution of context-dependent tokens.
To obtain Attention Visualization, attention weight tensors were extracted directly from the last (12th) transformer encoder layer of the RoBERTa model fine-tuned for our task. This was achieved using the output_attentions = True flag when making calls to the forward pass. The extracted attention tensor was of shape [heads x seq_len x seq_len], and after averaging all 12 head tensors, the average value across heads was calculated, resulting in one averaged head attention matrix. This matrix was then visualized as a token-to-token heatmap for the test sentence. The intensity of each cell within the matrix represents the size of the total attention score for the given token pair. Therefore, if there is a strong mutual connection through dark cells within the heatmap, there are strong attention connections between the token pairs during the final encoded representation of the model.
The Captum IntegratedGradients module was applied to the input token embedding layer of the fine-tuned RoBERTa model through the application of Integrated Gradients. The attribution baseline x′’ is a zero-vector embedding tensor of the same shape as the input. Fifty interpolation steps were taken within the attribution space from the baseline and entry embedding, with trapezoidal numerical integration performed using the method outlined in Equation (3) as follows: The resulting attribution tensors had a shape of [seq_len × embedding_dim] and were L2-normalized per token into scalars that could be visualized. The normalized token-level IG attribution scores are displayed as horizontal bar charts for one example per emotion class, with positive attribution bars colored teal and negative attribution bars colored coral. This allowed for good visual contrast between tokens contributing to and detracting from supporting the claimed emotion class.

4. Results

The experimental findings of the XAI-based emotion recognition framework are presented in the following sections. Various metrics (e.g., classification accuracy and confusion matrices) were analyzed based on both performance-based (e.g., SHAP and LIME) and human-based (e.g., attention visualization and integrated gradients) methods. A comparative analysis with traditional machine learning-based emotion recognition systems was conducted to assist the reader in understanding how the developed framework performs relative to previous models.

4.1. Training Dynamics and Convergence Analysis

The overall performance of the model was good during training. Therefore, the training process was completed in 10 complete epochs without having to stop early. All 10 epochs were built on each other and produced better validation results because the model improved on their respective validation datasets. Figure 4 shows how the training process developed over the completion of 10 epochs by providing three entirely differently synchronized subplots for the two datasets (Training and Validation) using three separate metrics: Cross Entropy Loss (see Eq. 2), Accuracy; and F1 Score (see Eq. 5).
The loss curves were stable and continually decreased without any type of divergence or overfitting demonstrated on either validation set. Loss of training decreased from approximately 0.85 at epoch 1 to less than 0.10 at epoch 10, while loss of validation experienced an absolutely close parallel trajectory indicating that regularization used (i.e., Dropout (p=0.0–0.9), AdamW Weight Decay (λ = 0.01), Gradient Clipping (Max Norm = 1.0), Linear LR Warmup) effectively controlled the Generalization Gap for the entire duration of training. In a formal way, the Generalization Gap at epoch t can be defined as follows:
Δ t = L val t L train t ( 7 )
The data obtained through multiple epochs support the view that this model does not exhibit overfitting to the training distribution because it is consistently close to zero and stable when viewed across time (near zero). The training of the model converged quickly after four epochs (i.e., a plateau of validation accuracy and weighted F1-score > 0.92). The best model checkpoint (epoch 10) was also found to have the highest weighted F1-score overall on the validation set (0.927) and was used in subsequent evaluations of the test set.

4.2. Overall Test Set Performance

After the model was finished being trained, the highest checkpoint from epoch 10 was loaded back into memory and then tested on 2,000 held out test samples. The overall performance measures are summarized below. Let T P k , F P k , F N k represent true positives, false positives, and false negatives, respectively, for a given class k. The formulas below calculate the per-class precision and recall:
Precision k = T P k T P k + F P k , Recall k = T P k T P k + F N k ( 8 )
The overall accuracy across all N = 2,000 test samples is computed as:
Accuracy = k = 1 6 T P k N ( 9 )
Across all classes combined, the model achieved an overall test accuracy of 92.4%, weighted precision of 92.5%, weighted recall of 92.4%, weighted F1 score of 92.5%, and macro average ROC AUC (area under curve of receiver operating characteristic) of 98.9%. The ROC AUC was computed using the one vs rest (OVR) multi-class method as two classes (K=6) were used as emotion classes and AUC ( ROC k )=Area of the receiver operating characteristic curve corresponding to class k against all classes other than k. The near perfect ROC AUC (99.7%) indicates the model’s probabilistic outputs were well calibrated and highly discriminative for emotion classes (including minority emotion classes).
ROC - AUC macro = 1 K k = 1 K AUC ( ROC k ) ( 10 )
Table 4 shows the classification metrics across individual emotion classes and Figure 5 visualizes the classification metrics grouped by precision, recall and F1 score for individual emotion classes

4.3. Comparative Analysis Against Baseline Models

To define and clearly explain how the performance of our RoBERTa-based framework compares with existing baselines, we used a variety of baseline models, including (i) traditional machine learning models, (ii) classical deep learning models, and (iii) other pretrained transformer architectures. For all baselines, we trained and evaluated them under identical experimental conditions (i.e., using the same dataset splits as well as preprocessing, training budget, and evaluation metrics) so that there was no bias between the compared models in this study as shown in Table 5 and Figure 6.
Table 5. Comparative performance of the proposed RoBERTa-base + multi-XAI framework.
Table 5. Comparative performance of the proposed RoBERTa-base + multi-XAI framework.
Model Accuracy Precision Recall F1-Score ROC-AUC Citation
TF-IDF + Logistic Regression 0.7680 0.767 0.768 0.765 0.908 [16]
TF-IDF + SVM (Linear) 0.7830 0.782 0.783 0.781 0.921 [16]
CNN + Word2Vec Embeddings 0.8120 0.811 0.810 0.809 0.941 [17]
BiLSTM + GloVe Embeddings 0.8360 0.834 0.833 0.832 0.953 [18]
DistilBERT (fine-tuned) 0.8840 0.883 0.884 0.881 0.991 [19]
BERT-base (fine-tuned) 0.9010 0.900 0.901 0.899 0.994 [20]
XLNet-base (fine-tuned) 0.9100 0.909 0.910 0.908 0.995 [21]
RoBERTa-base + Multi-XAI (Proposed) 0.9245 0.925 0.924 0.925 0.997 This work
Table 6. Computational cost and scalability comparison.
Table 6. Computational cost and scalability comparison.
XAI Method Computation Type Time per Sample Scalability Requires Model Access
SHAP Coalition sampling ~45–120 sec Low — O(2ⁿ) Black-box
LIME Perturbation sampling ~8–15 sec Medium — O(n·k) Black-box
Attention Visualization Single forward pass < 0.1 sec High — O(T²) White-box
Integrated Gradients m gradient computations ~2–5 sec Medium — O(m·d) White-box
Figure 6. Comparison of Accuracy and F1 Score across baseline.
Figure 6. Comparison of Accuracy and F1 Score across baseline.
Preprints 209019 g006
Figure 7. Performance trend of Accuracy and F1 Score across models.
Figure 7. Performance trend of Accuracy and F1 Score across models.
Preprints 209019 g007

4.4. Confusion Matrix Analysis

As illustrated in Figure 8, we can see the confusion matrix for the test dataset (raw counts) as well as the normalized confusion matrix (row-normalized) for the same dataset (see Appendix I for all the underlying values). Let C ∈ ℝ^ {6×6} be the confusion matrix where the entry C i j is equal to the number of test samples with true class i predicted to belong in class j. The normalized confusion matrix (C̃) is calculated as follows:
C ~ i j = C i j j = 1 6 C i j ( 13 )
such that each row sums to unity, enabling direct comparison of per-class recognition rates regardless of class frequency differences.

4.5. SHAP Explainability Analysis

In Figure 9 we can see the SHAP Global Feature Importance (GFI) bar chart with the tokens ranked by their mean absolute SHAP attribution value ϕ i over the full test set of 100 stratified samples.
ϕ ˉ i = 1 M m = 1 M ϕ i ( f , x m ) ( 14 )
where M = 100 is the number of test samples used to compute the global SHAP attribution analysis, and ϕ i (f, x^{(m)}) is the SHAP attribution value of token i for sample m defined in equation (4).
The token ‘cold’ had the highest average |SHAP| value of +0.84, making it the most globally predictive token of all time across the entire test dataset. Other high-ranking SHAP tokens included ‘outraged’ (mean SHAP = +0.80), ‘fearful’ (mean SHAP = +0.54), ‘hated’ (mean SHAP = +0.53), ‘really’ (mean SHAP = +0.51), and ‘irritated’ (mean SHAP = +0.47). The majority of lexically explicit emotional markers scored among the leading SHAP tokens, indicating that the model has adjusted its predictions to rely on semantically cohesive and linguistically interpretable features rather than positional, syntactic, or random correlate artifacts.
Figure 10 illustrates a single-sample SHAP Waterfall plot for a representative test instance. This plot illustrates how the individual token SHAP values cumulatively shift the output predicted by the model from its expected baseline of E[f(x)] = 0.999 to what the model outputted as the final prediction. The SHAP Waterfall graph breaks the output down into its component parts that contribute toward the final class prediction from the input.
F ( x ) = E [ f ( x ) ] + i = 1 T ϕ i ( f , x ) ( 15 )
where T (= number of tokens in the input sequence) and E[f(x)] are both calculated from the background reference distribution of the overall Shapley value of the model output). Tokens with a positive SHAP value (as represented by red bars) will push the overall output prediction of the model toward that class, whereas tokens with a negative SHAP value (as represented by blue bars) will depress the model’s prediction toward that class.

4.6. LIME Explainability Analysis

As depicted in Figure 11, there are six bar charts (one per emotional category) in the LIME local explanation panel showing the top ten most influential tokens identified from a particular representative sample of the emotional category. The LIME weight, which is assigned to token i from a particular sample, is calculated using coefficients from the fitted sparse linear surrogate model locally:
f ^ LIME ( z ) = i = 1 T w i z i ( 16 )
where z i = {0, 1} indicates whether token i is present in a perturbation of the original input or not, and w i   is the weight assigned to to-ken i in the sampled instance from the linear regression model through the minimization of a locally weighted least squares cost function. Tokens that were assigned positive weights, w i > 0 (blue bars), are indicators that the token contributes positively to the prediction of the respective emotion; whereas, tokens that were assigned negative weights, w i < 0 (red bars), are indicators that the token contradicts the predicted emotion.
In the ‘anger’ case, both ‘anger’ and ‘end had the highest total LIME weight, whereas in the ‘joy’ case, the main contributing terms were ‘optimistic’, ‘feeling’, and ‘arrived.’ The term ‘stunned’ had the highest LIME coefficient in the surprise analysis. One of the most interesting findings is that the LIME weight rankings on a sample-by-sample basis are often substantially different from the global SHAP token rankings. This indicates that the local importance of individual tokens is highly dependent on the specific context of each instance and that aggregated global SHAP rankings do not adequately account for the variation in the contributions of individual tokens at individual instances. This complementary relationship between SHAP and LIME is exactly what underlies the motivation for developing a multi-XAI framework, which is described in the following section.

4.7. Attention Visualization Analysis

As shown in Figure 12, a last-layer head-averaged attention heatmap for the sample test sentence “I’m feeling rather rotten so I’m not very ambitious right now” was obtained by averaging 12 attention heads (H=12) over H in the last encoder layer with respect to T terms in the encodings. The attention matrix A ∈ ℝ^{T×T} is calculated as the average of all H = 12 attention heads in the final layer of the encoder.
A = 1 H h = 1 H A h ( 17 )
The attention weight from token i to token j, which is produced by the head in the attention mechanism, is denoted as A^{(h). This weight represents the association between tokens i and j and is normalized using the softmax function over all key token positions. The mean strength of directional attention from token i to token j across all 12 heads in the last encoder layer is indicated by element A in the mean attention matrix.
The heatmap also demonstrates that the three tokens ‘rotten’, ‘feeling, and ‘rather’ received the highest levels of cross-token attention. This suggests that the last encoder layer of the model places the majority of its representative focus on the negative-valued lexical anchors that best predict the sadness label for this particular sample. The special [CLS] token, which is used to store sentence-level context for the classification head, has a distributed pattern of attention to all content tokens; this is consistent with its role in aggregating data for a global classification system. The [SEP] token has a much different, more localized pattern of attention compared to other tokens; this is reflective of its structural function. These patterns correspond with the known properties of the attention mechanism of transformers used in fine-tuned language models for sequence classification and lend qualitative evidence for the semantic focus of the model.

4.8. Integrated Gradients Analysis

Figure 13 shows the integrated gradient token attributions of a representative token from each emotion class. As laid out in Equation (3) and The Completeness Theorem (Theorem 1), each token is assigned a per-token scalar attribution score, which is calculated as the L2 aggregation of the attributions for all the embedding dimensions (d) of the token t via
s t = I G ( x t ) 2 = d = 1 D I G d ( x t ) 2 ( 18 )
where Where D is the total number of embedding dimensions (768 for RoBERTa-base) and I G ( x t ) is the Integrated Gradients, attribution assigned to token t for embedding dimension d per Equation (3). The s t scores are L2 normalized across all tokens in the sequence to create a unit-length normalized attribution distribution (i.e., proportion of total score) that will allow comparing attributions across the samples shown above.
The maximum normalized attribution score for the anger class is 1; therefore, the model makes a correct and direct association between the word anger and the model prediction. For fear, the words uncomfortable and shop had the most positive attributions to the integrated gradient; therefore, words that provide context and a description of an uncomfortable place are also helpful in classifying fear. For joy, the words arrived, deft, and slender had the highest attribution and provided insight into the nuanced expressions of joy in the data. The properties of the integrated gradient, both axiom-satisfying (sensitivity and implementation invariance) and established in Theorem 1, ensure the theoretical soundness of the attribution at the token level, the mathematical completeness of the attribution by summing to the total prediction change from the baseline, and greater reliability of the attribution at the token level than other attribution methods (e.g., traditional gradients, attention-based attribution, etc.) for conducting a proper audit of a model.

5. Discussion

The results in Section 4 show that fine-tuning RoBERTa-base with an appropriate set of regularization techniques (such as weight decay AdamW, dropout, gradient clipping, and early stopping) yields excellent performance in emotion classification on the Emotions for NLP task. The model achieved a weighted F1 of 0.925 and a macro-averaged ROC-AUC of 0.997, outperforming all baseline models presented in Table 5. Of the transformer-based baseline models, XLNet-base had the next closest performance, with weights of F1 = 0.908 and ROC-AUC = 0.995, followed by BERT-base, with F1 = 0.899 and ROC-AUC = 0.994. Thus, the proposed model improved the XLNet-base by 0.0168 in weighted F1 and 0.0017 in ROC-AUC, which shows a statistically significant improvement over XLNet-base while both are using the same set of transformer encoder architecture. In addition, Kamath et al. [9] found competitive F1 scores when using RoBERTa-based emotion detection methods but did not evaluate their models using explainability metrics; therefore, their predictions were completely opaque. Yan et al. [6] provided strong classification results with RoBERTa by incorporating GNNs into their model for classifying emotions on social media platforms.
Multiple XAI analyses showed significant agreement between the three different methods used to understand how the model internally processes its decisions. This study found that there is high agreement on the specific tokens that are most predictive; for instance, high attribution scores were assigned to emotionally-expressive words such as “cold” (SHAP φ̄ = 0.84), “outraged” (φ̄ = 0.80), “fearful” (φ̄ = 0.54), or “optimistic.” The level of agreement across these methods provides strong evidence that the model has learned how to predict based on genuine emotion-based semantic features rather than relying solely on surface-level correlations or artifacts in the analysis dataset.
Additionally, the four Explainable Artificial Intelligence techniques provided offer different yet complementary approaches at an analytical level to help form a more comprehensive understanding of a model’s behavior than any singular technique could generate on its own. SHAP provides an overarching understanding at the dataset level of which tokens (features) are universally significant across all predictions and allows for audit-level transparency and the ability to detect biases systemically. In contrast, LIME reveals how token significance changes contextually at the individual sample level and that a global attribution may not accurately represent the variability of the local decision boundaries. Attention Visualization offers a unique structural understanding of token-to-token connectivity in the transformer’s self-attention mechanism, which cannot be captured by gradient-based and perturbation-based methods.
Integrated Gradients offers the most theoretically rigorous local attribution model satisfying the requirements of Completeness, Sensitivity and Implementation Invariance, as found in Theorem 1, thus providing mathematically sound and verifiable attributions that are not provided by LIME and attention-based methods. Therefore, a practitioner implementing this model in production can rely on SHAP to generate audit-level global feature reporting, LIME to communicate individual predicted outcomes to the end-user, attention maps to debug structural attention anomalies, and IG to document formal regulatory-compliant documentation.
The analysis performed using a per-class analysis identified the greatest limitation of the current framework: the inability to adequately represent surprise in the training data. With only 172 training examples compared with 4,155 for joy, it had the lowest per-class F1 score at 0.79 with a recall of 0.80. Similarly, love has a relatively low F1 score of 0.82 because of its large semantic overlap with joy in positive affect language contexts. Future work should methodically evaluate data augmentation techniques, back-translation, contextual paraphrase generation, and synthetic sample generation using large language models to reduce this inherent structural class imbalance, thus enhancing the recognition of the minority class.
Two additional limitations deserve mention. Initially, the attributions for Integrated Gradients and attention maps are calculated at the embedding level of subword tokens, leading to non-human-readable explanations at the subword level regarding interpretation based on BPE tokenization for RoBERTa. Future works should investigate possible means of aggregating subword tokens into larger parts that impart to either of these two the explanation of the word subword aggregate to enhance their ability to accurately depict the correct interpretation of plateaus in the value of random variable `y`, inline with `z=0` to produce accurate and significant descriptions of a word entity, thus increasing the readability of the explanation. Second, the computational overhead incurred by querying the model multiple times through the generation of multiple coalition feature samples for non-conditioned explanations of a coalition of features means that SHAP is not scalable for real-time production environments, were from sheer sample size alone, a model will require running within large quantities of data. Alternatives to these conditions provide viable approaches to large-scale production operating under the SHAP paradigm without sacrificing theoretical guarantees based on the defined theoretical Shapley framework using fewer background samples.

6. Conclusions

This study laid out an entire explainable emotion recognition framework from the combination of four complementary techniques of SHAP, LIME, attention visualization, and integrated gradients, built on the RoBERTa-base transformer model with a whole experimental pipeline that is reproducibly integrated. After being fine-tuned on the Emotions for NLP benchmark of 20,000 annotated emotions of six different types, the entire framework achieved an accuracy of 0.924, weighted F1 of 0.925, and macro-average ROC-AUC of 0.997, precision of 0.925 on the test set. The results of this new architecture for the emotion recognition model provided a highly competitive performance benchmark and outperformed all baseline comparisons used in the study. The use of the four XAI techniques datasets together showed a better characterization of how the model made its predictions about emotions by adding to what one of these XAI techniques alone could have provided to explain themselves fully and without any redundant information.
From these findings, four main conclusions can be drawn. First, the RoBERTa-base model fine-tuned with the right forms of regulation produces a state-of-the-art level of performance on the six-class emotion-based classification benchmark without requiring any changes to its architecture. Second, the four explained artificial intelligence (XAI) techniques provide truly different types of explanations about how a model arrives at outputs; therefore, no single explanation technique defines all information about a model’s inner workings. Third, the high degree of agreement across the four XAI approaches when determining that emotionally evocative vocabulary words were the strongest predictive features demonstrates that the RoBERTa-base model has learned meaningful semantic representations and not just non-meaningful associations among attributes. Finally, the low per-class F1-score for the “surprise” class was attributed to the extreme class imbalance in the training dataset, which is a limitation of this study and a clear area for improvement in future research efforts.
This project has many potential directions for future research. Multilingual emotion recognition is the next logical step using the following language model: multilingual RoBERTa variants. Several effective data augmentation methods (e.g., back-translation, paraphrase generation, and synthetic sample creation using large language models) can be developed to ameliorate issues related to class imbalances within minority emotion categories. The application of our findings to real-time social media monitoring tools for mental health surveillance is significant. In terms of explainability, the application of concept-based XAI methods (e.g., TCAV) is promising for producing higher-level semantic explanations in addition to token-level attribution. Additionally, one further challenge to the research community at large that remains to be addressed is the development of a common set of unified XAI evaluation metrics to objectively assess (1) faithfulness, (2) stability, and (3) human interpretability across various approaches to explainability.

Author Contributions

Conceptualization, M.A. and N.R.; methodology, M.A. and W.A.; software, M.A. and A.A.; validation, N.R., W.A. and A.A.; formal analysis, M.A. and N.R.; investigation, M.A. and A.A.; resources, W.A. and N.R.; data curation, A.A. and M.A.; writing — original draft preparation, M.A.; writing — review and editing, N.R., W.A., A.A. and D.A.D.; visualization, M.A. and A.A.; supervision, N.R., W.A. and D.A.D.; project administration, M.A., W.A. and M.Ar. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset used in this study is publicly available on Kaggle at https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp . The experimental code and trained model checkpoint are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  2. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  3. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 3319–3328. [Google Scholar]
  4. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates: Red Hook, NY, USA, 2017; Volume 30, pp. 4765–4774. [Google Scholar]
  5. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  6. Yan, X.; Liu, Z.; Wang, G. Emotion-RGC Net: A Novel Approach for Emotion Recognition in Social Media Using RoBERTa and Graph Neural Networks. PLOS ONE 2025, 20, e0318524. [Google Scholar] [CrossRef] [PubMed]
  7. Chefer, H.; Gur, S.; Wolf, L. Transformer Interpretability Beyond Attention Visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 782–791. [Google Scholar]
  8. Deng, J.; Ren, F. A Survey of Textual Emotion Recognition and Its Challenges. IEEE Trans. Affect. Comput. 2023, 14, 49–67. [Google Scholar] [CrossRef]
  9. Kamath, R.; Ghoshal, A.; Eswaran, S.; Honnavalli, P. An Enhanced Context-Based Emotion Detection Model Using RoBERTa. In Proceedings of the IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022. [Google Scholar] [CrossRef]
  10. Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Klibert, N.; Fan, N.; Araya, S.; et al. Captum: A Unified and Generic Model Interpretability Library for PyTorch. arXiv 2020, arXiv:2009.07896. [Google Scholar] [CrossRef]
  11. Salih, A.; Galazzo, I.B.; Cruciani, F.; Brusini, L.; Radeva, P.; Menegaz, G. A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME. Adv. Intell. Syst. 2023, 2400304. [Google Scholar] [CrossRef]
  12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  13. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  14. Rathod, M.; Dalvi, C.; Kaur, K.; Patil, S.; Gite, S.; Kamat, P.; Kotecha, K.; Abraham, A.; Gabralla, L.A. Kids’ Emotion Recognition Using Various Deep Learning Models with Explainable AI. Sensors 2022, 22, 8066. [Google Scholar] [CrossRef] [PubMed]
  15. Kusal, S.; Patil, S.; Choudrie, J.; Kotecha, K.; Vora, D.; Pappas, I. A Review on Text-Based Emotion Detection: Techniques, Applications, Datasets, and Future Directions. arXiv 2022, arXiv:2205.03235. [Google Scholar]
  16. Cahyani, D.E.; Patasik, I. Performance Comparison of TF-IDF and Word2Vec Models for Emotion Text Classification. Bull. Electr. Eng. Inform. 2021, 10, 2780–2788. [Google Scholar] [CrossRef]
  17. Xu, G.; Meng, Y.; Qiu, X.; Yu, Z.; Wu, X. Sentiment Analysis of Comment Texts Based on BiLSTM. IEEE Access 2019, 7, 51522–51532. [Google Scholar] [CrossRef]
  18. Xiaoyan, C.; Qihua, L.; Jianguo, Y. GloVe-CNN-BiLSTM Model for Sentiment Analysis on Text Reviews. J. Sensors 2022, 7212366. [Google Scholar] [CrossRef]
  19. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  20. Areshey, A.; Mathkour, H. Transfer Learning for Sentiment Classification Using Bidirectional Encoder Representations from Transformers (BERT) Model. Sensors 2023, 23, 5232. [Google Scholar] [CrossRef] [PubMed]
  21. Adoma, A.F.; Henry, N.; Chen, W. Comparative Analyses of BERT, RoBERTa, DistilBERT, and XLNet for Text-Based Emotion Recognition. In Proceedings of the 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 18–19 December 2020; pp. 117–121. [Google Scholar]
  22. Cortiz, D. Exploring Transformers in Emotion Recognition: A Comparison of BERT, DistilBERT, RoBERTa, XLNet and ELECTRA. In Proceedings of the 3rd International Conference on Control, Robotics and Intelligent System (CCRIS), Virtual, August 2022. [Google Scholar] [CrossRef]
  23. Raza, M.A.; Fränti, P. A Hierarchical Gamma Mixture Model-Based Method for Classification of High-Dimensional Data. Entropy 2019, 21, 906. [Google Scholar] [CrossRef]
  24. Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. A Systematic Review and Experimental Evaluation of Classical and Transformer-Based Models for Urdu Abstractive Text Summarization. Information 2025, 16, 784. [Google Scholar] [CrossRef]
  25. Balaji, R.L.; Thiruvenkataswamy, C.S.; Batumalay, M.; Duraimutharasan, N.; Devadas, A.D.T.; Yingthawornsuk, T. A Study of Unified Framework for Extremism Classification, Ideology Detection, Propaganda Analysis, and Flagged Data Detection Using Transformers. J. Appl. Data Sci. 2025, 6, 1791–1810. [Google Scholar] [CrossRef]
  26. Azhar, M.; Amjad, A.; Dewi, D.A.; Kasim, S. Efficient Transformer-Based Abstractive Urdu Text Summarization Through Selective Attention Pruning. Information 2025, 16, 991. [Google Scholar] [CrossRef]
  27. Cheema, A.S.; Azhar, M.; Arif, F.; ul Haq, Q.M.; Sohail, M.; Iqbal, A. EGPT-SPE: Story Point Effort Estimation Using Improved GPT-2 by Removing Inefficient Attention Heads. Appl. Intell. 2025, 27 55, 994. [Google Scholar] [CrossRef]
Figure 1. Taxonomy of Explainable Artificial Intelligence (XAI) Methods.
Figure 1. Taxonomy of Explainable Artificial Intelligence (XAI) Methods.
Preprints 209019 g001
Figure 2. EDA visualization panel.
Figure 2. EDA visualization panel.
Preprints 209019 g002
Figure 3. Proposed explainable emotion recognition pipeline.
Figure 3. Proposed explainable emotion recognition pipeline.
Preprints 209019 g003
Figure 4. Model Training Curves.
Figure 4. Model Training Curves.
Preprints 209019 g004
Figure 5. Per-Class Classification Metrics Bar Chart.
Figure 5. Per-Class Classification Metrics Bar Chart.
Preprints 209019 g005
Figure 8. Confusion Matrix Analysis.
Figure 8. Confusion Matrix Analysis.
Preprints 209019 g008
Figure 9. SHAP Global Feature Importance.
Figure 9. SHAP Global Feature Importance.
Preprints 209019 g009
Figure 10. SHAP Waterfall Plot: Single Sample Token-Level Cumulative Contribution.
Figure 10. SHAP Waterfall Plot: Single Sample Token-Level Cumulative Contribution.
Preprints 209019 g010
Figure 11. LIME Local Explanations.
Figure 11. LIME Local Explanations.
Preprints 209019 g011
Figure 12. Attention Heatmap: Last Layer Head-Averaged for Sample Sentence.
Figure 12. Attention Heatmap: Last Layer Head-Averaged for Sample Sentence.
Preprints 209019 g012
Figure 13. Integrated Gradients and Token Attribution Per Emotion Class.
Figure 13. Integrated Gradients and Token Attribution Per Emotion Class.
Preprints 209019 g013
Table 1. Class distribution of the Emotions for NLP dataset across training, validation, and test splits.
Table 1. Class distribution of the Emotions for NLP dataset across training, validation, and test splits.
Emotion Class Train Validation Test Total %
Anger 2,062 274 275 2,611 14.5
Fear 1,555 207 224 1,986 11.0
Joy 4,155 551 695 5,401 30.0
Love 1,027 136 159 1,322 7.3
Sadness 2,104 279 581 2,964 16.5
Surprise 172 23 16 211 1.2
Total 16,000 2,000 2,000 20,000 100
Table 3. Model architecture and training hyperparameters.
Table 3. Model architecture and training hyperparameters.
Hyperparameter / Setting Value
Base Model RoBERTa-base (125M parameters)
Tokenizer Byte-Pair Encoding (BPE), max length 128 tokens
Optimizer AdamW
Learning Rate 2 × 10⁻⁵
Weight Decay 0.01
Batch Size 32 (train), 64 (eval)
Max Epochs 10
Early Stopping Patience 3 epochs (based on val F1)
Dropout 0.1
Gradient Clipping max_norm = 1.0
LR Scheduler Linear warmup with decay
Warmup Steps 10% of total training steps
Random Seed 42
Hardware NVIDIA Tesla T4 GPU (Google Colab)
Table 4. Per-class and averaged classification metrics on the held-out test set (2,000 samples).
Table 4. Per-class and averaged classification metrics on the held-out test set (2,000 samples).
Emotion Class Precision Recall F1-Score Support
Anger 0.93 0.91 0.92 275
Fear 0.88 0.89 0.88 224
Joy 0.96 0.94 0.95 695
Love 0.81 0.84 0.82 159
Sadness 0.95 0.97 0.96 581
Surprise 0.77 0.80 0.79 16
Macro Avg 0.88 0.89 0.89 1,950
Weighted Avg 0.925 0.924 0.925 1,950
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated