Submitted: 10 July 2024
Posted: 11 July 2024
Abstract
Keywords:
1. Introduction
- To develop MAViTE-Bangla, a multimodal Bangla emotion dataset containing 1,002 samples labeled into four classes: anger, fear, joy, and sadness.
- To develop a pairwise cross-attention-based multimodal framework for emotion recognition, exploring several feature extraction and fusion methods to exploit multimodal features for multimodal emotion classification (MEC).
- To analyze the classification outcomes of the proposed method, with a detailed investigation of the misclassified samples.
2. Related Work
2.1. Unimodal-based Emotion Recognition
2.2. Multimodal-based Emotion Recognition
3. Dataset Description
3.1. Data Collection
3.2. Data Statistics
4. Methodology
4.1. Video Network
4.1.1. Preprocessing
4.1.2. Feature Extraction
4.2. Audio Network
4.2.1. Preprocessing
4.2.2. Feature Extraction
- Hand-crafted Feature Extraction:
- Deep Feature Extraction:
4.3. Text Network
4.3.1. Preprocessing
4.3.2. Feature Extraction
- Word Embedding
- Contextual Embedding
4.4. Pairwise Cross-modal Attention
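The pairwise scheme named in this section (audio-video, audio-text, video-text, in both directions) can be illustrated with a minimal NumPy sketch. It assumes plain scaled dot-product attention without learned projections, mean pooling over the query sequence, and the convention that in `Att_XY` the first letter is the query modality; none of these details are confirmed by the text, and the feature shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Let one modality (query) attend to another (context).

    query_feats: (n_q, d), context_feats: (n_c, d) -> attended: (n_q, d).
    Assumption: scaled dot-product attention with no learned weights.
    """
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (n_q, n_c)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    return weights @ context_feats                        # (n_q, d)

# Hypothetical per-modality feature sequences sharing a common dimension d.
rng = np.random.default_rng(0)
d = 8
feats = {
    "A": rng.normal(size=(5, d)),  # audio
    "V": rng.normal(size=(4, d)),  # video
    "T": rng.normal(size=(6, d)),  # text
}

# All six directed pairs; each attended sequence is mean-pooled to one vector.
attended = {}
for q_name, q in feats.items():
    for c_name, c in feats.items():
        if q_name != c_name:
            attended[f"Att_{q_name}{c_name}"] = cross_attention(q, c).mean(axis=0)
```

Each entry of `attended` is a `d`-dimensional vector, so the six vectors can be passed to any of the fusion methods of Section 4.5.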
4.4.1. Audio-Video Attention
4.4.2. Audio-Text Attention
4.4.3. Video-Text Attention
4.5. Fusion Methods
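As a rough illustration of the four fusion methods listed in this section, the sketch below applies each one to a set of equally sized feature vectors. The function name `fuse` and the toy inputs are hypothetical; the paper's actual implementation is not shown in the text.

```python
import numpy as np

def fuse(vectors, method):
    """Fuse a list of equally sized 1-D feature vectors."""
    stacked = np.stack(vectors)           # shape (n_vectors, d)
    if method == "summation":
        return stacked.sum(axis=0)        # element-wise sum, shape (d,)
    if method == "average":
        return stacked.mean(axis=0)       # element-wise mean, shape (d,)
    if method == "hadamard":
        return stacked.prod(axis=0)       # element-wise product, shape (d,)
    if method == "concatenation":
        return np.concatenate(vectors)    # shape (n_vectors * d,)
    raise ValueError(f"unknown fusion method: {method}")

# Toy attended features (hypothetical values).
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
c = np.array([5.0, 6.0])

fused = {m: fuse([a, b, c], m)
         for m in ("summation", "average", "hadamard", "concatenation")}
```

Only concatenation changes the output dimension: it keeps every input component separately, whereas the other three collapse the inputs into a single `d`-dimensional vector.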
4.5.1. Summation Fusion
4.5.2. Concatenation
4.5.3. Average Fusion
4.5.4. Hadamard Product Fusion
4.6. Proposed Method
5. Experiments
5.1. Unimodal
5.1.1. Audio Modality
5.1.2. Video Modality
5.1.3. Text Modality
5.2. Multimodal
5.3. Ablation Study
5.4. Comparison with Existing Work
5.5. Error Analysis
6. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Beard, R.; Das, R.; Ng, R.W.; Gopalakrishnan, P.K.; Eerens, L.; Swietojanski, P.; Miksik, O. Multi-modal sequence fusion via recursive attention for emotion recognition. Proceedings of the 22nd conference on computational natural language learning, 2018, pp. 251–259.
2. Haque, R.; Islam, N.; Tasneem, M.; Das, A.K. Multi-class sentiment classification on Bengali social media comments using machine learning. International journal of cognitive computing in engineering 2023, 4, 21–35.
3. Islam, K.I.; Yuvraz, T.; Islam, M.S.; Hassan, E. Emonoba: A dataset for analyzing fine-grained emotions on noisy bangla texts. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2022, pp. 128–134.
4. Kabir, A.; Roy, A.; Taheri, Z. BEmoLexBERT: A Hybrid Model for Multilabel Textual Emotion Classification in Bangla by Combining Transformers with Lexicon Features. Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), 2023, pp. 56–61.
5. Das, A.; Sharif, O.; Hoque, M.M.; Sarker, I.H. Emotion classification in a resource constrained language using transformer-based approach. arXiv 2021. arXiv:2104.08613.
6. Iqbal, M.A.; Das, A.; Sharif, O.; Hoque, M.M.; Sarker, I.H. Bemoc: A corpus for identifying emotion in bengali texts. SN Computer Science 2022, 3, 135.
7. Rahman, M.; Talukder, M.R.A.; Setu, L.A.; Das, A.K. A dynamic strategy for classifying sentiment from Bengali text by utilizing Word2vector model. Journal of Information Technology Research (JITR) 2022, 15, 1–17.
8. Mia, M.; Das, P.; Habib, A. Verse-Based Emotion Analysis of Bengali Music from Lyrics Using Machine Learning and Neural Network Classifiers. International Journal of Computing and Digital Systems 2024, 15, 359–370.
9. Parvin, T.; Sharif, O.; Hoque, M.M. Multi-class textual emotion categorization using ensemble of convolutional and recurrent neural network. SN Computer Science 2022, 3, 62.
10. Sultana, S.; Iqbal, M.Z.; Selim, M.R.; Rashid, M.M.; Rahman, M.S. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks. IEEE Access 2021, 10, 564–578.
11. Dhar, P.; Guha, S. A system to predict emotion from Bengali speech. Int. J. Math. Sci. Comput 2021, 7, 26–35.
12. Nahin, A.S.M.; Roza, I.I.; Nishat, T.T.; Sumya, A.; Bhuiyan, H.; Hoque, M.M. Bengali Hateful Memes Detection: A Comprehensive Dataset and Deep Learning Approach. 2024 International Conference on Advances in Computing, Communication, Electrical, and Smart Systems (iCACCESS). IEEE, 2024, pp. 1–6.
13. Ghosh, S.; Ramaneswaran, S.; Tyagi, U.; Srivastava, H.; Lepcha, S.; Sakshi, S.; Manocha, D. M-MELD: A Multilingual Multi-Party Dataset for Emotion Recognition in Conversations. arXiv 2022. arXiv:2203.16799.
14. Hu, G.; Lin, T.E.; Zhao, Y.; Lu, G.; Wu, Y.; Li, Y. Unimse: Towards unified multimodal sentiment analysis and emotion recognition. arXiv 2022. arXiv:2211.11256.
15. Zhao, J.; Dong, W.; Shi, L.; Qiang, W.; Kuang, Z.; Xu, D.; An, T. Multimodal Feature Fusion Method for Unbalanced Sample Data in Social Network Public Opinion. Sensors 2022, 22.
16. Hosseini, S.S.; Yamaghani, M.R.; Poorzaker Arabani, S. Multimodal modelling of human emotion using sound, image and text fusion. Signal, Image and Video Processing 2024, 18, 71–79.
17. Shayaninasab, M.; Babaali, B. Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers. arXiv 2024. arXiv:2402.07327.
18. Mamieva, D.; Abdusalomov, A.B.; Kutlimuratov, A.; Muminov, B.; Whangbo, T.K. Multimodal Emotion Detection via Attention-Based Fusion of Extracted Facial and Speech Features. Sensors 2023, 23.
19. Zhang, Z.; Zhang, S.; Ni, D.; Wei, Z.; Yang, K.; Jin, S.; Huang, G.; Liang, Z.; Zhang, L.; Li, L.; Ding, H.; Zhang, Z.; Wang, J. Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data. Sensors 2024, 24.
20. Taheri, Z.S.; Roy, A.C.; Kabir, A. BEmoFusionNet: A Deep Learning Approach For Multimodal Emotion Classification in Bangla Social Media Posts. 2023 26th International Conference on Computer and Information Technology (ICCIT). IEEE, 2023, pp. 1–6.
21. Hossain, E.; Sharif, O.; Hoque, M.M. Mute: A multimodal dataset for detecting hateful memes. Proceedings of the 2nd conference of the asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing: student research workshop, 2022, pp. 32–39.
22. Ahsan, S.; Hossain, E.; Sharif, O.; Das, A.; Hoque, M.M.; Dewan, M. A Multimodal Framework to Detect Target Aware Aggression in Memes. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 2487–2500.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
24. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A.C.; Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 2015, 115, 211–252.
25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020. arXiv:2010.11929.
26. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
27. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
28. Baltrušaitis, T.; Robinson, P.; Morency, L.P. OpenFace: An open source facial behavior analysis toolkit. 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1–10.
29. Sen Sarma, M.; Das, A. BMGC: A deep learning approach to classify Bengali music genres. Proceedings of the 4th International Conference on Networking, Information Systems & Security, 2021, pp. 1–6.
30. Google. YamNet: Pretrained model for audio event detection, 2020. Accessed: 2024-06-22.
31. Amiriparian, S.; Gerczuk, M.; Ottl, S.; Cummins, N.; Freitag, M.; Pugachevskiy, S.; Baird, A.; Schuller, B. Snore Sound Classification Using Image-Based Deep Spectrum Features. Interspeech 2017. ISCA, 2017, pp. 3512–3516.
32. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013. arXiv:1301.3781.
33. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2; Curran Associates Inc.: Red Hook, NY, USA, 2013; NIPS’13, pp. 3111–3119.
34. Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.zip: Compressing text classification models. arXiv 2016. arXiv:1612.03651.
35. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 2017, 5, 135–146.
36. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. arXiv:1810.04805.
37. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019. arXiv:1907.11692.
38. Sarker, S. BanglaBERT: Bengali Mask Language Model for Bengali Language Understading, 2020.
39. Bhattacharjee, A.; Hasan, T.; Samin, K.; Islam, M.S.; Rahman, M.S.; Iqbal, A.; Shahriyar, R. BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding. CoRR 2021. arXiv:2101.00204.
40. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018. arXiv:1810.04805.





| Class | Train | Validation | Test |
|---|---|---|---|
| Anger | 221 | 48 | 48 |
| Fear | 70 | 15 | 15 |
| Joy | 191 | 41 | 41 |
| Sadness | 218 | 47 | 47 |
| Total | 700 | 151 | 151 |

| Class | Min Duration (sec) | Max Duration (sec) | Mean Duration (sec) | Total Duration (sec) |
|---|---|---|---|---|
| Anger | 1.022 | 7.002 | 3.210 | 950.071 |
| Fear | 2.001 | 14.008 | 5.491 | 549.061 |
| Joy | 0.952 | 9.015 | 2.937 | 757.709 |
| Sadness | 1.207 | 7.012 | 3.391 | 986.822 |

| Class | Max Frame Rate (fps) | Min Frame Rate (fps) | Max Resolution | Min Resolution | Max File Size (KB) | Min File Size (KB) |
|---|---|---|---|---|---|---|
| Anger | 30.0 | 24.00 | 1920x1080 | 208x210 | 10,075.15 | 141.32 |
| Fear | 30.0 | 23.98 | 2560x1440 | 450x360 | 6,845.49 | 99.54 |
| Joy | 30.0 | 23.98 | 1920x1080 | 206x174 | 9,975.87 | 102.99 |
| Sadness | 30.0 | 24.00 | 1920x1080 | 176x162 | 11,053.24 | 91.93 |

| Class | Total Words | Total Sentences | Average Word Length | Average Sentence Length | Lexical Diversity |
|---|---|---|---|---|---|
| Anger | 2395 | 88 | 4.356 | 27.216 | 0.509 |
| Fear | 996 | 34 | 4.070 | 29.294 | 0.495 |
| Joy | 1819 | 54 | 4.318 | 33.685 | 0.537 |
| Sadness | 1971 | 31 | 4.337 | 63.581 | 0.464 |
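The text statistics in the table above can be reproduced along the following lines. This is a sketch under stated assumptions: whitespace tokenization, average sentence length measured in words per sentence (which matches the reported totals), and lexical diversity taken as the type-token ratio (unique words divided by total words). The paper's exact definitions are not given in the text, and the toy sentences are hypothetical transliterated tokens.

```python
def text_statistics(sentences):
    """Corpus statistics for a list of sentence strings.

    Assumptions: whitespace tokenization; lexical diversity is the
    type-token ratio (unique words / total words).
    """
    tokenized = [s.split() for s in sentences if s.strip()]
    words = [w for sent in tokenized for w in sent]
    if not words:
        raise ValueError("need at least one non-empty sentence")
    return {
        "total_words": len(words),
        "total_sentences": len(tokenized),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(tokenized),
        "lexical_diversity": len(set(words)) / len(words),
    }

# Hypothetical toy input, not taken from the dataset.
stats = text_statistics(["ami khub khushi", "ami khushi"])

# Consistency check against the reported totals: words / sentences per class.
reported = {"Anger": (2395, 88), "Fear": (996, 34),
            "Joy": (1819, 54), "Sadness": (1971, 31)}
avg_sentence_length = {c: round(w / s, 3) for c, (w, s) in reported.items()}
```

Dividing the reported word totals by the sentence totals gives 27.216, 29.294, 33.685, and 63.581, which agrees with the Average Sentence Length column.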

| Fusion Technique for Cross-Modal Attended Features | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Summation | 0.60 | 0.59 | 0.59 | 0.60 |
| Averaging | 0.56 | 0.57 | 0.55 | 0.56 |
| Concatenation | 0.65 | 0.64 | 0.64 | 0.64 |
| Hadamard Product | 0.62 | 0.61 | 0.62 | 0.63 |

| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Anger | 0.70 | 0.69 | 0.69 |
| Sadness | 0.63 | 0.53 | 0.58 |
| Joy | 0.70 | 0.73 | 0.71 |
| Fear | 0.42 | 0.53 | 0.47 |
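The class-wise scores above can be cross-checked with a short calculation. The sketch recomputes each F1-score as the harmonic mean of precision and recall, then averages them. It assumes that the per-class results come from the test split, so the supports are the test counts from the split table (48, 47, 41, 15), and that the overall F1 reported elsewhere is a support-weighted average; neither assumption is stated in the text.

```python
# (precision, recall, assumed test support) per class, copied from the tables.
per_class = {
    "Anger":   (0.70, 0.69, 48),
    "Sadness": (0.63, 0.53, 47),
    "Joy":     (0.70, 0.73, 41),
    "Fear":    (0.42, 0.53, 15),
}

def f1(precision, recall):
    # Harmonic mean of precision and recall; 0 when both are 0.
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

f1_scores = {c: f1(p, r) for c, (p, r, _) in per_class.items()}

n_samples = sum(s for _, _, s in per_class.values())
weighted_f1 = sum(f1_scores[c] * s for c, (_, _, s) in per_class.items()) / n_samples
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print({c: round(v, 2) for c, v in f1_scores.items()})
print(round(weighted_f1, 2), round(macro_f1, 2))
```

Under these assumptions the recomputed per-class values round to the reported 0.69, 0.58, 0.71, and 0.47. The weighted average comes to about 0.64, consistent with the F1-score reported for concatenation fusion, while the unweighted macro average is lower (about 0.61) because the small Fear class scores worst.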

| Cross-Modal Attention Strategy | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Att_AV + Att_VT + Att_AT | 0.63 | 0.63 | 0.62 | 0.63 |
| Att_VA + Att_TV + Att_TA | 0.62 | 0.61 | 0.62 | 0.62 |
| Att_AV + Att_VT + Att_AT + Att_VA + Att_TV + Att_TA | 0.65 | 0.64 | 0.64 | 0.64 |

| Multimodal Feature Extraction | Classifier | F1-score |
|---|---|---|
| Wav2vec 2.0 + VideoMAE + BERT [17] | SVM | 0.59 |
| (Handcrafted audio features + CNN-LSTM) + Inception-ResNet-v2 + Word2Vec [16] | ANN | 0.57 |
| YAMNet + DeepFace + Bangla BERT-2 (Proposed) | ANN | 0.64 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).