Version 1: Received: 23 April 2024 / Approved: 24 April 2024 / Online: 24 April 2024 (08:29:13 CEST)
How to cite:
Makhmudov, F.; Kultimuratov, A.; Cho, Y. Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures. Preprints 2024, 2024041574. https://doi.org/10.20944/preprints202404.1574.v1
APA Style
Makhmudov, F., Kultimuratov, A., & Cho, Y. (2024). Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures. Preprints. https://doi.org/10.20944/preprints202404.1574.v1
Chicago/Turabian Style
Makhmudov, F., Alpamis Kultimuratov, and Young-Im Cho. 2024. "Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures." Preprints. https://doi.org/10.20944/preprints202404.1574.v1
Abstract
Emotion detection plays a significant role in human-computer interaction by deepening user engagement. Integrating this capability paves the way for AI systems that combine cognitive and emotional understanding, bridging the divide between machine functionality and human emotional complexity. Such progress can reshape how machines perceive and respond to human emotions, enabling more empathetic and intuitive artificial systems. This paper introduces a novel approach to multimodal emotion recognition that integrates speech and text modalities to infer emotional states. A CNN analyzes the speech signal through Mel spectrograms, while a BERT-based model processes the textual component, exploiting its bidirectional layers for deep semantic comprehension. The outputs of the two modalities are combined through an attention-based fusion mechanism that learns to weight their contributions. The proposed method is evaluated on two distinct datasets: Carnegie Mellon University's Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset and the Multimodal EmotionLines Dataset (MELD). The results show superior performance over existing frameworks: an accuracy of 88.4% and an F1-score of 87.9% on CMU-MOSEI, and a weighted accuracy (WA) of 67.81% and a weighted F1 (WF1) score of 66.32% on MELD. Beyond precise emotion detection, the system contributes several concrete advances to the field.
Keywords
Deep learning; CNN; BERT; Emotion recognition
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.