Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures

Version 1: Received: 23 April 2024 / Approved: 24 April 2024 / Online: 24 April 2024 (08:29:13 CEST)

How to cite: Makhmudov, F.; Kultimuratov, A.; Cho, Y. Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures. Preprints 2024, 2024041574. https://doi.org/10.20944/preprints202404.1574.v1

Abstract

Emotion detection plays a key role in human-computer interaction, deepening the engagement between users and machines. Integrating this capability paves the way for AI systems that combine cognitive and emotional understanding, bridging the divide between machine functionality and human emotional complexity. This progress has the potential to reshape how machines perceive and respond to human emotions, ushering in an era of more empathetic and intuitive artificial systems. This paper introduces a novel approach to multimodal emotion recognition that integrates speech and text modalities to infer emotional states. Convolutional neural networks (CNNs) analyze the speech signal via Mel spectrograms, while a BERT-based model processes the textual component, leveraging its bidirectional layers for deep semantic comprehension. The outputs of the two modalities are combined through an attention-based fusion mechanism that optimally weighs their contributions. The proposed method is evaluated on two distinct datasets: Carnegie Mellon University's Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset and the Multimodal EmotionLines Dataset (MELD). The results demonstrate superior efficacy compared to existing frameworks, with an accuracy of 88.4% and an F1-score of 87.9% on CMU-MOSEI, and a weighted accuracy (WA) of 67.81% and a weighted F1 (WF1) score of 66.32% on MELD. Beyond precise emotion detection, the system introduces several significant advancements in the field.
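To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea in the abstract, not the authors' implementation: a small CNN encodes a Mel-spectrogram speech input, a projected BERT sentence embedding (e.g., the [CLS] vector from bert-base-uncased) represents the text, and an attention layer softmax-weights the two modality vectors before classification. All layer sizes, the shared hidden dimension, and the class count (7, matching MELD's emotion labels) are illustrative assumptions.

```python
# Hedged sketch of attention-based fusion of CNN speech features and
# BERT text features; sizes and names are assumptions, not the paper's spec.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Scores each modality embedding and mixes them with softmax weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, speech_vec: torch.Tensor, text_vec: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack([speech_vec, text_vec], dim=1)  # (B, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=1)   # (B, 2, 1)
        return (weights * stacked).sum(dim=1)                 # (B, dim)


class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, text_dim: int = 768, hidden: int = 256, n_classes: int = 7):
        super().__init__()
        # Speech branch: small 2-D CNN over the Mel spectrogram (1 channel).
        self.speech_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden),
        )
        # Text branch: project a precomputed BERT [CLS] embedding
        # to the shared hidden size.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.fusion = AttentionFusion(hidden)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, mel: torch.Tensor, cls_emb: torch.Tensor) -> torch.Tensor:
        s = self.speech_cnn(mel)     # (B, hidden)
        t = self.text_proj(cls_emb)  # (B, hidden)
        return self.classifier(self.fusion(s, t))


# Usage: batch of 4 utterances, 64 Mel bands x 128 frames, 768-d BERT embeddings.
model = MultimodalEmotionClassifier()
logits = model(torch.randn(4, 1, 64, 128), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 7])
```

The design choice worth noting is that the fusion layer learns per-sample weights over the modalities, so utterances where prosody is more informative than wording (or vice versa) can lean on the stronger signal rather than a fixed concatenation.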

Keywords

deep learning; CNN; BERT; emotion recognition

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
