Submitted: 16 July 2025
Posted: 17 July 2025
Abstract

Keywords:
1. Introduction
2. Related Works
2.1. IoT-Based Emotion Recognition Systems
2.2. Multimodal Fusion Techniques
2.3. Temporal Modeling in Emotion Recognition
2.4. Edge Computing for Emotion Recognition
3. Methodology
3.1. System Architecture Overview
3.2. Multi-Scale Temporal Feature Extraction

3.3. Adaptive Fusion Mechanism
3.4. Edge Computing Optimization


3.5. Real-Time Emotion Classification
3.6. Privacy and Security Considerations
4. Experimental Setup
4.1. Datasets and Data Preparation
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Baseline Comparisons
5. Results and Analysis
5.1. Overall Performance Comparison
| Method | MELD Acc. (%) | DEAP MAE (V/A) | G-REx Acc. (V/A, %) | Avg. Acc. (%) | Latency (ms) |
|---|---|---|---|---|---|
| CNN-Face | 73.2 | 0.142/0.156 | 68.4/71.2 | 71.3 | 45 |
| LSTM-Speech | 71.8 | 0.168/0.174 | 65.7/69.3 | 69.0 | 52 |
| SVM-Physio | 69.4 | 0.139/0.145 | 72.1/75.8 | 72.4 | 12 |
| Early Fusion | 82.1 | 0.124/0.131 | 76.5/78.9 | 79.1 | 156 |
| Late Fusion | 84.6 | 0.118/0.125 | 79.2/81.7 | 81.8 | 178 |
| Attention Fusion | 87.4 | 0.113/0.119 | 82.5/85.4 | 85.1 | 203 |
| EmotionTFN (Proposed) | 94.2 | 0.087/0.094 | 89.7/91.3 | 91.8 | 187 |
| Dataset | Modalities | Participants | Samples | Duration | Emotion Labels | IoT Relevance |
|---|---|---|---|---|---|---|
| MELD | Audio, Video, Text | 1,433 | 13,708 | ~24 hours | 7 discrete | Conversational |
| DEAP | EEG(32ch), PPG, GSR | 32 | 1,280 | 63 minutes | 4 dimensional | Physiological |
| G-REx | EDA, PPG, ACC | 73 | 1,168 | 32 minutes | 2 dimensional | Wearable sensors |
| Platform | CPU | Memory (GB) | GPU | Latency (ms) | Throughput (FPS) | Power (W) | Energy (J/inf) |
|---|---|---|---|---|---|---|---|
| RTX 3090 | - | 4.2 | 10496 CUDA | 23 | 43.5 | 320 | 7.36 |
| Jetson Xavier NX | ARM64 | 2.1 | 384 CUDA | 187 | 5.3 | 15 | 2.81 |
| Raspberry Pi 4 | ARM64 | 1.8 | - | 298 | 3.4 | 8 | 2.68 |
| Jetson Nano | ARM64 | 1.2 | 128 CUDA | 445 | 2.2 | 5 | 2.23 |
5.2. Ablation Study
| Configuration | MELD Acc (%) | G-REx Val (%) | G-REx Aro (%) | DEAP Val (MAE) | DEAP Aro (MAE) | Latency (ms) |
|---|---|---|---|---|---|---|
| Full EmotionTFN | 94.2 | 89.7 | 91.3 | 0.087 | 0.094 | 187 |
| w/o Multi-scale | 85.9 | 83.0 | 85.6 | 0.121 | 0.127 | 145 |
| w/o Adaptive Fusion | 89.8 | 85.5 | 87.1 | 0.098 | 0.104 | 156 |
| w/o Cross-modal Att | 91.4 | 87.2 | 89.5 | 0.092 | 0.098 | 168 |
| w/o Edge Optimization | 93.8 | 89.1 | 90.7 | 0.089 | 0.095 | 312 |
| Fixed Windows | 88.1 | 84.3 | 86.8 | 0.105 | 0.112 | 163 |
| Single Modality Best | 76.5 | 75.2 | 78.1 | 0.135 | 0.142 | 89 |
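The ablation rows are easiest to compare as differences from the full model. The following minimal sketch uses only the MELD accuracy and latency columns copied from the table above. The names `ABLATION` and `ablation_deltas` are illustrative and are not part of the authors' code.

```python
# MELD accuracy (%) and latency (ms) copied from the ablation table above.
ABLATION = {
    "Full EmotionTFN": (94.2, 187),
    "w/o Multi-scale": (85.9, 145),
    "w/o Adaptive Fusion": (89.8, 156),
    "w/o Cross-modal Att": (91.4, 168),
    "w/o Edge Optimization": (93.8, 312),
    "Fixed Windows": (88.1, 163),
    "Single Modality Best": (76.5, 89),
}


def ablation_deltas(table, reference="Full EmotionTFN"):
    """Return {config: (accuracy drop in points, latency change in ms)}
    for every configuration, measured against the reference row."""
    ref_acc, ref_lat = table[reference]
    return {
        name: (round(ref_acc - acc, 1), lat - ref_lat)
        for name, (acc, lat) in table.items()
        if name != reference
    }


if __name__ == "__main__":
    # Largest accuracy drop first.
    ranked = sorted(ablation_deltas(ABLATION).items(), key=lambda kv: -kv[1][0])
    for name, (drop, dlat) in ranked:
        print(f"{name:<24} -{drop:.1f} pts  {dlat:+d} ms")
```

On these numbers, removing the multi-scale component costs 8.3 points of MELD accuracy, while removing edge optimization costs 0.4 points but adds 125 ms of latency.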
5.3. Computational Efficiency Analysis
| Hardware Platform | Latency (ms) | Memory (GB) | Power (W) | Throughput (FPS) | Energy per Inference (J) |
|---|---|---|---|---|---|
| NVIDIA RTX 3090 | 23 | 4.2 | 320 | 43.5 | 7.36 |
| Jetson Xavier NX | 187 | 2.1 | 15 | 5.3 | 2.81 |
| Raspberry Pi 4 | 298 | 1.8 | 8 | 3.4 | 2.68 |
| Jetson Nano | 445 | 1.2 | 5 | 2.2 | 2.23 |
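The throughput and energy columns appear to follow from latency and power. The sketch below assumes one inference at a time and constant power draw, so that throughput is 1000 / latency (ms) and energy is power × latency. These relations are an assumption inferred from the tabulated values, not a formula stated by the authors, and `derived_metrics` is a hypothetical helper name.

```python
# (latency ms, power W, reported throughput FPS, reported energy J)
# copied from the hardware table above.
PLATFORMS = {
    "NVIDIA RTX 3090": (23, 320, 43.5, 7.36),
    "Jetson Xavier NX": (187, 15, 5.3, 2.81),
    "Raspberry Pi 4": (298, 8, 3.4, 2.68),
    "Jetson Nano": (445, 5, 2.2, 2.23),
}


def derived_metrics(latency_ms, power_w):
    """Assumed relations: sequential inference, constant power draw."""
    throughput_fps = 1000.0 / latency_ms          # inferences per second
    energy_j = power_w * latency_ms / 1000.0      # watts x seconds = joules
    return throughput_fps, energy_j


if __name__ == "__main__":
    for name, (lat, pw, fps_rep, e_rep) in PLATFORMS.items():
        fps, energy = derived_metrics(lat, pw)
        print(f"{name:<18} {fps:5.2f} FPS (reported {fps_rep})  "
              f"{energy:5.2f} J (reported {e_rep})")
```

Under this assumption the derived values match the table to rounding for three platforms. The Raspberry Pi 4 row is the exception: the product comes to about 2.38 J, against the tabulated 2.68 J.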
5.4. Multi-Scale Temporal Analysis
| Emotion | Short-term (0.5-2s) | Medium-term (2-10s) | Long-term (10-60s) |
|---|---|---|---|
| Anger | 0.42 | 0.38 | 0.20 |
| Disgust | 0.38 | 0.35 | 0.27 |
| Fear | 0.45 | 0.33 | 0.22 |
| Happiness | 0.32 | 0.41 | 0.27 |
| Neutral | 0.28 | 0.35 | 0.37 |
| Sadness | 0.25 | 0.38 | 0.37 |
| Surprise | 0.51 | 0.31 | 0.18 |
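Each row of the table sums to 1.00, which suggests the values are normalized per-emotion weights over the three temporal scales. That reading is an assumption, since the table itself does not name the quantity. Under it, a short sketch can check the normalization and read off the dominant scale for each emotion. `WEIGHTS`, `is_normalized`, and `dominant_scale` are illustrative names.

```python
SCALES = ("short-term", "medium-term", "long-term")

# Per-emotion values copied from the table above (short, medium, long).
WEIGHTS = {
    "Anger": (0.42, 0.38, 0.20),
    "Disgust": (0.38, 0.35, 0.27),
    "Fear": (0.45, 0.33, 0.22),
    "Happiness": (0.32, 0.41, 0.27),
    "Neutral": (0.28, 0.35, 0.37),
    "Sadness": (0.25, 0.38, 0.37),
    "Surprise": (0.51, 0.31, 0.18),
}


def is_normalized(weights, tol=1e-6):
    """True if the weights sum to 1 within a small tolerance."""
    return abs(sum(weights) - 1.0) <= tol


def dominant_scale(weights):
    """Name of the temporal scale with the largest weight."""
    best = max(range(len(SCALES)), key=lambda i: weights[i])
    return SCALES[best]


if __name__ == "__main__":
    for emotion, w in WEIGHTS.items():
        print(f"{emotion:<10} normalized={is_normalized(w)}  dominant={dominant_scale(w)}")
```

Read this way, anger, disgust, fear, and surprise lean on the short-term window, happiness and sadness on the medium-term window, and only the neutral class is dominated by the long-term window.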
5.5. Robustness and Generalization Analysis
| Condition | Accuracy Drop (%) | Latency Impact (ms) | Recovery Time (s) |
|---|---|---|---|
| Low Light (50 lux) | 2.1 | +12 | 0.8 |
| High Noise (80 dB) | 2.8 | +8 | 1.2 |
| User Movement | 1.9 | +15 | 0.5 |
| Single Sensor Failure | 2.1-4.7 | -23 | 0.3 |
| Two Sensor Failures | 6.2-9.1 | -45 | 0.7 |
| Network Latency | 0.3 | +34 | 2.1 |
5.6. Real-World Deployment Case Study
| Metric | Week 1 | Week 2 | Week 3 | Week 4 | Average |
|---|---|---|---|---|---|
| System Uptime (%) | 95.2 | 97.8 | 98.1 | 97.6 | 97.2 |
| Accuracy (%) | 91.3 | 92.1 | 92.8 | 92.4 | 92.2 |
| Avg. Latency (ms) | 215 | 208 | 203 | 198 | 206 |
| User Satisfaction (1-5) | 3.8 | 4.1 | 4.3 | 4.2 | 4.1 |
| Privacy Concerns (1-5) | 2.1 | 1.8 | 1.6 | 1.7 | 1.8 |
6. Discussion
6.1. Technical Contributions and Significance
6.2. Implications for IoT Emotion Recognition
6.3. Limitations and Future Directions
6.4. Broader Impact and Ethical Considerations
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).