Submitted: 01 September 2025
Posted: 03 September 2025
Abstract
Keywords:
1. Introduction
2. Problem Statement and Research Questions
3. Literature Review
4. Theoretical and Computational Framework
5. Proposed Methodology
- Early Fusion: Raw biometric and visual signals will be combined prior to feature extraction, following approaches tested in multimodal affect detection (Tellamekala et al., 2023; Wang et al., 2022).
- Late Fusion: Independent unimodal models will be trained and merged at the decision stage, leveraging ensemble and evidential methods (El-Din et al., 2023; Liang et al., 2025).
- Hybrid Fusion: A shared feature space with cross-attention mechanisms will be implemented to dynamically integrate complementary cues, drawing on advances in interpretable fusion frameworks (Mansouri-Benssassi & Ye, 2021; Shakhovska et al., 2024; Zhao et al., 2021); a minimal sketch of all three strategies appears at the end of this section.
- Facial micro-expressions via video capture (Ekman & Friesen, 1978; Huang et al., 2023).
- Biometric signals, including heart rate variability, galvanic skin response, and skin temperature (Pessanha & Salah, 2021; Beatton et al., 2024; Mattern et al., 2023); a feature-extraction sketch appears at the end of this section.
- Self-report surveys to assess empathy, trust, and satisfaction (Fang et al., 2023; Harris et al., 2023).
- System logs capturing adaptive responses, latency, and feedback timing (Shore et al., 2023).
- Quantitative Analysis: Classification metrics (accuracy, precision, recall, F1-score), compared across conditions with ANOVA, will test the performance of early, late, and hybrid fusion approaches under real-time constraints (Wu et al., 2023; Hassan et al., 2025); see the evaluation sketch at the end of this section.
- Qualitative Analysis: Thematic coding of user reflections will assess perceived empathy, effectiveness, and trust (Niebuhr & Valls-Ratés, 2024; Rossing et al., 2024).
- Computational Trade-offs: The study will measure efficiency, interpretability, and scalability of different fusion strategies in dynamic environments (Bian et al., 2023; Bose et al., 2023; Liao et al., 2025).
- Early Fusion – raw biometric + visual inputs combined before feature extraction (Tellamekala et al., 2023; Wang et al., 2022).
- Late Fusion – unimodal outputs merged at decision level (Liang et al., 2025; El-Din et al., 2023).
- Hybrid Fusion – feature-level concatenation with cross-attention for adaptive weighting (Mansouri-Benssassi & Ye, 2021; Song et al., 2024; Wu et al., 2023).
- Evaluation:
  - Data: facial video (micro-expressions) + physiological signals (HRV, GSR, temperature).
  - Benchmarks: accuracy, F1, latency, interpretability across datasets (DEAP, MAHNOB-HCI).
  - Metrics: precision, recall, computational efficiency, robustness across populations (Gupta et al., 2024; Hassan et al., 2025).
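To make the three fusion strategies listed above concrete, the following minimal PyTorch sketch contrasts early, late, and hybrid (cross-attention) fusion for a pooled facial-feature vector and a physiological-feature vector. The module names, feature dimensions, and two-class output are illustrative assumptions rather than components of the proposed system.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate low-level features from both modalities before any joint encoding."""
    def __init__(self, face_dim=128, bio_dim=16, n_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(face_dim + bio_dim, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, face, bio):
        return self.classifier(torch.cat([face, bio], dim=-1))

class LateFusion(nn.Module):
    """Independent unimodal heads merged at the decision level (here, averaged logits)."""
    def __init__(self, face_dim=128, bio_dim=16, n_classes=2):
        super().__init__()
        self.face_head = nn.Linear(face_dim, n_classes)
        self.bio_head = nn.Linear(bio_dim, n_classes)

    def forward(self, face, bio):
        return 0.5 * (self.face_head(face) + self.bio_head(bio))

class HybridFusion(nn.Module):
    """Project both modalities into a shared space; cross-attention weights complementary cues."""
    def __init__(self, face_dim=128, bio_dim=16, d_model=64, n_classes=2):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, d_model)
        self.bio_proj = nn.Linear(bio_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, face, bio):
        # Treat each modality as a single token; facial queries attend to biometric keys/values.
        q = self.face_proj(face).unsqueeze(1)    # (batch, 1, d_model)
        kv = self.bio_proj(bio).unsqueeze(1)     # (batch, 1, d_model)
        fused, _ = self.cross_attn(q, kv, kv)
        return self.classifier(fused.squeeze(1))

if __name__ == "__main__":
    face = torch.randn(8, 128)  # e.g., pooled facial-expression embeddings
    bio = torch.randn(8, 16)    # e.g., HRV/GSR/temperature feature vector
    for model in (EarlyFusion(), LateFusion(), HybridFusion()):
        print(type(model).__name__, model(face, bio).shape)  # torch.Size([8, 2])
```

In practice, early fusion would concatenate temporally aligned raw or low-level streams before feature extraction, whereas the hybrid variant above keeps modality-specific projections and learns adaptive weighting through attention.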
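For the physiological stream, a small NumPy sketch below illustrates the kind of window-level features that could be derived from inter-beat (RR) intervals and skin conductance. The window length, the RMSSD/SDNN choices, and the crude tonic/phasic split are assumptions for illustration only.

```python
import numpy as np

def hrv_features(rr_intervals_ms: np.ndarray) -> dict:
    """Time-domain HRV features from one window of RR intervals (milliseconds)."""
    diffs = np.diff(rr_intervals_ms)
    return {
        "mean_rr": float(np.mean(rr_intervals_ms)),
        "sdnn": float(np.std(rr_intervals_ms, ddof=1)),    # overall variability
        "rmssd": float(np.sqrt(np.mean(diffs ** 2))),      # short-term (vagal) variability
    }

def gsr_features(conductance_us: np.ndarray, fs: float = 4.0) -> dict:
    """Crude tonic/phasic summary of one galvanic skin response window (microsiemens)."""
    tonic = np.percentile(conductance_us, 10)              # proxy for baseline level
    phasic = conductance_us - tonic
    peaks = ((phasic[1:-1] > phasic[:-2]) &
             (phasic[1:-1] > phasic[2:]) &
             (phasic[1:-1] > 0.05))                        # simple peak count above 0.05 uS
    return {
        "tonic_level": float(tonic),
        "phasic_peaks": int(np.sum(peaks)),
        "duration_s": len(conductance_us) / fs,
    }

# Example with synthetic data for one 60-second window.
rr = np.random.normal(800, 40, size=75)    # ~75 beats at ~75 bpm
gsr = 2.0 + 0.1 * np.random.randn(240)     # 60 s of conductance sampled at 4 Hz
print(hrv_features(rr))
print(gsr_features(gsr))
```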
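For the quantitative comparison, a short scikit-learn/SciPy sketch shows how per-fold accuracy, precision, recall, and F1 could be computed and how a one-way ANOVA could compare F1 across fusion strategies. The fold scores and predictions below are placeholders, not results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from scipy.stats import f_oneway

def fold_metrics(y_true, y_pred):
    """Accuracy and macro-averaged precision/recall/F1 for one cross-validation fold."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {"accuracy": accuracy_score(y_true, y_pred), "precision": p, "recall": r, "f1": f1}

# Illustrative per-fold F1 scores for the three fusion strategies (placeholder numbers).
early_f1 = [0.71, 0.69, 0.73, 0.70, 0.72]
late_f1 = [0.74, 0.72, 0.75, 0.73, 0.74]
hybrid_f1 = [0.78, 0.76, 0.79, 0.77, 0.78]

# One-way ANOVA testing whether mean F1 differs across fusion strategies.
stat, p_value = f_oneway(early_f1, late_f1, hybrid_f1)
print(f"F = {stat:.2f}, p = {p_value:.4f}")

# Per-fold metric computation on dummy predictions.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
print(fold_metrics(y_true, y_pred))
```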
6. Expected Contribution
- Scientific Contributions
- Development of novel fusion architectures (early, late, and hybrid) to integrate facial expressions with biometric signals for robust multimodal affect recognition.
- Comparative evaluation of fusion strategies under real-time constraints, addressing performance, accuracy, and scalability challenges in dynamic environments.
- Establishment of a computational framework linking multimodal affect recognition to adaptive decision-making, thereby deepening the scientific understanding of embodied AI.
- Applied Contributions
- Design and testing of a prototype Embodied AI Coach capable of delivering real-time adaptive feedback informed by users’ affective and physiological states.
- Practical insights for deploying multimodal AI in education, coaching, and therapy, highlighting opportunities for more personalized and empathetic interventions.
- Development of ethical guidelines for the collection and use of facial and biometric data, supporting responsible innovation and safeguarding user trust.
7. Limitations and Future Research
- Cross-domain generalization: testing architectures in varied applied domains such as education, healthcare, and workplace coaching.
- Longitudinal evaluation: measuring how multimodal models adapt to user changes over time rather than in single-session trials.
- Neuro-inspired integration: leveraging bio-inspired computational models (Mansouri-Benssassi & Ye, 2021) to improve affective state modeling.
- Ethical frameworks: developing governance standards for multimodal AI systems to ensure responsible deployment in sensitive human-centered contexts.
References
- Afzal, S., Khan, H. A., Piran, M. J., & Lee, J. W. (2024). A comprehensive survey on affective computing: Challenges, trends, applications, and future directions. IEEE Access, 12, 96150. [CrossRef]
- Alazraki, L., Ghachem, A., Polydorou, N., Khosmood, F., & Edalat, A. (2021). An empathetic AI coach for self-attachment therapy. 2021 IEEE 3rd International Conference on Cognitive Machine Intelligence (CogMI), 78–85. [CrossRef]
- Ali, K., & Hughes, C. E. (2023). A unified transformer-based network for multimodal emotion recognition. TechRxiv. [CrossRef]
- Andrews, J. T. A., Zhao, D., Thong, W., Modas, A., Papakyriakopoulos, O., & Xiang, A. (2023). Ethical considerations for responsible data curation. arXiv. [CrossRef]
- Arsalan, A., Anwar, S. M., & Majid, M. (2022). Human stress assessment: A comprehensive review of methods using wearable sensors and non-wearable techniques. arXiv. [CrossRef]
- Atrey, P. K., Hossain, M. A., El Saddik, A., & Kankanhalli, M. S. (2010). Multimodal fusion for multimedia analysis: A survey. ACM Computing Surveys, 42(2), 1–37. [CrossRef]
- Awan, A. W., Usman, S. M., Khalid, S., Anwar, A., Alroobaea, R., Hussain, S., Almotiri, J., Ullah, S. S., & Akram, M. U. (2022). An ensemble learning method for emotion charting using multimodal physiological signals. Sensors, 22(23), 9480. [CrossRef]
- Barker, D., Tippireddy, M. K. R., Farhan, A., & Ahmed, B. (2025). Ethical considerations in emotion recognition research. Psychology International, 7(2), 43. [CrossRef]
- Barsalou, L. W. (2008). Grounded cognition. Annual Review of Psychology, 59, 617–645. [CrossRef]
- Bassi, G., Giuliano, C., Perinelli, A., Forti, S., Gabrielli, S., & Salcuni, S. (2021). A virtual coach (Motibot) for supporting healthy coping strategies among adults with diabetes: Proof-of-concept study. JMIR Human Factors, 9(1). [CrossRef]
- Bello, H., Marin, L. A. S., Suh, S., Zhou, B., & Lukowicz, P. (2023). InMyFace: Inertial and mechanomyography-based sensor fusion for wearable facial activity recognition. Information Fusion, 99, 101886. [CrossRef]
- Bian, Y., Küster, D., Liu, H., & Krumhuber, E. G. (2023). Understanding naturalistic facial expressions with deep learning and multimodal large language models. Sensors, 24(1), 126. [CrossRef]
- Bose, D., Hebbar, R., Somandepalli, K., & Narayanan, S. (2023). Contextually-rich human affect perception using multimodal scene information. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. [CrossRef]
- Bota, P., Wang, C., Fred, A., & Silva, H. (2020). Emotion assessment using feature fusion and decision fusion classification based on physiological data: Are we there yet? Sensors, 20(17), 4723. [CrossRef]
- Cacioppo, J. T., Tassinary, L. G., & Berntson, G. G. (2007). Handbook of psychophysiology (3rd ed.). Cambridge University Press. [CrossRef]
- Calvo, R. A., & D’Mello, S. K. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1), 18–37. [CrossRef]
- Chavan, V., Cenaj, A., Shen, S., Bar, A., Binwani, S., Del Becaro, T., Funk, M., Greschner, L., Hung, R., Klein, S., Kleiner, R., Krause, S., Olbrych, S., Parmar, V., Sarafraz, J., Soroko, D., Don, D. W., Zhou, C., Vu, H. T. D., … Fresquet, X. (2025). Feeling machines: Ethics, culture, and the rise of emotional AI. arXiv. [CrossRef]
- Chen, J., Wang, X., Huang, C., Hu, X., Shen, X., & Zhang, D. (2023). A large finer-grained affective computing EEG dataset. Scientific Data, 10(1). [CrossRef]
- Das, A., Sarma, M. S., Hoque, M. M., Siddique, N., & Dewan, M. A. A. (2024). AVaTER: Fusing audio, visual, and textual modalities using cross-modal attention for emotion recognition. Sensors, 24(18), 5862. [CrossRef]
- D’Mello, S., & Kory, J. (2015). A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys, 47(3), 1–36. [CrossRef]
- Dol, A., van Strien, T., Velthuijsen, H., van Gemert-Pijnen, J. E. W. C., & Bode, C. (2023). Preferences for coaching strategies in a personalized virtual coach for emotional eaters: An explorative study. Frontiers in Psychology, 14, 1260229. [CrossRef]
- Ekman, P., & Friesen, W. V. (1978). Facial Action Coding System: A technique for the measurement of facial movement. Consulting Psychologists Press.
- Fang, Y., Rong, R. M., & Huang, J. (2021). Hierarchical fusion of visual and physiological signals for emotion recognition. Multidimensional Systems and Signal Processing, 32(4), 1103–1124. [CrossRef]
- Gao, R., Liu, X., Xing, B., Yu, Z., Schuller, B. W., & Kälviäinen, H. (2024). Identity-free artificial emotional intelligence via micro-gesture understanding. arXiv. [CrossRef]
- Goleman, D. (1995). Emotional intelligence: Why it can matter more than IQ. Bantam Books.
- Gross, J. J. (1998). The emerging field of emotion regulation: An integrative review. Review of General Psychology, 2(3), 271–299. [CrossRef]
- Guntz, T. (2020). Estimating expertise from eye gaze and emotions. HAL. https://theses.hal.science/tel-03026375.
- Gupta, N., Priya, R. V., & Verma, C. K. (2024). ERFN: Leveraging context for enhanced emotion detection. International Journal of Advanced Computer Science and Applications, 15(6). [CrossRef]
- Habibi, R., Pfau, J., Holmes, J., & El-Nasr, M. S. (2023). Empathetic AI for empowering resilience in games. arXiv. [CrossRef]
- Hao, F., Zhang, H., Wang, B., Liao, L., Liu, Q., & Cambria, E. (2024). EmpathyEar: An open-source avatar multimodal empathetic chatbot. arXiv. [CrossRef]
- Hassan, A., Ahmad, S. G., Iqbal, T., Munir, E. U., Ayyub, K., & Ramzan, N. (2025). Enhanced model for gestational diabetes mellitus prediction using a fusion technique of multiple algorithms with explainability. International Journal of Computational Intelligence Systems, 18(1). [CrossRef]
- Hauke, G., Lohr-Berger, C., & Shafir, T. (2024). Emotional activation in a cognitive behavioral setting: Extending the tradition with embodiment. Frontiers in Psychology, 15, 1409373. [CrossRef]
- Hegde, K., & Jayalath, H. (2025). Emotions in the loop: A survey of affective computing for emotional support. arXiv. [CrossRef]
- Herbuela, V. R. D. M., & Nagai, Y. (2025). Realtime multimodal emotion estimation using behavioral and neurophysiological data. arXiv. [CrossRef]
- Huang, Z., Chiang, C., Chen, J., Chen, Y.-C., Chung, H.-L., Cai, Y., & Hsu, H.-C. (2023). A study on computer vision for facial emotion recognition. Scientific Reports, 13(1). [CrossRef]
- Islam, R., & Bae, S. W. (2024). Revolutionizing mental health support: An innovative affective mobile framework for dynamic, proactive, and context-adaptive conversational agents. arXiv. [CrossRef]
- Janhonen, J. (2023). Socialisation approach to AI value acquisition: Enabling flexible ethical navigation with built-in receptiveness to social influence. AI and Ethics. [CrossRef]
- Lakoff, G., & Johnson, M. (1999). Philosophy in the flesh: The embodied mind and its challenge to Western thought. Basic Books.
- Lee, S., & Kim, J.-H. (2023). Video multimodal emotion recognition system for real world applications. arXiv. [CrossRef]
- Li, Y., Sun, Q., Schlicher, M., Lim, Y. W., & Schuller, B. W. (2025). Artificial emotion: A survey of theories and debates on realising emotion in artificial intelligence. arXiv. [CrossRef]
- Liao, Y., Gao, Y., Wang, F., Zhang, L., Xu, Z., & Wu, Y. (2025). Emotion recognition with multiple physiological parameters based on ensemble learning. Scientific Reports, 15(1). [CrossRef]
- Mayer, J. D., & Salovey, P. (1997). What is emotional intelligence? In P. Salovey & D. Sluyter (Eds.), Emotional development and emotional intelligence: Educational implications (pp. 3–31). Basic Books.
- McKee, K. R., Bai, X., & Fiske, S. T. (2023). Humans perceive warmth and competence in artificial intelligence. iScience, 26(8), 107256. [CrossRef]
- Mehrabian, A. (1971). Silent messages. Wadsworth.
- Modi, K., & Devaraj, L. (2022). Advancements in biometric technology with artificial intelligence. arXiv. [CrossRef]
- Narimisaei, J., Naeim, M., Imannezhad, S., Samian, P., & Sobhani, M. (2024). Exploring emotional intelligence in AI systems: A comprehensive analysis of emotion recognition and response mechanisms. Annals of Medicine and Surgery, 86(8), 4657. [CrossRef]
- Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (pp. 689–696). Omnipress.
- Nyamathi, A., Dutt, N., Lee, J., Rahmani, A. M., Rasouli, M., Krogh, D., Krogh, E., Sultzer, D. L., Rashid, H., Liaqat, H., Jawad, R., Azhar, F., Ali, A., Qamar, B., Bhatti, T. Y., Khay, C., Ludlow, J., Gibbs, L., Rousseau, J., … Brunswicker, S. (2024). Establishing the foundations of emotional intelligence in care companion robots to mitigate agitation among high-risk patients with dementia: Protocol for an empathetic patient-robot interaction study. JMIR Research Protocols, 13, e55761. [CrossRef]
- Pan, L., & Liu, W. (2024). Adaptive language-interacted hyper-modality representation for multimodal sentiment analysis. International Journal of Advanced Computer Science and Applications, 15(7). [CrossRef]
- Picard, R. W. (1997). Affective computing. MIT Press. [CrossRef]
- Porges, S. W. (2011). The polyvagal theory: Neurophysiological foundations of emotions, attachment, communication, and self-regulation. W. W. Norton & Company.
- Poria, S., Cambria, E., Bajpai, R., & Hussain, A. (2017). A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37, 98–125. [CrossRef]
- Schröder, M., & Cowie, R. (2005). Issues in emotion-oriented computing. In R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. F. Papageorgiou, S. Kollias, & S. O. O’Donnell (Eds.), Emotion-Oriented Systems (pp. 3–18). Springer. [CrossRef]
- Spitale, M., Winkle, K., Barakova, E., & Güneş, H. (2024). Guest editorial: Special issue on embodied agents for wellbeing. International Journal of Social Robotics, 16(5), 833–838. [CrossRef]
- Suganya, R., Narmatha, M., & Kumar, S. V. (2024). An emotionally intelligent system for multimodal sentiment classification. Indian Journal of Science and Technology, 17(42), 4386–4396. [CrossRef]
- Tiwari, A., & Falk, T. H. (2019). Fusion of motif- and spectrum-related features for improved EEG-based emotion recognition. Computational Intelligence and Neuroscience, 2019, 1–15. [CrossRef]
- Vistorte, A. O. R., Deroncele-Acosta, Á., Ayala, J. L. M., Barrasa, Á., López-Granero, C., & Martí-González, M. (2024). Integrating artificial intelligence to assess emotions in learning environments: A systematic literature review. Frontiers in Psychology, 15, 1387089. [CrossRef]
- Wang, Y., Song, W., Tao, W., Liotta, A., Yang, D., Li, X., Gao, S., Sun, Y., Ge, W., Zhang, W., & Zhang, W. (2022). A systematic review on affective computing: Emotion models, databases, and recent advances. Information Fusion, 19, 29–55. [CrossRef]
- Yang, P., Liu, N., Liu, X., Shu, Y., Ji, W., Ren, Z., Sheng, J., Yu, M., Yi, R., Zhang, D., & Liu, Y. (2024). A multimodal dataset for mixed emotion recognition. Scientific Data, 11(1). [CrossRef]
- Zhang, L., Qian, Y., Arandjelović, O., & Zhu, A. (2023). Multimodal latent emotion recognition from micro-expression and physiological signals. arXiv. [CrossRef]
- Zhao, S., Jia, G., Yang, J., Ding, G., & Keutzer, K. (2021). Emotion recognition from multiple modalities: Fundamentals and methodologies. IEEE Signal Processing Magazine, 38(6), 59–73. [CrossRef]
- Zhi, H., Hong, H., Cai, X., Li, L., Ren, Z., Xiao, M., Jiang, H., & Wang, X. (2024). Skew-pair fusion theory: An interpretable multimodal fusion framework. Research Square. [CrossRef]
- Zhu, Y., Han, L., Jiang, G., Zhou, P., & Wang, Y. (2025). Hierarchical MoE: Continuous multimodal emotion recognition with incomplete and asynchronous inputs. arXiv. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
