Submitted: 13 June 2025
Posted: 17 June 2025
Abstract
Keywords:

1. Introduction
- Proposes the first compound emotion recognition framework that integrates voice, text, and facial features, supporting multi-emotion co-occurrence and weighted emotion perception.
- Develops an AU-driven 3D facial expression mixer combined with a state machine mechanism, enabling dynamic synthesis and control of blended emotional expressions.
- Designs an innovative generation agent module that bridges GPT-based semantic analysis with 3D expression control, enabling a real-time “speech-to-expression” response system.
- Validates the proposed system on the AffectNet dataset and Unreal Engine platform, demonstrating superior performance in recognition accuracy, expression naturalness, and interaction responsiveness.
2. Related Work
2.1. Facial Expression Mapping for Digital Humans
2.2. Selection of Backbone Network for Facial Emotion Recognition
2.3. Facial Animation Generation and Real-Time Interaction Technologies
3. Framework
3.1. ResNet18-Based Facial Expression Recognition Backbone
3.2. AU Extraction and Mapping
- Lookup Matching: The FindRow() method retrieves the AU motion template for the current expression category from the AU mapping dictionary and extracts its time series values (Time) and motion displacement magnitudes (Disp).
- Keyframe Construction: Keyframe handles (FKeyHandle) are created from the extracted values and automatically added to the curve trajectory.
- Curve Interpolation Control: The interpolation mode of each keyframe is set to cubic by calling the SetKeyInterpMode() method, which smooths the curve and supports continuous motion transitions.
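
The three steps above can be sketched outside the engine as follows. This is a minimal Python analogy and not the Unreal Engine implementation: the `AU_TEMPLATES` table, its values, and the helper names `find_row`, `build_keys`, and `eval_cubic` are hypothetical, and the cubic Hermite scheme with finite-difference tangents is assumed here as a stand-in for the engine's cubic key interpolation.

```python
from bisect import bisect_right

# Hypothetical AU mapping table: expression -> AU curve -> (Time, Disp) samples.
AU_TEMPLATES = {
    "happiness": {
        "mouthCornerPullL": [(0.0, 0.0), (0.5, 0.8), (1.0, 0.6)],
    },
}

def find_row(table, expression):
    """Lookup step: return the AU template for an expression, or None."""
    return table.get(expression)

def build_keys(samples):
    """Keyframe step: order the (time, disp) samples by time."""
    return sorted(samples)

def _tangent(keys, i):
    """Finite-difference tangent at key i (one-sided at the curve ends)."""
    lo = max(i - 1, 0)
    hi = min(i + 1, len(keys) - 1)
    dt = keys[hi][0] - keys[lo][0]
    return (keys[hi][1] - keys[lo][1]) / dt if dt else 0.0

def eval_cubic(keys, t):
    """Interpolation step: cubic Hermite evaluation of the curve at time t."""
    if t <= keys[0][0]:
        return keys[0][1]
    if t >= keys[-1][0]:
        return keys[-1][1]
    i = bisect_right([k[0] for k in keys], t) - 1
    (t0, p0), (t1, p1) = keys[i], keys[i + 1]
    h = t1 - t0
    s = (t - t0) / h
    m0 = _tangent(keys, i) * h
    m1 = _tangent(keys, i + 1) * h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

template = find_row(AU_TEMPLATES, "happiness")
keys = build_keys(template["mouthCornerPullL"])
samples = [eval_cubic(keys, step / 10) for step in range(11)]
```

The curve passes through every keyframe exactly and varies smoothly between them, which is the property the cubic mode is chosen for in the list above.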
3.3. Emotional Semantic Parsing and Expression Visualization
4. Validation
4.1. Algorithm Architecture Output Analysis
4.2. Subjective Questionnaire Experiment Design
4.2.1. Stimulus Material Generation Method
4.2.2. Participants and Task Arrangement
4.3. Experimental Method
4.4. Subjective Questionnaire Analysis
4.4.1. Main Effect of System Type
4.4.2. Paired-Sample Comparison
4.4.3. Effects of Age and Technology Acceptance
4.4.4. Inter-Dimensional Correlation Analysis
4.5. Objective Analysis of AU Activation Patterns
4.5.1. Total Activation Energy
4.5.2. Spatial Distribution Pattern
4.5.3. Temporal Evolution Structure
4.6. Correlation Analysis Between Subjective Ratings and Objective AU Features
4.7. Conclusion
4.7.1. Key Findings and Contributions
4.7.2. Limitations and Future Work
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhao, X.; Zhang, S.; Wang, X.; Zhang, G. Multimodal Emotion Recognition Integrating Affective Speech with Facial Expression 2014.
- Liu, H.; Zhu, Z.; Iwamoto, N.; Peng, Y.; Li, Z.; Zhou, Y.; Bozkurt, E.; Zheng, B. BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis, 2022, [arXiv:cs/2203.05297].
- Liu, H.; Zhu, Z.; Becherini, G.; Peng, Y.; Su, M.; Zhou, Y.; Zhe, X.; Iwamoto, N.; Zheng, B.; Black, M.J. EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1144–1154.
- Kim, J.O. Impact Analysis of Nonverbal Multimodals for Recognition of Emotion Expressed Virtual Humans 2012. 13, 9–19.
- Parke, F.I. Computer Generated Animation of Faces. In Proceedings of the ACM Annual Conference - Volume 1, New York, NY, USA, 1972; Vol. 1, ACM ’72, pp. 451–457.
- Facial Action Coding System. Wikipedia 2025.
- Waters, K. A Muscle Model for Animation Three-Dimensional Facial Expression. SIGGRAPH Comput. Graph. 1987, 21, 17–24.
- Terzopoulos, D.; Waters, K. Physically-Based Facial Modelling, Analysis, and Animation. The Journal of Visualization and Computer Animation 1990, 1, 73–80.
- Yu, S. Segmentation Induced by Scale Invariance. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1; 2005; pp. 444–451.
- Baraff, D.; Witkin, A. Large Steps in Cloth Simulation. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, New York, NY, USA, 1998; SIGGRAPH ’98, pp. 43–54.
- Haykin, S.; Sellathurai, M.; de Jong, Y.; Willink, T. Turbo-MIMO for Wireless Communications. IEEE Communications Magazine 2004, 42, 48–53.
- Raouzaiou, A.; Tsapatsoulis, N.; Karpouzis, K.; Kollias, S. Parameterized Facial Expression Synthesis Based on MPEG-4. EURASIP Journal on Advances in Signal Processing 2002, 2002, 521048.
- Pourebadi, M.; Pourebadi, M. MLP Neural Network Based Approach for Facial Expression Analysis 2016.
- Ramos, P.L.; Louzada, F.; Ramos, E. An Efficient, Closed-Form MAP Estimator for Nakagami-m Fading Parameter. IEEE Communications Letters 2016, 20, 2328–2331.
- Aladeemy, M.; Tutun, S.; Khasawneh, M.T. A New Hybrid Approach for Feature Selection and Support Vector Machine Model Selection Based on Self-Adaptive Cohort Intelligence. Expert Systems with Applications 2017, 88, 118–131.
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Transactions on Affective Computing 2019, 10, 18–31.
- Islamadina, R.; Saddami, K.; Oktiana, M.; Abidin, T.F.; Muharar, R.; Arnia, F. Performance of Deep Learning Benchmark Models on Thermal Imagery of Pain through Facial Expressions. 2022 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT) 2022, pp. 374–379.
- Savchenko, A. Facial Expression and Attributes Recognition Based on Multi-Task Learning of Lightweight Neural Networks. 2021 IEEE 19th International Symposium on Intelligent Systems and Informatics (SISY) 2021, pp. 119–124.
- Chen, Y. Facial Expression Recognition Based on ResNet and Transfer Learning. Applied and Computational Engineering 2023.
- Jiang, S.; Xu, X.; Liu, F.; Xing, X.; Wang, L. CS-GResNet: A Simple and Highly Efficient Network for Facial Expression Recognition. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022, pp. 2599–2603.
- Echoukairi, H.; Ghmary, M.E.; Ziani, S.; Ouacha, A. Improved Methods for Automatic Facial Expression Recognition. Int. J. Interact. Mob. Technol. 2023, 17, 33–44.
- Ngo, T.; Yoon, S. Facial Expression Recognition on Static Images 2019. pp. 640–647.
- Bänziger, T.; Mortillaro, M.; Scherer, K.R. Introducing the Geneva Multimodal Expression Corpus for Experimental Research on Emotion Perception. Emotion 2012, 12, 1161–1179.
- Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The Extended Cohn-Kanade Dataset (CK+): A Complete Dataset for Action Unit and Emotion-Specified Expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops; 2010; pp. 94–101.
- Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P.; Cohn, J.F. DISFA: A Spontaneous Facial Action Intensity Database. IEEE Transactions on Affective Computing 2013, 4, 151–160.
- List of Facial Expression Databases. Wikipedia 2025.
- HapFACS 3.0: FACS-Based Facial Expression Generator for 3D Speaking Virtual Characters. https://www.researchgate.net/publication/277349579_HapFACS_30_FACS-based_facial_expression_generator_for_3D_speaking_virtual_characters.
- Krumhuber, E.G.; Tamarit, L.; Roesch, E.B.; Scherer, K.R. FACSGen 2.0 Animation Software: Generating Three-Dimensional FACS-Valid Facial Expressions for Emotion Research. Emotion 2012, 12, 351–363.
- Pumarola, A.; Agudo, A.; Martinez, A.M.; Sanfeliu, A.; Moreno-Noguer, F. GANimation: Anatomically-Aware Facial Animation from a Single Image, 2018, [arXiv:cs/1807.09251].
- Animation Curve Editor in Unreal Engine | Unreal Engine 5.5 Documentation | Epic Developer Community. https://dev.epicgames.com/documentation/en-us/unreal-engine/animation-curve-editor-in-unreal-engine.
- Pauletto, S.; Balentine, B.; Pidcock, C.; Jones, K.; Bottaci, L.; Aretoulaki, M.; Wells, J.; Mundy, D.; Balentine, J.R. Exploring Expressivity and Emotion with Artificial Voice and Speech Technologies. Logopedics Phoniatrics Vocology 2013, 38, 115–125.
- Bauer, N.; Preisig, M.; Volk, M. Offensiveness, Hate, Emotion and GPT: Benchmarking GPT3.5 and GPT4 as Classifiers on Twitter-Specific Datasets 2024.
- Ma, Z.; Wu, W.; Zheng, Z.; Guo, Y.; Chen, Q.; Zhang, S.; Chen, X. Leveraging Speech PTM, Text LLM, And Emotional TTS For Speech Emotion Recognition. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023, pp. 11146–11150.
- Coppini, S.; Lucifora, C.; Vicario, C.; Gangemi, A. Experiments on Real-Life Emotions Challenge Ekman’s Model. Scientific Reports 2023, 13.
- Sabini, J.; Silver, M. Ekman’s Basic Emotions: Why Not Love and Jealousy? Cognition and Emotion 2005, 19, 693–712.
- Lian, Z.; Sun, L.; Sun, H.; Chen, K.; Wen, Z.; Gu, H.; Chen, S.; Liu, B.; Tao, J. GPT-4V with Emotion: A Zero-Shot Benchmark for Multimodal Emotion Understanding. ArXiv 2023, abs/2312.04293.








| Model | F1-score | Accuracy | Inference Time (ms) | Model Size (MB) |
|---|---|---|---|---|
| MLP [13] | 0.028 | 0.125 | 2.16 | 24.02 |
| MobileNetV2 [17,18] | 0.505 | 0.504 | 5.85 | 8.52 |
| ResNet18 V3-Opt [19,20] | 0.625 | 0.626 | 1.31 | 42.65 |
| ResNet50 [21,22] | 0.539 | 0.536 | 7.76 | 89.70 |

| Layer | Kernel Configuration | Output Dimension |
|---|---|---|
| Input | RGB image: | |
| Conv1 | , 64 filters, stride | |
| MaxPooling (optional) | , stride | |
| ResBlock1 | , 64 × 2 | |
| ResBlock2 | , 128 × 2 | |
| ResBlock3 | , 256 × 2 | |
| ResBlock4 | , 512 × 2 (fine-tuned) | |
| Global Avg. Pooling | AdaptiveAvgPool2d | |
| Fully Connected (FC) | Linear(512 → 7) | 7 logits |
| Softmax Activation | Softmax over 7 logits | 7-dimensional probability |
| Output | argmax(Softmax)→ predicted label | Label + confidence vector |
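
The last three rows of the table (7 logits, softmax, argmax) can be illustrated with a short standard-library sketch. The label order in `LABELS` and the example logits are assumptions made for illustration; the table itself only states that the head produces 7 logits and returns a label with a confidence vector.

```python
import math

# Assumed 7-class label order; the table does not list the class names.
LABELS = ["neutral", "happiness", "sadness", "surprise", "fear", "disgust", "anger"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (predicted label, 7-dimensional confidence vector)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

# Example logits (hypothetical) as they might come out of the FC layer.
label, confidence = predict([0.1, 2.0, -1.0, 0.3, 0.0, -0.5, 0.2])
```

Keeping the full confidence vector, rather than only the argmax label, is what allows downstream stages to weight several co-occurring emotions.
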

| Variant | Data Augmentation | Loss Function | Optimizer + Scheduler | Train Acc (%) | Val Acc (%) | Val F1 |
|---|---|---|---|---|---|---|
| ResNet18 v1 | Flip, rotate, jitter | CrossEntropy | Adam | 90.44 | 54.90 | 0.5438 |
| ResNet18 v2 | Same as v1 | CrossEntropy | Adam | 59.72 | 56.14 | 0.5590 |
| ResNet18 v3 | Flip, rotate, affine, color | CE + Label Smoothing | AdamW + CosineAnnealing | 69.49 | 62.57 | 0.6254 |
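
The two ingredients that distinguish the v3 variant in the table, label smoothing and cosine annealing, can be written out as follows. This is a plain-Python sketch of the standard formulas, not the training code behind the table; the smoothing factor `eps`, the learning-rate bounds, and the step counts are hypothetical, since the table does not report them.

```python
import math

def smoothed_cross_entropy(probs, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    The true class receives (1 - eps) + eps / K and every other class eps / K,
    where K is the number of classes. eps = 0.1 is an assumed value.
    """
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        q = (1.0 - eps) + eps / k if i == target else eps / k
        loss -= q * math.log(p)
    return loss

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# Hypothetical schedule: 10 steps, starting at 1e-3.
schedule = [cosine_annealing_lr(s, 10, 1e-3) for s in range(11)]
```

Label smoothing penalizes over-confident predictions, which is consistent with the smaller gap between training and validation accuracy reported for v3 compared with v1.
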

| Ratings | Mean ± SD (Static) | Mean ± SD (Dynamic) | F | p-value | Partial η² | Significant Interaction |
|---|---|---|---|---|---|---|
| Naturalness | 3.31 ± 0.94 | 4.02 ± 0.80 | 42.21 | <0.001 | 0.507 | SystemType × Age (p = 0.008) |
| Congruence | 3.41 ± 0.89 | 4.12 ± 0.73 | 34.78 | <0.001 | 0.459 | – |
| Layering | 3.33 ± 0.91 | 4.09 ± 0.72 | 39.65 | <0.001 | 0.492 | – |
| Satisfaction | 3.31 ± 0.93 | 3.95 ± 0.82 | 21.45 | <0.001 | 0.344 | – |
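
For readers who want to relate the F statistics in this table to the effect-size column, partial eta squared can be recovered from F and its degrees of freedom as η²p = (F · df_effect) / (F · df_effect + df_error). The sketch below applies that standard identity. The degrees of freedom (1, 41) are an assumption: the table does not report them, and they are used here only because they approximately reproduce the listed effect sizes.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F values as reported for the main effect of system type.
reported_f = {
    "Naturalness": 42.21,
    "Congruence": 34.78,
    "Layering": 39.65,
    "Satisfaction": 21.45,
}

# df = (1, 41) is assumed, not taken from the source.
effect_sizes = {
    name: partial_eta_squared(f, df_effect=1, df_error=41)
    for name, f in reported_f.items()
}
```
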

| Dimension | Static (M ± SD) | Dynamic (M ± SD) | t | df | p (2-tailed) | Cohen’s d | 95% CI (d) |
|---|---|---|---|---|---|---|---|
| Naturalness | 3.54 ± 0.54 | 3.94 ± 0.41 | -4.43 | 50 | <0.001 | -0.62 | [-0.92, -0.32] |
| Congruence | 3.44 ± 0.61 | 3.80 ± 0.61 | -4.70 | 50 | <0.001 | -0.66 | [-0.96, -0.35] |
| Layering | 3.20 ± 0.42 | 4.08 ± 0.43 | -10.62 | 50 | <0.001 | -1.49 | [-1.88, -1.08] |
| Satisfaction | 3.45 ± 0.41 | 4.11 ± 0.36 | -9.43 | 50 | <0.001 | -1.32 | [-1.69, -0.94] |
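
The effect sizes in this table are consistent with the paired-samples form of Cohen's d, d_z = t / √n with n = df + 1. The sketch below assumes that definition; the source does not state which variant of d was computed, so the function name `cohens_dz` and the choice of formula are assumptions.

```python
import math

def cohens_dz(t_value, df):
    """Cohen's d_z for a paired-samples t-test: t divided by sqrt(n), n = df + 1."""
    n = df + 1
    return t_value / math.sqrt(n)

# t statistics as reported in the table, all with df = 50.
reported_t = {
    "Naturalness": -4.43,
    "Congruence": -4.70,
    "Layering": -10.62,
    "Satisfaction": -9.43,
}

dz = {name: round(cohens_dz(t, 50), 2) for name, t in reported_t.items()}
```

The negative sign follows from the subtraction order (static minus dynamic), so larger magnitudes indicate a stronger advantage for the dynamic system.
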

| Emotion | AU Name | Dyn. Energy | Stat. Energy | p-value | Cohen’s d |
|---|---|---|---|---|---|
| Angry | mouthLipsTogetherDL | 28.15 | 6.69 | <0.001 | 1.64 |
| | mouthLipsTogetherDR | 28.15 | 6.69 | <0.001 | 1.64 |
| | mouthLipsTogetherUL | 24.99 | 5.73 | <0.001 | 1.84 |
| | mouthLipsTogetherUR | 24.99 | 5.73 | <0.001 | 1.84 |
| | browRaiseOuterL | 20.67 | 37.94 | <0.001 | -2.11 |
| Happiness | mouthCornerPullR | 25.71 | 34.76 | <0.001 | -1.18 |
| | mouthCornerPullL | 23.37 | 31.70 | <0.001 | -1.18 |
| | browRaiseInL | 23.85 | 25.79 | 0.005 | -0.37 |
| | browRaiseInR | 22.20 | 23.55 | 0.029 | -0.29 |
| | browRaiseOuterL | 22.04 | 23.63 | 0.010 | -0.34 |
| Sadness | browRaiseOuterL | 12.10 | 0.00 | <0.001 | 1.43 |
| | browLateralL | 10.78 | 27.18 | <0.001 | -1.96 |
| | browDownL | 9.92 | 26.80 | <0.001 | -1.96 |
| | browRaiseInL | 9.35 | 16.22 | <0.001 | -1.89 |
| | mouthLipsTogetherDL | 34.19 | 36.93 | 0.007 | -0.36 |
| Neutral | mouthUpperLipRollInL | 6.61 | 0.56 | <0.001 | 1.39 |
| | mouthUpperLipRollInR | 6.61 | 0.57 | <0.001 | 1.39 |
| | mouthCornerDepressR | 6.30 | 0.00 | <0.001 | 1.41 |
| | mouthLipsTogetherDL | 33.45 | 31.10 | <0.001 | 0.48 |
| | mouthLipsTogetherDR | 33.45 | 31.10 | <0.001 | 0.48 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).