Submitted: 14 August 2025
Posted: 14 August 2025
Abstract
Keywords:
1. Introduction
- Spatiotemporal Feature Extraction: A 3D-ResNet-18 network—a convolutional architecture designed for capturing spatiotemporal patterns—is employed to extract motion-aware features from video sequences and identify the onset, apex, and offset frames of micro-expressions.
- Structural Dependency Modeling: A Graph Convolutional Network (GCN) is constructed using a directed and asymmetric adjacency matrix derived from AU co-occurrence probabilities, enabling the model to capture interdependent facial activations.
- 3D Animation Synthesis: Recognized expression sequences are converted into continuous animation control curves, driving the facial rig in Unreal Engine, a widely used real-time 3D rendering engine, to generate fine-grained, high-fidelity facial animations.
2. Related Work
2.1. Trends in Micro-Expression Recognition
2.2. AU Detection
2.3. Micro-Expression Modeling in Virtual Humans
2.4. Comparative Analysis and Methodological Innovations
2.4.1. Methodological Innovations
- Graph-Based Modeling of Action Units: Each AU is represented as a node in a graph whose directed edges are encoded by an asymmetric adjacency matrix, enabling the model to capture the inherent structural dependencies among facial muscles. A Graph Convolutional Network (GCN) built upon AU co-occurrence statistics is employed to strengthen the structural sensitivity and generalization capability of the recognition process.
- Joint Spatiotemporal Feature Extraction: To capture spatial configurations and temporal dynamics simultaneously, a 3D-ResNet-18 backbone is adopted. The backbone is further integrated with an Enriched Long-term Recurrent Convolutional Network (ELRCN), which improves sensitivity to the transient, low-intensity motion cues that are critical for micro-expression analysis.
- Emotion-Driven Animation Mapping Mechanism: The extracted AU activation patterns are mapped into parameterized facial muscle trajectories via continuous motion curves, which in turn drive the expression synthesis module of the virtual human. This mapping strategy enables the generation of contextually appropriate and emotionally expressive facial animations, exceeding the expressive capacity of traditional template-based methods.
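The graph-based modeling described in the first bullet above can be illustrated with a minimal sketch. It assumes that each edge weight is the conditional probability P(AU_j active | AU_i active) estimated from binary AU labels, and that one graph-convolution step is a row-normalized aggregation followed by a linear map and a ReLU. The function names, the toy labels, and the normalization choice are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def cooccurrence_adjacency(labels: List[List[int]]) -> Matrix:
    """Estimate A[i][j] = P(AU_j active | AU_i active) from binary AU labels.

    The result is asymmetric in general because P(j | i) != P(i | j).
    """
    n_aus = len(labels[0])
    count_i = [0] * n_aus
    count_ij = [[0] * n_aus for _ in range(n_aus)]
    for sample in labels:
        for i in range(n_aus):
            if sample[i]:
                count_i[i] += 1
                for j in range(n_aus):
                    if sample[j]:
                        count_ij[i][j] += 1
    return [
        [count_ij[i][j] / count_i[i] if count_i[i] else 0.0 for j in range(n_aus)]
        for i in range(n_aus)
    ]


def gcn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One propagation step: ReLU(D^-1 A H W), with D the row sums of A."""
    n_nodes, n_in, n_out = len(adj), len(weight), len(weight[0])
    out = []
    for i in range(n_nodes):
        row_sum = sum(adj[i]) or 1.0
        # Aggregate neighbour features with row-normalized edge weights.
        agg = [
            sum(adj[i][k] / row_sum * feats[k][f] for k in range(n_nodes))
            for f in range(n_in)
        ]
        # Linear transform followed by ReLU.
        out.append(
            [max(0.0, sum(agg[f] * weight[f][o] for f in range(n_in))) for o in range(n_out)]
        )
    return out


# Toy labels for three hypothetical AUs (rows are samples, columns are AUs).
toy_labels = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
adjacency = cooccurrence_adjacency(toy_labels)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
propagated = gcn_layer(adjacency, identity, identity)
```

In the toy data the third AU never appears without the second, so the edge from AU 3 to AU 2 has weight 1, while the reverse edge has weight 1/3. This is the kind of directional dependency that a symmetric matrix could not express.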
2.4.2. Architectural Innovations
- Spatiotemporal Feature Extraction: Given that micro-expressions (MEs) are brief (< 0.5 s) and subtle in amplitude, we utilize a lightweight yet effective backbone, 3D-ResNet-18, for end-to-end modeling of video segments. The kernels of a 3D Convolutional Neural Network (3D-CNN) slide jointly across the spatial (x, y) and temporal (t) dimensions, enabling the network to perceive fine-grained motion variations between consecutive frames. This makes it particularly suitable for temporally sensitive and low-amplitude signals such as MEs [27]. Additionally, the Enriched Long-term Recurrent Convolutional Network (ELRCN) [32] incorporates two learning modules to strengthen both spatial and temporal representations.
- AU Relationship Modeling: AUs, as defined in the Facial Action Coding System (FACS), are physiologically interpretable units that encode facial muscle movements and exhibit cross-subject consistency. Therefore, they are widely used in micro-expression analysis and synthesis. Liu et al. [33] manually defined 13 facial regions and used 3D filters to perform convolution over feature maps for AU localization. Inspired by this approach, we introduce a GCN to model co-occurrence relationships between AUs. Each AU is represented as a node in a graph, and the edge weights are defined based on empirical co-occurrence probabilities.
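To make the joint sliding over (x, y, t) described above concrete, the following is a naive single-channel 3D cross-correlation written in plain Python. It is a didactic sketch only: the clip, the kernel, and the function name are assumptions for illustration, and a real 3D-ResNet-18 would use an optimized multi-channel implementation with learned kernels.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [t][y][x]


def conv3d_valid(clip: Volume, kernel: Volume) -> Volume:
    """Single-channel 3D cross-correlation, stride 1, no padding."""
    n_t, n_y, n_x = len(clip), len(clip[0]), len(clip[0][0])
    k_t, k_y, k_x = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(n_t - k_t + 1):
        plane = []
        for y in range(n_y - k_y + 1):
            row = []
            for x in range(n_x - k_x + 1):
                acc = 0.0
                # The kernel covers a small block of pixels AND frames at once.
                for dt in range(k_t):
                    for dy in range(k_y):
                        for dx in range(k_x):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: three 2x2 frames; only the last frame changes at pixel (0, 0).
toy_clip = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
]
# A 2x1x1 temporal-difference kernel: next frame minus current frame.
temporal_diff = [[[-1.0]], [[1.0]]]
response = conv3d_valid(toy_clip, temporal_diff)
```

The response is zero wherever consecutive frames are identical and non-zero only at the pixel and time step where the intensity changes, which is why kernels with temporal extent can respond to brief, low-amplitude motion.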
3. Framework and Methods
3.1. Overall System Architecture
3.2. Temporal Segmentation
3.3. AU Graph Modeling with Emotion Layers
3.4. Animation Synthesis with Diverse Emotional Profiles
- Invoke the FindRow method to match the input expression curve with entries in the AU dictionary and extract the corresponding RowValue (a time–displacement key-value pair).
- Based on RowValue->Time and RowValue->Disp, create animation keyframes via FKeyHandle and append them to the animation curve.
- Use SetKeyTangentMode to set the tangent mode to automatic (RCTM_Auto), and call SetKeyInterpMode to set the interpolation mode to cubic (RCIM_Cubic), improving transition quality between keyframes.
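The three steps above (dictionary lookup, keyframe creation, cubic interpolation with automatic tangents) can be mimicked outside the engine with a short Python sketch. This is not the Unreal Engine API: `AU_DICTIONARY`, `find_row`, `auto_tangents`, and `eval_curve` are hypothetical names, the key values are invented, and the finite-difference tangent rule is an assumption that need not match how Unreal computes automatic tangents.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Key = Tuple[float, float]  # (time in seconds, displacement)

# Hypothetical AU dictionary: AU name -> onset, apex and offset keys.
AU_DICTIONARY = {"AU12": [(0.0, 0.0), (0.2, 1.0), (0.4, 0.0)]}


def find_row(au_name: str) -> Optional[List[Key]]:
    """Look up the time-displacement rows for an AU (None if unknown)."""
    return AU_DICTIONARY.get(au_name)


def auto_tangents(keys: List[Key]) -> List[float]:
    """Finite-difference slope at each key (one-sided at both ends)."""
    last = len(keys) - 1
    slopes = []
    for i in range(len(keys)):
        lo, hi = max(i - 1, 0), min(i + 1, last)
        dt = keys[hi][0] - keys[lo][0]
        slopes.append((keys[hi][1] - keys[lo][1]) / dt if dt else 0.0)
    return slopes


def eval_curve(keys: List[Key], t: float) -> float:
    """Cubic Hermite interpolation between keys; clamps outside the key range."""
    if t <= keys[0][0]:
        return keys[0][1]
    if t >= keys[-1][0]:
        return keys[-1][1]
    i = bisect_right([k[0] for k in keys], t) - 1
    (t0, p0), (t1, p1) = keys[i], keys[i + 1]
    slopes = auto_tangents(keys)
    h = t1 - t0
    s = (t - t0) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * h * slopes[i] + h01 * p1 + h11 * h * slopes[i + 1]


keys = find_row("AU12")
samples = [eval_curve(keys, step * 0.05) for step in range(9)]
```

The curve passes exactly through every key and varies smoothly in between, which is the property that cubic interpolation is used for when driving a facial rig from sparse onset, apex, and offset keys.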
4. Validation
4.1. System Performance Evaluation
4.1.1. Dataset and Experimental Settings
4.1.2. Implementation Details
| Model | k-fold Acc. (%) | LOSO Acc. (%) |
|---|---|---|
| 3D-CNN (pre-trained) | 55.37 | 42.64 |
| 3D-CNN | 57.05 | 45.03 |
| AU_GCN_CUR | 64.11 | 45.71 |
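The table above reports both k-fold and leave-one-subject-out (LOSO) accuracy. The sketch below shows how LOSO folds are typically formed: every sample from one subject is held out at a time, so no subject appears in both the training and the test set. The function name and the toy subject identifiers are assumptions for illustration; the section does not describe the authors' splitting code.

```python
from typing import Iterator, List, Tuple


def loso_splits(subject_ids: List[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for subject in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train, test


# Toy example: six clips recorded from three hypothetical subjects.
toy_subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(loso_splits(toy_subjects))
```

Under a plain k-fold split, clips from the same subject can land in both partitions, so LOSO is generally the stricter test of cross-subject generalization.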
4.2. User Subjective Perception Study
4.2.1. Participant Demographics
4.2.2. Experimental Hypothesis and Questionnaire Design
4.3. Data Analysis
4.4. Questionnaire Analysis Results
- In terms of overall perceptual ratings, Video B was rated significantly higher than Video A, indicating that the inclusion of micro-expressions had a positive impact on the overall user experience.
- In the analysis of the six basic emotions, Video B showed significantly higher ratings for fear (p < 0.001) and disgust (p = 0.008), both of which met the Bonferroni-corrected significance threshold. This suggests that micro-expressions notably enhanced the expressiveness and perceptual salience of specific negative emotions in virtual humans.
- In the paired-sample t-tests, Video B also received significantly higher ratings than Video A in the dimensions of clarity (p < 0.001, Cohen’s d = 0.678) and authenticity (p = 0.004, Cohen’s d = 0.686), indicating that micro-expressions improved both the detail and credibility of facial expressions.
- Regression analysis further revealed that participants’ recognition scores of the virtual human significantly predicted their ratings of emotional expressiveness (coefficient = 0.336, p < 0.001). The regression model met the assumptions of residual normality and independence, indicating a positive association between recognition clarity and emotion perception (R² = 0.277).
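The paired comparisons summarized above rest on three quantities: the paired t statistic, an effect size, and a Bonferroni-adjusted threshold. The sketch below computes them with the standard library only. The ratings are invented toy numbers, the effect size is the paired-differences variant (mean difference divided by the standard deviation of the differences), and dividing alpha by the number of emotions is an assumption about how the correction was applied. The section does not state which Cohen's d formula or correction family the authors used.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_stats(a: List[float], b: List[float]) -> Tuple[float, float, int]:
    """Return (t statistic, Cohen's d for paired differences, degrees of freedom)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    t_stat = d_mean / (d_sd / math.sqrt(n))
    cohen_d = d_mean / d_sd
    return t_stat, cohen_d, n - 1


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Toy ratings from four hypothetical participants (not the study data).
ratings_a = [3.0, 4.0, 3.0, 5.0]
ratings_b = [4.0, 5.0, 5.0, 6.0]
t_stat, effect_size, dof = paired_stats(ratings_a, ratings_b)
threshold = bonferroni_alpha(0.05, 6)  # six basic emotions
```

A negative t statistic here means that the second condition was rated higher, matching the sign convention of the Diff (A-B) columns in the result tables.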
5. Conclusion
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Queiroz, R.; Musse, S.; Badler, N. Investigating Macroexpressions and Microexpressions in Computer Graphics Animated Faces. PRESENCE: Teleoperators and Virtual Environments 2014, 23, 191–208. https://doi.org/10.1162/PRES_a_00180.
- Hou, T.; Adamo, N.; Villani, N.J. Micro-expressions in Animated Agents. Intelligent Human Systems Integration (IHSI 2022), 2022. https://doi.org/10.54941/ahfe1001081.
- Li, X.; Hong, X.; Moilanen, A.; Huang, X.; Pfister, T.; Zhao, G.; Pietikäinen, M. Reading Hidden Emotions: Spontaneous Micro-expression Spotting and Recognition. arXiv 2015, arXiv:1511.00423. https://arxiv.org/abs/1511.00423.
- Ren, H.; Zheng, Z.; Zhang, J.; Wang, Q.; Wang, Y. Electroencephalography (EEG)-Based Comfort Evaluation of Free-Form and Regular-Form Landscapes in Virtual Reality. Applied Sciences 2024, 14(2), 933. https://doi.org/10.3390/app14020933.
- Shi, M.; Wang, R.; Zhang, L. Novel Insights into Rural Spatial Design: A Bio-Behavioral Study Employing Eye-Tracking and Electrocardiography Measures. PLoS ONE 2025, 20(5), e0322301. https://doi.org/10.1371/journal.pone.0322301.
- Yan, W.J.; Wu, Q.; Liang, J.; Chen, Y.H.; Fu, X. How Fast Are the Leaked Facial Expressions: The Duration of Micro-Expressions. Journal of Nonverbal Behavior 2013, 37, 217–230.
- Hong, X.; Xu, Y.; Zhao, G. LBP-TOP: A Tensor Unfolding Revisit. In Proceedings of the Computer Vision – ACCV 2016 Workshops; Chen, C.S.; Lu, J.; Ma, K.K., Eds., Cham, 2017; pp. 513–527.
- Chaudhry, R.; Ravichandran, A.; Hager, G.; Vidal, R. Histograms of Oriented Optical Flow and Binet-Cauchy Kernels on Nonlinear Dynamical Systems for the Recognition of Human Actions. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009; pp. 1932–1939.
- Horn, B.K.P.; Schunck, B.G. Determining Optical Flow 1980.
- Koenderink, J.J. Optic Flow. Vision Research 1986, 26, 161–179.
- Yan, W.J.; Li, X.; Wang, S.J.; Zhao, G.; Liu, Y.J.; Chen, Y.H.; Fu, X. CASME II: An Improved Spontaneous Micro-Expression Database and the Baseline Evaluation. PLOS ONE 2014, 9, e86041.
- Khor, H.Q.; See, J.; Phan, R.C.W.; Lin, W. Enriched Long-Term Recurrent Convolutional Network for Facial Micro-Expression Recognition. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, pp. 667–674.
- Qingqing, W. Micro-Expression Recognition Method Based on CNN-LSTM Hybrid Network. International Journal of Wireless and Mobile Computing 2022, 23, 67–77.
- Zhao, Z.; Zhao, S.; Shen, J. Real-Time and Light-Weighted Unsupervised Video Object Segmentation Network. Pattern Recognition 2021, 120, 108120.
- Zhao, X.; Ma, H.; Wang, R. STA-GCN: Spatio-Temporal AU Graph Convolution Network for Facial Micro-Expression Recognition. In Proceedings of the Pattern Recognition and Computer Vision; Ma, H.; Wang, L.; Zhang, C.; Wu, F.; Tan, T.; Wang, Y.; Lai, J.; Zhao, Y., Eds., Cham, 2021; pp. 80–91.
- Lo, L.; Xie, H.X.; Shuai, H.H.; Cheng, W.H. MER-GCN: Micro-Expression Recognition Based on Relation Modeling with Graph Convolutional Networks. In Proceedings of the 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR); 2020; pp. 79–84.
- A New Neuro-Optimal Nonlinear Tracking Control Method via Integral Reinforcement Learning with Applications to Nuclear Systems. Neurocomputing 2022, 483, 361–369. [CrossRef]
- Zhang, L.W.; Li, J.; Wang, S.J.; Duan, X.H.; Yan, W.J.; Xie, H.Y.; Huang, S.C. Spatio-Temporal Fusion for Macro- and Micro-Expression Spotting in Long Video Sequences. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020); 2020; pp. 734–741.
- Yap, C.H.; Yap, M.H.; Davison, A.; Kendrick, C.; Li, J.; Wang, S.J.; Cunningham, R. 3D-CNN for Facial Micro- and Macro-Expression Spotting on Long Video Sequences Using Temporal Oriented Reference Frame. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 2022; MM ’22, pp. 7016–7020.
- Lin, T.; Zhao, X.; Su, H.; Wang, C.; Yang, M. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
- Ekman, P. Facial Action Coding System (FACS). A Human Face 2002.
- Matsuyama, Y.; Bhardwaj, A.; Zhao, R.; Romeo, O.; Akoju, S.; Cassell, J. Socially-Aware Animated Intelligent Personal Assistant Agent. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Los Angeles, 2016; pp. 224–227.
- Marsella, S.; Gratch, J. University of Southern California.
- Coface: Global Credit Insurance Solutions To Protect Your Business. https://www.coface.com/, 2025.
- Wang, S.J.; Li, B.J.; Liu, Y.J.; Yan, W.J.; Ou, X.; Huang, X.; Xu, F.; Fu, X. Micro-Expression Recognition with Small Sample Size by Transferring Long-Term Convolutional Neural Network. Neurocomputing 2018, 312, 251–262.
- Learning Spatiotemporal Features with 3D Convolutional Networks. https://www.researchgate.net/publication/300408292_Learning_Spatiotemporal_Features_with_3D_Convolutional_Networks.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile; 2015; pp. 4489–4497.
- Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric Non-Local Neural Networks for Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019; pp. 593–602.
- Li, H.; Lin, Z.; Shen, X.; Brandt, J.; Hua, G. A Convolutional Neural Network Cascade for Face Detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015; pp. 5325–5334.
- Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017; pp. 4724–4733.
- Ochs, P.; Brox, T. Object Segmentation in Video: A Hierarchical Variational Approach for Turning Point Trajectories into Dense Regions. In Proceedings of the 2011 International Conference on Computer Vision; 2011; pp. 1583–1590.
- Ding, S.; Qu, S.; Xi, Y.; Sangaiah, A.K.; Wan, S. Image Caption Generation with High-Level Image Features. Pattern Recognition Letters 2019, 123, 89–95.
- Liu, J.; Zheng, W.; Zong, Y. SMA-STN: Segmented Movement-Attending Spatiotemporal Network for Micro-Expression Recognition, 2020, [arXiv:cs/2010.09342].
- Yan, W.; Wu, Q.; Liu, Y.; Wang, S.; Chen, Y.; Fu, X. CASME II: An improved spontaneous micro-expression database and the baseline evaluation, 2014.
- Liu, W.; Luo, W.; Lian, D.; Gao, S. Future Frame Prediction for Anomaly Detection - A New Baseline. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018; pp. 6536–6545.
- Wilson, J.; Song, J.; Fu, Y.; Zhang, A.; Capodieci, A.; Jayakumar, P.; Barton, K.; Ghaffari, M. MotionSC: Data Set and Network for Real-Time Semantic Mapping in Dynamic Environments, 2022. https://doi.org/10.48550/arXiv.2203.07060.
- Epic Games. FRichCurve API Reference - Unreal Engine 5.0 Documentation. https://dev.epicgames.com/documentation/en-us/unreal-engine/API/Runtime/Engine/Curves/FRichCurve?application_version=5.0.
- Wang, F.; Ainouz, S.; Lian, C.; Bensrhair, A. Multimodality Semantic Segmentation Based on Polarization and Color Images. Neurocomputing 2017, 253, 193–200.
- Fan, Y.; Lu, X.; Li, D.; Liu, Y. Video-Based Emotion Recognition Using CNN-RNN and C3D Hybrid Networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, New York, NY, USA, 2016; ICMI ’16, pp. 445–450.
- Quang, N.V.; Chun, J.; Tokuyama, T. CapsuleNet for Micro-Expression Recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 2019, pp. 1–7.
- Mei, L.; Lai, J.; Feng, Z.; Xie, X. Open-World Group Retrieval with Ambiguity Removal: A Benchmark. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR); 2021; pp. 584–591.
- Yuhong, H. Research on Micro-Expression Spotting Method Based on Optical Flow Features. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 2021; MM ’21, pp. 4803–4807.
- Cohn, J.; Zlochower, A.; Lien, J.; Kanade, T. Feature-Point Tracking by Optical Flow Discriminates Subtle Differences in Facial Expression. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition; 1998; pp. 396–401.
- Li, M.; Zha, Q.; Wu, H. Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition, 2025, [arXiv:cs/2502.21004].
- Yang, B.; Wu, J.; Ikeda, K.; Hattori, G.; Sugano, M.; Iwasawa, Y.; Matsuo, Y. Deep Learning Pipeline for Spotting Macro- and Micro-Expressions in Long Video Sequences Based on Action Units and Optical Flow. Pattern Recognition Letters 2023, 165, 63–74.
- Yu, W.W.; Jiang, J.; Li, Y.J. LSSNet: A Two-Stream Convolutional Neural Network for Spotting Macro- and Micro-Expression in Long Videos. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 2021; MM ’21, pp. 4745–4749.
- Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Routledge: New York, 2013.






| Layer | Kernel Size | Output Size |
|---|---|---|
| Conv1 | , stride | |
| ResBlock1 | ||
| ResBlock2 | ||
| ResBlock3 | ||
| ResBlock4 | ||
| Global Avg. Pooling | Spatial-temporal pooling | |
| Category | Method | Accuracy (%) | F1-score (%) |
|---|---|---|---|
| Hand-crafted | MDMD [38] | 57.07 | 23.50 |
| | SP-FD [42] | 21.31 | 12.43 |
| | OF-FD [43] | 37.82 | 35.34 |
| | LBP-TOP [7] | 56.98 | 42.40 |
| | LOCP-TOP [27] | 45.53 | 42.25 |
| Deep-learning | CapsuleNet [40] | 56.80 | 34.70 |
| | MER-GCN [41] | 54.40 | 30.30 |
| | CNN + LSTM [39] | 60.98 | 32.50 |
| | SOFTNe [44] | 24.10 | 20.22 |
| | Concat-CNN [45] | 25.05 | 20.19 |
| | LSSNet [46] | 37.70 | 32.50 |
| | AU_GCN_CUR | 64.11 | 42.93 |
| Emotion | Mean A | Mean B | Diff (A-B) | p-value | Cohen’s d |
|---|---|---|---|---|---|
| Anger | 3.6931 | 3.8232 | -0.1301 | 0.207 | 0.658 |
| Disgust | 3.6362 | 3.8293 | -0.1931 | 0.096 | 0.739 |
| Fear | 3.5305 | 3.8150 | -0.2846 | 0.013 | 0.723 |
| Happiness | 3.4939 | 3.5874 | -0.0935 | 0.471 | 0.829 |
| Sadness | 3.5528 | 3.7378 | -0.1850 | 0.146 | 0.811 |
| Surprise | 3.6667 | 3.5854 | 0.0813 | 0.523 | 0.813 |
| Overall | 3.4463 | 3.7027 | -0.2564 | 0.005 | 0.048 |
| Dimension | Mean A | Mean B | Diff (A-B) | p-value | Cohen’s d |
|---|---|---|---|---|---|
| Anger | 3.6931 | 3.8232 | -0.1301 | 0.068 | 0.658 |
| Disgust | 3.6362 | 3.8293 | -0.1931 | 0.008 | 0.739 |
| Fear | 3.5305 | 3.8150 | -0.2846 | <0.001 | 0.723 |
| Happiness | 3.4939 | 3.5874 | -0.0935 | 0.325 | 0.829 |
| Sadness | 3.5528 | 3.7378 | -0.1850 | 0.044 | 0.811 |
| Surprise | 3.6667 | 3.5854 | 0.0813 | 0.425 | 0.813 |
| Dimension | Mean A | Mean B | Diff (A-B) | p-value | Cohen’s d |
|---|---|---|---|---|---|
| Clarity | 3.6545 | 3.8130 | -0.1585 | <0.001 | 0.678 |
| Naturalness | 3.5730 | 3.6220 | -0.0488 | 0.460 | 0.595 |
| Authenticity | 3.5600 | 3.7500 | -0.1950 | 0.004 | 0.686 |
| Variable | Coefficient | Std. Error | t-value | p-value |
|---|---|---|---|---|
| Intercept (Constant) | 2.446 | 0.232 | 10.558 | <0.001 |
| Recognition_Mean | 0.336 | 0.061 | 5.542 | <0.001 |
| Model Summary | R² = 0.277 | Adj. R² = 0.268 | F = 30.72 | p < 0.001 |
| Shapiro-Wilk Test | W = 0.9926 | p = 0.9235 | (Residuals are normally distributed) | |
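
The entries of the regression table above can be cross-checked against each other with a few lines of arithmetic. For a regression with a single predictor, F equals the square of the slope's t-value, and F = R² / (1 − R²) × (n − 2). The sketch below uses only the values printed in the table. The implied sample size is a derived figure under the single-predictor assumption, not a number stated in this section.

```python
# Values copied from the regression table; everything derived from them is a
# consistency check, not a figure reported by the authors.
t_slope = 5.542
r_squared = 0.277
f_reported = 30.72

# With one predictor, the model F statistic is the squared slope t-value.
f_from_t = t_slope ** 2

# Solve F = R^2 / (1 - R^2) * (n - 2) for the residual degrees of freedom.
df_residual = round(f_reported * (1 - r_squared) / r_squared)
n_implied = df_residual + 2

# Adjusted R^2 for one predictor: 1 - (1 - R^2) * (n - 1) / (n - 2).
adj_r_squared = 1 - (1 - r_squared) * (n_implied - 1) / (n_implied - 2)

print(f"F from t: {f_from_t:.2f}, implied n: {n_implied}, adj. R^2: {adj_r_squared:.3f}")
```

The squared t-value agrees with the reported F to within rounding, and the adjusted R² recomputed from the implied sample size rounds to the reported 0.268, so the table is internally consistent under these assumptions.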
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).