Submitted: 14 March 2025
Posted: 18 March 2025
Abstract
Keywords:
1. Introduction
2. Related Work
Audio Cues
Facial Expressions
Audio-Visual Expression Detection
Attention Mechanisms for Expression Recognition
3. Methodology
3.1. Overall Framework
3.2. Transformer Architecture
3.2.1. Self-Attention Mechanism
3.2.2. Multi-Head Attention
3.2.3. Positional Encoding
3.3. ExpTrm Architecture
3.3.1. Modal Encoders
Audio Encoder
Visual Encoder
3.3.2. Cross-Modal Attention Layers
3.3.3. Feature Fusion and Prediction
3.4. Training Procedure
3.5. Handling Missing Modalities
3.6. Remarks on ExpTrm
- Dynamic Feature Integration: The cross-modal attention layers enable dynamic weighting and integration of audio and visual features, allowing the model to prioritize relevant information based on the contextual interplay between modalities.
- Scalability: Leveraging the transformer architecture facilitates scalability to longer sequences and larger datasets, enhancing the model’s ability to capture complex temporal dependencies.
- Robustness to Missing Data: The cross-modal attention design provides a degree of resilience against missing or occluded modalities, so that performance degrades gradually rather than abruptly when one input stream is incomplete.
- State-of-the-Art Performance: Empirical evaluations demonstrate that ExpTrm achieves superior performance in predicting arousal and valence values, setting new benchmarks on the Aff-Wild2 dataset.
- Incorporation of Additional Modalities: Integrating other sensory inputs, such as physiological signals (e.g., heart rate, skin conductance), could further enhance emotion recognition accuracy.
- Real-Time Processing: Optimizing the model for real-time expression detection would expand its applicability in interactive systems and live monitoring scenarios.
- Personalization: Developing personalized models that adapt to individual differences in emotional expression could improve performance in diverse user populations.
- Robustness to Adversarial Conditions: Enhancing the model’s resilience to noisy or adversarial inputs would ensure reliable performance in real-world environments.
4. Experimental Setup
4.1. Dataset
4.2. Features and Preprocessing
Video
Audio
4.3. Transformer Model Architecture
4.4. Architecture and Implementation
- Audio and Video Encoders: The model comprises two separate encoders—one for audio and one for video. Each encoder consists of multiple self-attention layers, enabling the extraction of relevant features within each modality independently.
- Cross-Modal Attention Layers: Following the individual encoders, the architecture incorporates two cross-modal attention layers. These layers utilize dot-product attention mechanisms to facilitate the interaction between audio and visual features, allowing the model to dynamically prioritize and integrate information from both modalities.
- Feature Fusion: The outputs of the cross-modal attention layers are combined by a weighted sum, in which two trainable scalar parameters, one per modality, determine the contribution of each modality to the final fused feature representation.
- Prediction Layer: The fused features are fed into a dense output layer, which regresses the continuous valence and arousal values for each frame, enabling precise emotion detection.
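The data flow described in this list can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that both modalities are already encoded into frame-aligned features of equal dimension, that the dot-product attention is scaled by the square root of the feature dimension as in the standard transformer, and that learned query/key/value projections are omitted for brevity. The names `w_audio`, `w_video`, `W_out`, and `b_out` are hypothetical stand-ins for the trainable fusion scalars and the dense output layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, context_feats):
    """Dot-product attention with queries from one modality and
    keys/values from the other (scaling by sqrt(d) is an assumption)."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (T_q, T_c)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ context_feats                       # (T_q, d)

def fuse_and_predict(audio, video, w_audio, w_video, W_out, b_out):
    """Weighted-sum fusion of the two attended streams followed by a
    dense layer that regresses valence and arousal for every frame."""
    audio_attended = cross_modal_attention(audio, video)  # audio attends to video
    video_attended = cross_modal_attention(video, audio)  # video attends to audio
    fused = w_audio * audio_attended + w_video * video_attended
    return fused @ W_out + b_out                          # (T, 2)

# Toy usage with random, frame-aligned encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 16))  # 100 frames, 16-dim audio features
video = rng.standard_normal((100, 16))  # 100 frames, 16-dim visual features
W_out = rng.standard_normal((16, 2))    # dense layer: features -> (valence, arousal)
b_out = np.zeros(2)
preds = fuse_and_predict(audio, video, 0.5, 0.5, W_out, b_out)
```

In a trained model the two fusion scalars would be learned jointly with the rest of the network; here they are fixed at 0.5 purely for illustration.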
4.5. Baseline Models
- AffWildNet + Static (V): As introduced by Kollias et al. [47], this baseline model is pretrained on the Aff-Wild database [47], the predecessor to Aff-Wild2, focusing exclusively on the video modality. The static variant processes each image independently using the VGGFace architecture, which comprises multiple convolutional layers followed by a fully connected (FC) layer and an output layer for predictions. We experiment with two configurations for the FC layer: one with 2000 dimensions and another with 4096 dimensions. It is important to note that our proposed ExpTrm model builds upon the VGGFace architecture used in this baseline, thereby sharing identical preprocessing steps and leveraging similar feature extraction mechanisms.
- AffWildNet + Dynamic (V): Extending the static approach, Kollias et al. [47] propose a dynamic model that processes sequences of images to capture temporal dynamics inherent in facial expressions. This dynamic variant maintains the same FC layer dimension of 4096 and incorporates two Gated Recurrent Unit (GRU) layers, each consisting of 128 nodes, to model temporal dependencies across the sequence. We adopt the same sequence length of 100 frames used in our transformer-based approach, ensuring consistency in temporal resolution and enabling a fair comparison between the models.
- RNN (A): Serving as the audio modality baseline, this model employs two GRU layers stacked on top of the extracted Low-Level Descriptors (LLDs). The GRU layers are tasked with capturing temporal dependencies in the audio signal, facilitating the prediction of arousal and valence based solely on auditory cues. This unimodal approach allows us to assess the contribution of the audio modality in isolation.
- VGGFace-RNN (V): This model mirrors the architecture of AffWildNet + Dynamic (V), utilizing the VGGFace network for visual feature extraction followed by two GRU layers for temporal modeling. Additionally, it is trained on the Aff-Wild2 dataset, enhancing its ability to generalize across the diverse conditions present in real-world scenarios. This unimodal approach provides a benchmark for assessing the performance of the visual modality alone.
- VGGFace-RNN (A + V): Representing the audio-visual baseline, this model concatenates the outputs of the audio-only RNN (A) and video-only VGGFace-RNN (V) models. The concatenated features are then processed by two additional GRU layers, followed by a dense layer for the final predictions. Concatenation is a straightforward form of multimodal integration, so this model serves as the fusion baseline against which the more sophisticated cross-modal attention of ExpTrm is evaluated.
- NISL (V): Deng et al. [48] introduce pretrained Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) models trained within a multitask framework. These models are designed to simultaneously predict emotional categories, facial action units, and emotional attributes, leveraging external datasets to balance the distribution of emotions within Aff-Wild2. This comprehensive approach enhances the model’s capacity to generalize and accurately recognize a wide array of emotional expressions by leveraging shared representations across related tasks.
- M3T (A + V): Zhang et al. [49] propose a multimodal model that couples a 3D convolutional network with a bidirectional RNN. The model is trained in a multitask setting, jointly predicting emotional categories alongside valence attributes. Although an attention mechanism is placed on top of the recurrent layers to fuse audio and visual features, their empirical results indicate that simple concatenation of features performs better. Zhang et al. [49] note that the attention-based fusion model may not have fully converged, unlike the concatenation-based variant. In contrast, our preliminary experiments indicate that a cross-modal attention layer within ExpTrm offers tangible benefits over simple feature concatenation, improving expression detection accuracy.
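The concatenation-based fusion used by the VGGFace-RNN (A + V) baseline, together with the 100-frame sequence length shared by the sequence models, can be sketched as follows. This is an illustrative sketch only: the GRU layers themselves are omitted, the 128-dimensional audio stream and the zero-padding of the final window are assumptions, and the helper names `concat_fusion` and `to_windows` are hypothetical.

```python
import numpy as np

def concat_fusion(audio_out, video_out):
    """Frame-wise concatenation of (T, D_a) and (T, D_v) streams -> (T, D_a + D_v)."""
    if audio_out.shape[0] != video_out.shape[0]:
        raise ValueError("streams must be aligned to the same number of frames")
    return np.concatenate([audio_out, video_out], axis=-1)

def to_windows(features, window=100):
    """Split (T, D) frame-level features into (N, window, D) sequences.
    The last window is zero-padded (an assumption, not taken from the paper)."""
    num_frames, dim = features.shape
    num_windows = -(-num_frames // window)  # ceiling division
    padded = np.zeros((num_windows * window, dim), dtype=features.dtype)
    padded[:num_frames] = features
    return padded.reshape(num_windows, window, dim)

# Toy usage: 250 frames of 128-dim outputs from each unimodal model.
rng = np.random.default_rng(0)
audio_out = rng.standard_normal((250, 128))
video_out = rng.standard_normal((250, 128))
fused = concat_fusion(audio_out, video_out)  # shape (250, 256)
windows = to_windows(fused, window=100)      # shape (3, 100, 256)
```

Each 100-frame window would then be passed to the two additional GRU layers and the dense prediction layer described above.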
5. Experimental Results
5.1. Baseline Model Performance Analysis
5.2. Performance of the Proposed ExpTrm Model
5.3. Ablation Studies
Impact of Modality Absence
- Absence of Visual Modality (Audio-Only): When the visual modality is entirely absent (100% masking), the ExpTrm model’s performance degrades significantly, with CCC values for valence and arousal approaching zero. This substantial decline indicates that, within the Aff-Wild2 dataset, visual cues are paramount for accurate emotion recognition. The absence of facial expressions severely hampers the model’s ability to infer emotional states based solely on audio, highlighting the limited expressiveness of audio cues in this context.
- Partial Absence of Visual Modality: As the proportion of missing visual data increases, there is a corresponding linear decline in performance. This trend underscores the model’s reliance on visual information for maintaining prediction accuracy. However, even with partial visual data loss, the model retains a moderate level of performance, demonstrating some resilience to incomplete visual inputs.
- Absence of Audio Modality (Visual-Only): In contrast, the complete absence of the audio modality results in only a marginal decrease in performance, approximately 2%, for both valence and arousal predictions. This slight decline suggests that while audio cues contribute to emotion recognition, the visual modality carries the bulk of the informative signals. The model’s ability to maintain near-baseline performance in the absence of audio indicates that visual features are sufficiently robust for accurate emotion detection in most cases.
- Partial Absence of Audio Modality: When the audio data is partially missing, the model exhibits a gradual decrease in performance, albeit much less pronounced compared to the loss of visual data. This observation highlights the model’s capacity to leverage the remaining audio information to supplement the visual cues, thereby mitigating the impact of incomplete audio inputs.
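The masking protocol and the evaluation metric referred to above can be sketched in a few lines. The concordance correlation coefficient (CCC) follows its standard definition; the masking routine is only an assumption about how a given proportion of one modality might be removed (here, by zeroing randomly selected frames), and the helper name `mask_frames` is hypothetical.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.
    Undefined (division by zero) when both inputs are the same constant."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    cov = ((pred - mean_p) * (gold - mean_g)).mean()
    return 2.0 * cov / (pred.var() + gold.var() + (mean_p - mean_g) ** 2)

def mask_frames(features, fraction, rng):
    """Return a copy of (T, D) features with a random `fraction` of frames
    set to zero (zero-masking is an assumption, not taken from the paper)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    masked = features.copy()
    num_masked = int(round(fraction * features.shape[0]))
    idx = rng.choice(features.shape[0], size=num_masked, replace=False)
    masked[idx] = 0.0
    return masked

# Toy usage: mask half, then all, of the frames of one modality.
rng = np.random.default_rng(0)
features = np.ones((100, 8))
half = mask_frames(features, 0.5, rng)
full = mask_frames(features, 1.0, rng)
```

Sweeping `fraction` from 0 to 1 for one modality while leaving the other intact, and reporting CCC at each step, would reproduce the kind of degradation curve discussed in this ablation.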
Robustness and Adaptability
Overall Implications
5.4. Statistical Significance and Comparative Analysis
5.5. Discussion
Valence and Arousal Predictions
Multimodal Integration Benefits
Comparison with External State-of-the-Art Models
Model Complexity and Efficiency
6. Conclusion and Future Directions
- Extension to Expression Classification: While the current study focuses on continuous valence-arousal estimation, we plan to extend ExpTrm to handle categorical expression classification. This extension will enable the model to recognize discrete emotional states such as happiness, sadness, anger, and surprise, providing a more comprehensive understanding of user emotions.
- Incorporation of Facial Landmarks: To harness additional visual information, we intend to integrate facial landmarks as supplementary inputs. By incorporating precise facial feature points, ExpTrm can achieve a finer-grained analysis of facial expressions, potentially improving the detection of subtle emotional nuances.
- Robust Training with Missing Data: Recognizing that real-world scenarios often involve incomplete or missing data, we will investigate various training strategies to enhance ExpTrm’s robustness. Techniques such as data augmentation, imputation methods, and specialized loss functions will be explored to ensure the model maintains high performance even when one of the modalities is partially or entirely unavailable.
- Integration of Additional Modalities: To further enrich the model’s emotional understanding, we plan to incorporate additional modalities such as textual data and physiological signals. Integrating text from speech transcripts or contextual information can provide deeper insights into the user’s emotional state, while physiological data like heart rate and skin conductance can offer objective measures of emotional arousal.
- Real-Time Implementation and Optimization: To facilitate the deployment of ExpTrm in interactive systems, we aim to optimize the model for real-time processing. This will involve streamlining the architecture, reducing computational overhead, and ensuring efficient memory usage without compromising accuracy.
- Personalization and Adaptability: Emotions are inherently personal and can vary significantly across individuals. Future research will focus on developing personalized models that adapt to individual differences in emotional expression. Techniques such as transfer learning and adaptive algorithms will be employed to tailor ExpTrm to specific users, enhancing its accuracy and relevance.
- Enhanced Multimodal Fusion Techniques: While cross-modal attention has proven effective, we plan to explore more sophisticated fusion strategies that can dynamically adjust the integration of modalities based on contextual factors. Approaches such as hierarchical attention mechanisms and gated fusion networks will be investigated to further improve the synergy between audio and visual inputs.
- Comprehensive Evaluation on Diverse Datasets: To validate the generalizability of ExpTrm, we will conduct evaluations on a variety of datasets encompassing different cultures, languages, and recording conditions. This will ensure that the model performs consistently across diverse populations and real-world scenarios.
References
- N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T.F. Quatieri, “A review of depression and suicide risk assessment using speech analysis,” Speech Communication, vol. 71, pp. 10–49, July 2015. [CrossRef]
- C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette, “Fear-type emotion recognition for future audio-based surveillance systems,” Speech Communication, vol. 50, no. 6, pp. 487–503, June 2008. [CrossRef]
- D. Litman and K. Forbes-Riley, “Predicting student emotions in computer-human tutoring dialogues,” in ACM Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 2004, pp. 1–8. [CrossRef]
- J.A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, December 1980. [CrossRef]
- Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011. [CrossRef]
- R. Cowie and R.R. Cornelius, “Describing the emotional states that are expressed in speech,” Speech Communication, vol. 40, no. 1-2, pp. 5–32, April 2003. [CrossRef]
- Evangelos Sariyanidi, Hatice Gunes, and Andrea Cavallaro, “Automatic analysis of facial affect: A survey of registration, representation, and recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 6, pp. 1113–1133, 2015. [CrossRef]
- Dimitrios Kollias, Mihalis A Nicolaou, Irene Kotsia, Guoying Zhao, and Stefanos Zafeiriou, “Recognition of affect in the wild using deep neural networks,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on. IEEE, 2017, pp. 1972–1979. [CrossRef]
- H. Gunes and M. Piccardi, “A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior,” in 18th International Conference on Pattern Recognition (ICPR 2006), Hong Kong, China, August 2006, vol. 1, pp. 1148–1153. [CrossRef]
- M. Coulson, “Attributing emotion to static body postures: Recognition accuracy, confusions, and viewpoint dependence,” Journal of Nonverbal Behavior, vol. 28, no. 2, pp. 117–139, June 2004. [CrossRef]
- H.P. Martinez, Y. Bengio, and G.N. Yannakakis, “Learning deep physiological models of affect,” IEEE Computational Intelligence Magazine, vol. 8, no. 2, pp. 20–33, May 2013. [CrossRef]
- Kyung Hwan Kim, Seok Won Bang, and Sang Ryong Kim, “Emotion recognition system using short-term monitoring of physiological signals,” Medical and biological engineering and computing, vol. 42, no. 3, pp. 419–427, 2004. [CrossRef]
- A. Mehrabian, “Communication without words,” in Communication Theory, C.D. Mortensen, Ed., pp. 193–200. Transaction Publishers, New Brunswick, NJ, USA, December 2007.
- S. Parthasarathy and C. Busso, “Jointly predicting arousal, valence and dominance with multi-task learning,” in Interspeech 2017, Stockholm, Sweden, August 2017, pp. 1103–1107.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
- Dimitrios Kollias and Stefanos Zafeiriou, “Aff-wild2: Extending the aff-wild database for affect recognition,” arXiv preprint arXiv:1811.07770, 2018.
- J. Lee, M. Reyes, T. Smyser, Y. Liang, and K. Thornburg, “SAfety VEhicles using adaptive interface technology (task 5) final report: Phase 1,” Technical report, The University of Iowa, Iowa City, IA, USA, November 2004.
- J.R.J. Fontaine, K.R. Scherer, E.B. Roesch, and P.C. Ellsworth, “The world of emotions is not two-dimensional,” Psychological Science, vol. 18, no. 12, pp. 1050–1057, December 2007. [CrossRef]
- Z. Aldeneh and E. Mower Provost, “Using regional saliency for speech emotion recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, LA, USA, March 2017, pp. 2741–2745. [CrossRef]
- B. Schuller, S. Steidl, and A. Batliner, “The INTERSPEECH 2009 emotion challenge,” in Interspeech 2009 - Eurospeech, Brighton, UK, September 2009, pp. 312–315.
- B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Muller, and S. Narayanan, “The INTERSPEECH 2010 paralinguistic challenge,” in Interspeech 2010, Makuhari, Japan, September 2010, pp. 2794–2797.
- Z. Zeng, M. Pantic, G.I. Roisman, and T.S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, January 2009. [CrossRef]
- Caifeng Shan, Shaogang Gong, and Peter W McOwan, “Facial expression recognition based on local binary patterns: A comprehensive study,” Image and vision Computing, vol. 27, no. 6, pp. 803–816, 2009. [CrossRef]
- Ping Liu, Shizhong Han, Zibo Meng, and Yan Tong, “Facial expression recognition via a boosted deep belief network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1805–1812.
- Heechul Jung, Sihaeng Lee, Junho Yim, Sunjeong Park, and Junmo Kim, “Joint fine-tuning in deep neural networks for facial expression recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2983–2991.
- Ya Chang, Changbo Hu, Rogerio Feris, and Matthew Turk, “Manifold based analysis of facial expression,” Image and Vision Computing, vol. 24, no. 6, pp. 605–614, 2006. [CrossRef]
- . [CrossRef]
- Ali Mollahosseini, David Chan, and Mohammad H Mahoor, “Going deeper in facial expression recognition using deep neural networks,” in 2016 IEEE Winter conference on applications of computer vision (WACV). IEEE, 2016, pp. 1–10. [CrossRef]
- Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor, “Affectnet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017. [CrossRef]
- Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 292–301. [CrossRef]
- Xi Ouyang, Shigenori Kawaai, Ester Gue Hua Goh, Shengmei Shen, Wan Ding, Huaiping Ming, and Dong-Yan Huang, “Audio-visual emotion recognition using deep transfer learning and multiple temporal models,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017, pp. 577–582. [CrossRef]
- Valentin Vielzeuf, Stéphane Pateux, and Frédéric Jurie, “Temporal multimodal fusion for video emotion classification in the wild,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017, pp. 569–576. [CrossRef]
- Panagiotis Tzirakis, George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and Stefanos Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017. [CrossRef]
- Mingyi Chen, Xuanji He, Jing Yang, and Han Zhang, “3-d convolutional recurrent neural networks with attention model for speech emotion recognition,” IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440–1444, 2018. [CrossRef]
- Wang Xiaohua, Peng Muzi, Pan Lijuan, Hu Min, Jin Chunhua, and Ren Fuji, “Two-level attention with two-stage multi-task learning for facial emotion recognition,” Journal of Visual Communication and Image Representation, vol. 62, pp. 217–225, 2019. [CrossRef]
- Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann, “Conversational memory network for emotion recognition in dyadic dialogue videos,” in Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting. NIH Public Access, 2018, vol. 2018, p. 2122. [CrossRef]
- Seyedmahdad Mirsamadi, Emad Barsoum, and Cha Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2227–2231. [CrossRef]
- Yuanchao Li, Tianyu Zhao, and Tatsuya Kawahara, “Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning,” 2019.
- Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, AmirAli Bagher Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque, “Integrating multimodal information in large pretrained transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020, pp. 2359–2369, Association for Computational Linguistics. [CrossRef]
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- Georgios Paraskevopoulos, Srinivas Parthasarathy, Aparna Khare, and Shiva Sundaram, “Multimodal and multiresolution speech recognition with transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020, pp. 2381–2387, Association for Computational Linguistics. [CrossRef]
- Srinivas Parthasarathy and Shiva Sundaram, “Training strategies to handle missing modalities for audio-visual expression recognition,” arXiv preprint arXiv:2010.00734, 2020.
- Dimitrios Kollias and Stefanos Zafeiriou, “Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface,” arXiv preprint arXiv:1910.04855, 2019.
- Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, “Ssd: Single shot multibox detector,” Lecture Notes in Computer Science, pp. 21–37, 2016. [CrossRef]
- B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, “The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism,” in Interspeech 2013, Lyon, France, August 2013, pp. 148–152.
- Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman, “Deep face recognition,” in British Machine Vision Conference, 2015.
- Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou, “Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,” International Journal of Computer Vision, pp. 1–23, 2019. [CrossRef]
- Didan Deng, Zhaokang Chen, and Bertram E. Shi, “Multitask emotion recognition with incomplete labels,” 2020.
- Yuan-Hang Zhang, Rulin Huang, Jiabei Zeng, Shiguang Shan, and Xilin Chen, “m3t: Multi-modal continuous valence-arousal estimation in the wild,” 2020.
- K. Sridhar, S. Parthasarathy, and C. Busso, “Role of regularization in the prediction of valence from speech,” in Interspeech 2018, Hyderabad, India, September 2018, pp. 941–945.
- Anson Bastos, Abhishek Nadgeri, Kuldeep Singh, Isaiah Onando Mulang, Saeedeh Shekarpour, Johannes Hoffart, and Manohar Kaul. 2021. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network. In Proceedings of the Web Conference 2021. 1673–1685. [CrossRef]
- Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. 2019. Look before You Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management CIKM. 729–738. [CrossRef]
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171–4186.
- Endri Kacupaj, Kuldeep Singh, Maria Maleshkova, and Jens Lehmann. 2022. An Answer Verbalization Dataset for Conversational Question Answerings over Knowledge Graphs. arXiv preprint arXiv:2208.06734 (2022).
- Magdalena Kaiser, Rishiraj Saha Roy, and Gerhard Weikum. 2021. Reinforcement Learning from Reformulations In Conversational Question Answering over Knowledge Graphs. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 459–469.
- Yunshi Lan, Gaole He, Jinhao Jiang, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. 2021. A Survey on Complex Knowledge Base Question Answering: Methods, Challenges and Solutions. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. International Joint Conferences on Artificial Intelligence Organization, 4483–4491. Survey Track.
- Yunshi Lan and Jing Jiang. 2021. Modeling transitions of focal entities for conversational knowledge base question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7871–7880.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
- Pierre Marion, Paweł Krzysztof Nowak, and Francesco Piccinno. 2021. Structured Context and High-Coverage Grammar for Conversational Question Answering over Knowledge Graphs. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021).
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015. [CrossRef]
- Li Deng and Dong Yu. Deep Learning: Methods and Applications. NOW Publishers, May 2014. URL https://www.microsoft.com/en-us/research/publication/deep-learning-methods-and-applications/.
- Eric Makita and Artem Lenskiy. A movie genre prediction based on Multivariate Bernoulli model and genre correlations. (May), mar 2016a. URL http://arxiv.org/abs/1604.08608.
- Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
- J Ngiam, A Khosla, and M Kim. Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011. URL http://ai.stanford.edu/~ang/papers/icml11-MultimodalDeepLearning.pdf.
- Deli Pei, Huaping Liu, Yulong Liu, and Fuchun Sun. Unsupervised multimodal feature learning for semantic image segmentation. In The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE, aug 2013. ISBN 978-1-4673-6129-3. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6706748. [CrossRef]
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-Shot Learning Through Cross-Modal Transfer. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 26, pp. 935–943. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5027-zero-shot-learning-through-cross-modal-transfer.pdf.
- Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, and Shuicheng Yan. Enhancing video-language representations with structural spatio-temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024a. [CrossRef]
- Hao Fei, Yafeng Ren, and Donghong Ji. Retrofitting structure-aware transformer language model for end tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2151–2161, 2020a.
- Shengqiong Wu, Hao Fei, Fei Li, Meishan Zhang, Yijiang Liu, Chong Teng, and Donghong Ji. Mastering the explicit opinion-role interaction: Syntax-aided neural transition system for unified opinion role labeling. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 11513–11521, 2022.
- Wenxuan Shi, Fei Li, Jingye Li, Hao Fei, and Donghong Ji. Effective token graph modeling using a novel labeling strategy for structured sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4232–4241, 2022.
- Hao Fei, Yue Zhang, Yafeng Ren, and Donghong Ji. Latent emotion memory for multi-label emotion classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7692–7699, 2020b.
- Fengqi Wang, Fei Li, Hao Fei, Jingye Li, Shengqiong Wu, Fangfang Su, Wenxuan Shi, Donghong Ji, and Bo Cai. Entity-centered cross-document relation extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9871–9881, 2022.
- Ling Zhuang, Hao Fei, and Po Hu. Knowledge-enhanced event relation extraction via event ontology prompt. Inf. Fusion, 100:101919, 2023. [CrossRef]
- Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, 2018.
- Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. arXiv preprint arXiv:2305.11719, 2023a.
- Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357, 2024.
- Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017.
- Hao Fei, Shengqiong Wu, Jingye Li, Bobo Li, Fei Li, Libo Qin, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Lasuie: Unifying information extraction with latent adaptive structure-aware generative language model. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022, pages 15460–15475, 2022a.
- Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Opinion word expansion and target extraction through double propagation. Computational linguistics, 37(1):9–27, 2011. [CrossRef]
- Hao Fei, Yafeng Ren, Yue Zhang, Donghong Ji, and Xiaohui Liang. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in Bioinformatics, 22(3), 2021. [CrossRef]
- Shengqiong Wu, Hao Fei, Wei Ji, and Tat-Seng Chua. Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2593–2608, 2023b.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
- Hao Fei, Fei Li, Bobo Li, and Donghong Ji. Encoder-decoder based unified semantic role labeling with label-aware syntax. In Proceedings of the AAAI conference on artificial intelligence, pages 12794–12802, 2021a.
- Hao Fei, Shengqiong Wu, Yafeng Ren, Fei Li, and Donghong Ji. Better combine them together! integrating syntactic constituency and dependency representations for semantic role labeling. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 549–559, 2021b.
- Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. Reasoning implicit sentiment with chain-of-thought prompting. arXiv preprint arXiv:2305.11255, 2023a.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. URL https://aclanthology.org/N19-1423.
- Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. CoRR, abs/2309.05519, 2023c.
- Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the International Conference on Machine Learning, 2024b.
- Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari, et al. Agribot: agriculture-specific question answer system. IndiaRxiv, 2019.
- Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024c.
- Mihir Momaya, Anjnya Khanna, Jessica Sadavarte, and Manoj Sankhe. Krushi–the farmer chatbot. In 2021 International Conference on Communication Information and Computing Technology (ICCICT), pages 1–6. IEEE, 2021.
- Hao Fei, Fei Li, Chenliang Li, Shengqiong Wu, Jingye Li, and Donghong Ji. Inheriting the wisdom of predecessors: A multiplex cascade framework for unified aspect-based sentiment analysis. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, pages 4096–4103, 2022b.
- Shengqiong Wu, Hao Fei, Yafeng Ren, Donghong Ji, and Jingye Li. Learn from syntax: Improving pair-wise aspect and opinion terms extraction with rich syntactic knowledge. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 3957–3963, 2021.
- Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua, Donghong Ji, and Fei Li. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5923–5934, 2023.
- Hao Fei, Qian Liu, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5980–5994, 2023b.
- Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2024, 2024d.
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.
- Abbott Chen and Chai Liu. Intelligent commerce facilitates education technology: The platform and chatbot for the taiwan agriculture service. International Journal of e-Education, e-Business, e-Management and e-Learning, 11:1–10, 2021.
- Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards semantic equivalence of tokenization in multimodal llm. arXiv preprint arXiv:2406.05127, 2024.
- Jingye Li, Kang Xu, Fei Li, Hao Fei, Yafeng Ren, and Donghong Ji. MRN: A locally and globally mention-based reasoning network for document-level relation extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1359–1370, 2021.
- Hao Fei, Shengqiong Wu, Yafeng Ren, and Meishan Zhang. Matching structure for dual learning. In Proceedings of the International Conference on Machine Learning, ICML, pages 6373–6391, 2022c.
- Hu Cao, Jingye Li, Fangfang Su, Fei Li, Hao Fei, Shengqiong Wu, Bobo Li, Liang Zhao, and Donghong Ji. OneEE: A one-stage framework for fast overlapping and nested event extraction. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1953–1964, 2022.
- Isakwisa Gaddy Tende, Kentaro Aburada, Hisaaki Yamaba, Tetsuro Katayama, and Naonobu Okazaki. Proposal for a crop protection information system for rural farmers in tanzania. Agronomy, 11(12):2411, 2021.
- Hao Fei, Yafeng Ren, and Donghong Ji. Boundaries and edges rethinking: An end-to-end neural model for overlapping entity relation extraction. Information Processing & Management, 57(6):102311, 2020c.
- Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10965–10973, 2022.
- Mohit Jain, Pratyush Kumar, Ishita Bhansali, Q Vera Liao, Khai Truong, and Shwetak Patel. Farmchat: a conversational agent to answer farmer queries. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(4):1–22, 2018b.
- Shengqiong Wu, Hao Fei, Hanwang Zhang, and Tat-Seng Chua. Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 79240–79259, 2023d.
- Hao Fei, Tat-Seng Chua, Chenliang Li, Donghong Ji, Meishan Zhang, and Yafeng Ren. On the robustness of aspect-based sentiment analysis: Rethinking model, data, and training. ACM Transactions on Information Systems, 41(2):50:1–50:32, 2023c.
- Yu Zhao, Hao Fei, Yixin Cao, Bobo Li, Meishan Zhang, Jianguo Wei, Min Zhang, and Tat-Seng Chua. Constructing holistic spatio-temporal scene graph for video semantic role labeling. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5281–5291, 2023a.
- Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14734–14751, 2023e.
- Hao Fei, Yafeng Ren, Yue Zhang, and Donghong Ji. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Transactions on Neural Networks and Learning Systems, 34(9):5544–5556, 2023d.
- Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Generating visual spatial description via holistic 3D scene understanding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7960–7977, 2023b.
- Bart Van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and fuel: Frameworks for deep learning. arXiv preprint arXiv:1506.00619, 2015.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
- Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193, 2012.

| Model | CCC (valence) | CCC (arousal) |
|---|---|---|
| AffWildNet + Static (V) (2000) [47] | 0.176 | 0.198 |
| AffWildNet + Static (V) (4096) [47] | 0.244 | 0.297 |
| AffWildNet + Dynamic (V) | 0.315 | 0.357 |
| RNN (A) | 0.105 | 0.168 |
| VGGFace-RNN (V) | 0.379 | 0.481 |
| VGGFace-RNN (A+V) | 0.344 | 0.506 |
| NISL (V) [48] | 0.373 | 0.513 |
| T (A+V) [49] | 0.320 | 0.550 |
| ExpTrm (V) | 0.381 | |
| ExpTrm (A+V) | | |
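
The two score columns in the table are read here as the concordance correlation coefficient (CCC) for valence and arousal. As a minimal sketch, assuming the standard definition by Lin, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²) with population moments, such a score could be computed as follows. The function name `concordance_cc` is hypothetical, and the evaluation protocol behind the reported numbers (for example, per-video versus pooled frames) is not stated in this section.

```python
from typing import Sequence


def concordance_cc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient (population moments).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")

    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n

    # Biased (divide-by-n) variance and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are the same constant; treating this as perfect
        # agreement is a convention chosen for this sketch.
        return 1.0
    return 2.0 * cov / denom


if __name__ == "__main__":
    # A constant offset lowers CCC even though the correlation is perfect.
    print(concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Unlike Pearson correlation, CCC penalizes systematic shifts and scale differences between predictions and annotations, which is why a model can correlate well with the labels and still score low in a table like this one.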

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).