Submitted: 23 September 2024
Posted: 24 September 2024
Abstract
Keywords:
1. Introduction
2. Related Work
3. Enhanced Multimodal Feature Learning (EMFL)
3.1. Gaussian Sampling Layer
3.2. Enhanced Multimodal Feature Learning Algorithm
3.2.1. Representation Learning Phase
3.2.2. Confounding Representation Matching Phase
3.2.3. Confounding Representation Elimination Phase
3.3. Algorithmic Workflow of EMFL
- Initialization: Begin with a pre-trained discriminative neural network, comprising the representation learner and the classifier.
- Representation Learning Phase: Fine-tune the parameters of the representation learner and the classifier by minimizing the primary loss function (Equation 2), thereby learning the mixed representation without yet accounting for confounders.
- Confounding Representation Matching Phase: Introduce the confounding representation learner and optimize it by minimizing the loss function in Equation 3. This phase isolates the identity-related confounding dimensions within the representation.
- Confounding Representation Elimination Phase: Inject Gaussian noise into the identified confounding dimensions and retrain the classifier by minimizing the loss function in Equation 4, so that the classifier learns to ignore the noise-corrupted confounders.
- Iteration: Repeat steps 2 through 4 until convergence is achieved, i.e., when further iterations do not yield significant improvements in performance.
- Final Model: The resultant model, now robust against confounding factors, is evaluated on unseen test data to assess its generalization capabilities. A minimal code sketch of this three-phase loop follows the list.
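Since Equations 2-4 are not reproduced here, the following is only a minimal PyTorch sketch of the loop under stated assumptions: the primary loss is cross-entropy, confounder matching uses an identity-prediction loss with an L1 sparsity penalty on a learned gate over representation dimensions, and elimination injects Gaussian noise scaled by that gate. All module and variable names (`EMFL`, `train_emfl`, `gate`, `lam`, `noise_std`) are illustrative, not from the paper.

```python
import torch
import torch.nn as nn


class EMFL(nn.Module):
    """Illustrative model: representation learner, classifier, and a sparse
    gate that marks identity-related (confounding) dimensions."""

    def __init__(self, in_dim, rep_dim, n_classes, n_identities):
        super().__init__()
        self.representation = nn.Sequential(nn.Linear(in_dim, rep_dim), nn.ReLU())
        self.classifier = nn.Linear(rep_dim, n_classes)
        self.gate = nn.Parameter(torch.zeros(rep_dim))  # confounding-dimension selector
        self.identity_head = nn.Linear(rep_dim, n_identities)

    def forward(self, x, noise_std=0.0):
        h = self.representation(x)
        g = torch.sigmoid(self.gate)
        if noise_std > 0:
            # Elimination phase: drown out the dimensions the gate marks as
            # identity-related with Gaussian noise.
            h = h + g * noise_std * torch.randn_like(h)
        return self.classifier(h), self.identity_head(g * h)


def train_emfl(model, loader, lam=1e-3, noise_std=1.0, rounds=3, lr=1e-3):
    """`loader` is assumed to yield (features, sentiment_label, identity_label)."""
    ce = nn.CrossEntropyLoss()
    for _ in range(rounds):  # repeat phases 1-3 until performance plateaus
        # Phase 1: representation learning (task loss only, stand-in for Eq. 2).
        opt = torch.optim.Adam(
            list(model.representation.parameters()) + list(model.classifier.parameters()), lr=lr)
        for x, y, _ in loader:
            opt.zero_grad()
            ce(model(x)[0], y).backward()
            opt.step()
        # Phase 2: confounder matching (sparse gate + identity prediction, stand-in for Eq. 3).
        opt = torch.optim.Adam([model.gate] + list(model.identity_head.parameters()), lr=lr)
        for x, _, ident in loader:
            opt.zero_grad()
            loss = ce(model(x)[1], ident) + lam * torch.sigmoid(model.gate).sum()
            loss.backward()
            opt.step()
        # Phase 3: confounder elimination (retrain classifier under noise, stand-in for Eq. 4).
        opt = torch.optim.Adam(model.classifier.parameters(), lr=lr)
        for x, y, _ in loader:
            opt.zero_grad()
            ce(model(x, noise_std=noise_std)[0], y).backward()
            opt.step()
    return model
```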
3.4. Advantages of EMFL
- Robustness to Confounders: By explicitly identifying and mitigating the influence of identity-related confounding dimensions, EMFL ensures that the sentiment predictions are not biased by irrelevant identity features.
- Versatility: EMFL is model-agnostic and can be integrated into a wide range of deep learning architectures, including CNNs, LSTMs, and autoencoders, as well as pre-trained models, thereby broadening its applicability.
- Improved Generalization: By focusing on sentiment-relevant features, EMFL enhances the model’s ability to generalize to new, unseen data, thereby improving overall prediction performance.
- Scalability: The framework is scalable to large datasets with numerous features and identities, making it suitable for real-world applications where data complexity is high.
- Ease of Integration: The three-phase approach of EMFL allows for straightforward integration into existing training pipelines without necessitating significant architectural modifications.
3.5. Implementation Considerations
- Hyperparameter Tuning: Selecting appropriate values for the two key hyperparameters, the sparsity weight governing the confounding-dimension selection and the magnitude of the introduced noise, is crucial. Techniques such as cross-validation and grid search can be employed to identify suitable values (see the sketch following this list).
- Computational Overhead: Introducing additional components such as the confounding representation learner and the Gaussian Sampling Layer (GSL) may increase computational requirements. Efficient implementation strategies and hardware acceleration can mitigate potential performance bottlenecks.
- Data Quality and Representation: The effectiveness of EMFL is contingent upon the quality and representativeness of the input features. Ensuring comprehensive and relevant feature extraction across all modalities is essential for the successful identification of confounding dimensions.
- Scalability to Multiple Confounders: While EMFL is designed to handle identity-related confounders, extending the framework to account for multiple confounder types may require additional modifications and introduce extra complexity.
- Evaluation Metrics: Employing appropriate evaluation metrics that accurately reflect the model’s ability to generalize and its robustness to confounders is vital. Metrics such as accuracy, F1-score, and area under the ROC curve (AUC) can provide comprehensive insights into model performance.
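As one concrete (hypothetical) instantiation of the tuning advice above, the snippet below grid-searches the two hyperparameters over cross-validated splits. `fold_accuracy` is a stub to be wired to the training loop sketched earlier, and the candidate grids are arbitrary choices, not values from the paper.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import KFold


def fold_accuracy(lam, noise_std, X_tr, y_tr, X_va, y_va):
    # Stub: train EMFL with this (lam, noise_std) on the training fold and
    # return validation accuracy; wire in train_emfl(...) from the sketch above.
    return 0.0


def grid_search(X, y, lams=(1e-4, 1e-3, 1e-2), stds=(0.5, 1.0, 2.0), k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = {}
    for lam, std in product(lams, stds):
        accs = [fold_accuracy(lam, std, X[tr], y[tr], X[va], y[va])
                for tr, va in kf.split(X)]
        scores[(lam, std)] = float(np.mean(accs))
    best = max(scores, key=scores.get)
    return best, scores
```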
4. Experiments
Results for CNN and EMFL-CNN, evaluated within dataset (MOSI) and across datasets (YouTube, MOUD):

| Setting | Modalities | MOSI: CNN | MOSI: EMFL-CNN | YouTube: CNN | YouTube: EMFL-CNN | MOUD: CNN | MOUD: EMFL-CNN |
|---|---|---|---|---|---|---|---|
| Single Modality | Text | 0.678 | 0.732 | 0.605 | 0.657 | 0.522 | 0.569 |
| Single Modality | Audio | 0.588 | 0.618 | 0.441 | 0.564 | 0.455 | 0.549 |
| Single Modality | Video | 0.572 | 0.636 | 0.492 | 0.549 | 0.555 | 0.548 |
| Double Modalities | Text+Audio | 0.687 | 0.725 | 0.642 | 0.652 | 0.515 | 0.574 |
| Double Modalities | Text+Video | 0.706 | 0.730 | 0.642 | 0.667 | 0.542 | 0.574 |
| Double Modalities | Audio+Video | 0.661 | 0.621 | 0.452 | 0.559 | 0.533 | 0.554 |
| All Modalities | Text+Audio+Video | 0.715 | 0.730 | 0.611 | 0.667 | 0.531 | 0.574 |
4.1. Model Architectures
4.2. Datasets Utilized
4.3. Feature Extraction Techniques
4.4. Experimental Setup
- MOSI Test Set: Comprising 546 utterances from the remaining 31 individuals in the MOSI dataset.
- YouTube Test Set: Consisting of 195 utterances from 47 unique individuals in the YouTube dataset.
- MOUD Test Set: Encompassing 450 utterances from 55 individuals in the MOUD dataset.
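The test sets above are speaker-disjoint: all of an individual's utterances fall on one side of the split. A minimal sketch of constructing such a split (not the authors' code; dummy data stands in for the real features) using scikit-learn's `GroupShuffleSplit`:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_utt = 1000                                # dummy corpus size
X = rng.normal(size=(n_utt, 128))           # dummy utterance features
y = rng.integers(0, 2, size=n_utt)          # dummy sentiment labels
speakers = rng.integers(0, 93, size=n_utt)  # dummy speaker ids

# Hold out ~30% of *speakers*; every utterance of a held-out speaker
# lands in the test set, never in training.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=speakers))
assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])
```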
4.5. Experimental Results
4.5.1. Within-Dataset Evaluation
4.5.2. Cross-Dataset Evaluation
Modality-Specific Insights
- Semantic Richness: Sentiment is captured more accurately through textual expressions, which provide explicit lexical cues, than through the more ambiguous visual or acoustic signals.
- Language Independence from Identity: Textual information is less likely to be influenced by an individual’s identity, reducing the risk of confounding factors impacting sentiment prediction.
5. Conclusions and Future Work
5.1. Conclusion
5.2. Future Work
References
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).