Submitted:
23 September 2024
Posted:
24 September 2024
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Multimodal Fusion Research
2.2. Movie Genre Classification Application
3. Methods
3.1. Gated Multimodal Unit for Multimodal Fusion
3.2. Text Representation
- n-gram
- Inspired by the methodology proposed by Kanaris and Stamatatos [33], we employed the n-gram approach for text representation. Despite its simplicity, the n-gram model serves as a robust baseline, effectively capturing local word dependencies and sequences within the text data.
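As an illustration, such an n-gram baseline can be assembled in a few lines with scikit-learn; the (1, 3) n-gram range and TF-IDF weighting below are illustrative assumptions rather than the exact configuration used in our experiments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy plot outlines standing in for the real corpus.
plots = [
    "A detective hunts a serial killer in a rainy city.",
    "Two friends embark on a road trip across the country.",
]

# Illustrative vectorizer: unigrams through trigrams, capped vocabulary.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=20000)
X_text = vectorizer.fit_transform(plots)  # sparse (n_movies, n_ngram_features)
print(X_text.shape)
```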
- Word2Vec
- Word2Vec is an unsupervised learning framework that generates dense vector representations for words by analyzing their contextual co-occurrences [46]. These vector embeddings are capable of capturing complex semantic and syntactic relationships, enabling the model to perform operations such as word analogies through vector arithmetic. In our approach, each movie is represented by the average of the word vectors corresponding to the words in its plot outline. This averaging process leverages the additive compositionality property of word2vec, where the combined representation retains meaningful semantic information. By averaging rather than summing the vectors, we mitigate the risk of excessively large input values, thereby maintaining numerical stability during subsequent neural network processing.
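A minimal sketch of this averaging scheme, using a toy embedding table in place of actual pre-trained word2vec vectors:

```python
import numpy as np

# Toy embedding table; in practice these vectors come from a trained
# word2vec model.
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=300) for w in ["detective", "hunts", "killer", "city"]}

def plot_vector(tokens, embedding, dim=300):
    """Average the word vectors of the in-vocabulary tokens of a plot outline."""
    vecs = [embedding[t] for t in tokens if t in embedding]
    if not vecs:
        return np.zeros(dim)  # fall back to zeros for empty/all-OOV plots
    # Averaging (rather than summing) keeps input magnitudes bounded.
    return np.mean(vecs, axis=0)

v = plot_vector("a detective hunts a killer in the city".split(), embedding)
print(v.shape)  # (300,)
```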
- Recurrent Neural Network
- For a more context-aware representation, we explored the use of Recurrent Neural Networks (RNNs) to model the sequential nature of textual data. Specifically, we investigated two variants:
- RNN_w2v: This variant employs transfer learning by utilizing pre-trained word2vec vectors as input embeddings. The RNN processes the sequence of word vectors, capturing temporal dependencies and contextual information to generate a comprehensive representation of the plot outline.
- RNN_end2end: In contrast, this variant learns word embeddings from scratch in an end-to-end manner, allowing the RNN to optimize the embeddings jointly with the classification task. This approach enables the model to tailor the embeddings specifically to the genre classification objective, potentially enhancing performance by capturing task-relevant features.
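The sketch below illustrates both variants in PyTorch. The GRU cell, dimensionalities, and single recurrent layer are illustrative assumptions; passing `pretrained` embeddings corresponds to RNN_w2v, while randomly initialized embeddings correspond to RNN_end2end:

```python
import torch
import torch.nn as nn

class PlotRNN(nn.Module):
    """Sequence encoder for plot outlines. If `pretrained` is given, the
    embedding layer is initialized from word2vec (the RNN_w2v variant);
    otherwise embeddings are learned jointly with the task (RNN_end2end)."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_genres, pretrained=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        if pretrained is not None:
            self.embed.weight.data.copy_(pretrained)  # transfer-learned vectors
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_genres)

    def forward(self, token_ids):
        _, h = self.rnn(self.embed(token_ids))  # h: (1, batch, hidden_dim)
        return self.out(h.squeeze(0))           # one logit per genre

model = PlotRNN(vocab_size=10000, emb_dim=300, hidden_dim=128, n_genres=23)
logits = model(torch.randint(1, 10000, (4, 50)))  # batch of 4 padded plots
print(logits.shape)  # torch.Size([4, 23])
```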
3.3. Visual Representation
- VGG Transfer
- This approach leverages the VGG Network [53], a deep CNN architecture renowned for its performance on the ImageNet dataset. By utilizing the pre-trained VGG model as a feature extractor, we extract the activations from the last hidden layer as the visual representation for each movie poster. This transfer learning strategy capitalizes on the rich feature representations learned from extensive image data, enabling effective utilization of visual information without the need for training a network from scratch.
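A sketch of this feature-extraction step with torchvision; the choice of VGG-16 and ImageNet weights here is an assumption made for illustration (the weights are downloaded on first use):

```python
import torch
import torchvision.models as models

# Load a VGG pre-trained on ImageNet and drop the final classification layer,
# so the network outputs the 4096-d activations of the last hidden layer.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

with torch.no_grad():
    poster = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed poster
    features = vgg(poster)
print(features.shape)  # torch.Size([1, 4096])
```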
- End-to-End CNN
- In contrast to the transfer learning approach, the end-to-end CNN strategy involves training a custom CNN architecture from the ground up, tailored specifically to our dataset and classification task. Our architecture comprises five convolutional layers designed to progressively extract higher-level features from the input images, followed by a Multi-Layer Perceptron (MLP) that serves as the classifier. This end-to-end training allows the CNN to learn feature representations that are highly specialized and optimized for the task of genre classification based on poster imagery.
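The following sketch shows a five-convolutional-layer CNN with an MLP head in the spirit of this design; all filter counts, kernel sizes, and the pooling scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PosterCNN(nn.Module):
    """Illustrative five-conv-layer CNN with an MLP classifier head."""
    def __init__(self, n_genres=23):
        super().__init__()
        chans = [3, 32, 64, 128, 128, 256]
        blocks = []
        # Five conv blocks, each halving the spatial resolution: 224 -> 7.
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_genres)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = PosterCNN()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 23])
```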
3.4. Classification Model
- Logistic Regression
- As a baseline classification approach, we employed Logistic Regression, a well-established statistical method for binary and multiclass classification tasks. Logistic Regression models the probability of each genre label given the input feature vectors, making it a straightforward yet effective choice for multilabel classification scenarios.
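For multilabel data, logistic regression is typically applied one-vs-rest, training one independent binary classifier per genre; a sketch with scikit-learn on random stand-in data (hyperparameters are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))          # e.g. averaged word2vec features
Y = rng.integers(0, 2, size=(200, 23))   # binary genre indicator matrix

# One logistic regression per genre (one-vs-rest).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
probs = clf.predict_proba(X[:5])         # per-genre probabilities
print(probs.shape)  # (5, 23)
```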
- Neural Network Architecture
- To harness the expressive power of deep learning, we implemented a Multilayer Perceptron (MLP) with two fully connected layers, incorporating the Maxout activation function, defined as $h_i(\mathbf{x}) = \max_{j \in [1,k]} z_{ij}$ with $z_{ij} = \mathbf{x}^\top \mathbf{W}_{\cdot ij} + b_{ij}$, where $\mathbf{x}$ is the input vector, $z_{ij}$ represents the output of the $j$-th linear transformation for the $i$-th hidden unit, and $\mathbf{W}$ and $\mathbf{b}$ are the learnable parameters. Maxout offers several advantages: Maxout networks have been demonstrated to act as universal approximators with as few as two hidden units [24], and the activation does not saturate, allowing for more stable and efficient training dynamics. By incorporating Maxout into the MLP, we aim to enhance the model's capacity to capture intricate patterns and relationships within the multimodal feature space, thereby improving classification performance.
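A compact PyTorch sketch of a Maxout layer and a two-layer MaxoutMLP head; the piece count k = 2 and the layer widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout layer: computes k linear transformations of the input and takes
    the element-wise maximum, h_i(x) = max_{j in [1,k]} (x^T W_ij + b_ij)."""
    def __init__(self, in_dim, out_dim, k=2):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        z = self.linear(x).view(-1, self.out_dim, self.k)
        return z.max(dim=-1).values  # max over the k pieces per hidden unit

# Two-layer MaxoutMLP over fused multimodal features (sizes illustrative).
mlp = nn.Sequential(Maxout(512, 128), Maxout(128, 64), nn.Linear(64, 23))
print(mlp(torch.randn(4, 512)).shape)  # torch.Size([4, 23])
```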
4. Experiments
4.1. Dataset
4.2. Experimental Configuration
Evaluation Metrics
- Sample-based Average ($F_{\text{samples}}$): This metric calculates the f-score for each individual sample and subsequently averages these scores across all samples: $F_{\text{samples}} = \frac{1}{N}\sum_{i=1}^{N} \frac{2\,|\hat{y}_i \wedge y_i|}{|\hat{y}_i| + |y_i|}$
- Micro-average ($F_{\text{micro}}$): This approach aggregates the contributions of all classes to compute the f-score globally by considering all true positives, false positives, and false negatives: $F_{\text{micro}} = \frac{2\sum_{j=1}^{Q} tp_j}{2\sum_{j=1}^{Q} tp_j + \sum_{j=1}^{Q} fp_j + \sum_{j=1}^{Q} fn_j}$
- Macro-average ($F_{\text{macro}}$): Here, the f-score $F_j = \frac{2\,p_j r_j}{p_j + r_j}$ is computed independently for each genre and then averaged, treating all genres equally regardless of their prevalence: $F_{\text{macro}} = \frac{1}{Q}\sum_{j=1}^{Q} F_j$
- Weighted Macro-average ($F_{\text{weighted}}$): Similar to the macro-average, but each genre's f-score is weighted by the number of true instances it has, providing a balance between rare and common genres: $F_{\text{weighted}} = \frac{\sum_{j=1}^{Q} n_j F_j}{\sum_{j=1}^{Q} n_j}$
In the above:
- $N$ denotes the total number of samples.
- $Q$ represents the total number of genres.
- $n_j$ is the count of true instances for the j-th genre.
- $p_j$ and $r_j$ are the precision and recall for the j-th genre, respectively.
- $\hat{y}_i$ and $y_i$ are the predicted and true binary label vectors for the i-th sample.
- $tp_j$, $fp_j$, and $fn_j$ correspond to the true positives, false positives, and false negatives for the j-th genre.
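All four averages correspond to the `average` modes of scikit-learn's `f1_score`, as the toy example below illustrates:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel predictions: rows are samples, columns are genres.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 23))
y_pred = rng.integers(0, 2, size=(100, 23))

for avg in ["samples", "micro", "macro", "weighted"]:
    print(avg, f1_score(y_true, y_pred, average=avg, zero_division=0))
```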
Textual Feature Representation
Visual Feature Representation
Multimodal Feature Integration
- Average Probability (Late Fusion)
- This strategy involves averaging the probabilities output by the best-performing model for each modality and then applying a threshold to determine the final genre predictions.
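A sketch of this late-fusion rule on toy probabilities; the 0.5 threshold is an illustrative assumption, as the cut-off may be tuned in practice:

```python
import numpy as np

# Per-genre probabilities from the best text and visual models (toy values).
p_text = np.array([[0.9, 0.2, 0.6], [0.1, 0.7, 0.4]])
p_visual = np.array([[0.7, 0.4, 0.2], [0.3, 0.8, 0.6]])

# Average the two probability vectors, then threshold per genre.
p_fused = (p_text + p_visual) / 2.0
predictions = (p_fused >= 0.5).astype(int)
print(predictions)
```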
- Concatenation
- In this strategy, the textual and visual feature vectors are concatenated into a single vector, which is then processed by the MaxoutMLP classifier.
- Linear Sum
- Inspired by the approach of Vinyals et al. [60], this method applies a linear transformation to each modality’s representation to align their dimensionalities before summing them. The resultant vector is then processed by the MaxoutMLP classifier.
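A sketch of this fusion step as a PyTorch module; the dimensionalities are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LinearSumFusion(nn.Module):
    """Project each modality to a shared dimensionality, then sum, in the
    spirit of Vinyals et al. [60]; sizes here are illustrative."""
    def __init__(self, text_dim=300, visual_dim=4096, fused_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.visual_proj = nn.Linear(visual_dim, fused_dim)

    def forward(self, x_text, x_visual):
        return self.text_proj(x_text) + self.visual_proj(x_visual)

fused = LinearSumFusion()(torch.randn(4, 300), torch.randn(4, 4096))
print(fused.shape)  # torch.Size([4, 512]) -- fed to the MaxoutMLP classifier
```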
- Mixture of Experts (MoE)
- We adapted the MoE model [31] for multilabel classification by exploring two gating mechanisms: tied gating, where a single gate influences all logistic outputs, and untied gating, where each logistic output has its own dedicated gate. Both logistic regression and MaxoutMLP were evaluated as expert models within this framework.
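The sketch below shows one way to realize tied and untied gating with linear experts; it is an illustrative reading of the scheme rather than the exact formulation, and for brevity both experts see the same fused feature vector:

```python
import torch
import torch.nn as nn

class MultilabelMoE(nn.Module):
    """Mixture of experts for multilabel output. With tied=True a single
    softmax gate weights the experts for every genre; with tied=False each
    genre gets its own gate over the experts."""
    def __init__(self, in_dim, n_genres, n_experts=2, tied=True):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, n_genres) for _ in range(n_experts)]
        )
        gate_out = n_experts if tied else n_experts * n_genres
        self.gate = nn.Linear(in_dim, gate_out)
        self.tied, self.n_genres, self.n_experts = tied, n_genres, n_experts

    def forward(self, x):
        # expert_probs: (batch, n_experts, n_genres)
        expert_probs = torch.stack([torch.sigmoid(e(x)) for e in self.experts], dim=1)
        if self.tied:
            w = torch.softmax(self.gate(x), dim=-1).unsqueeze(-1)  # (batch, E, 1)
        else:
            w = torch.softmax(
                self.gate(x).view(-1, self.n_experts, self.n_genres), dim=1
            )  # per-genre gate: (batch, E, G)
        return (w * expert_probs).sum(dim=1)  # mixed per-genre probabilities

moe = MultilabelMoE(in_dim=512, n_genres=23, n_experts=2, tied=False)
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 23])
```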
Neural Network Training Procedures
5. Experimental Results
5.1. Performance on Synthetic Data
- $C$ is the binary class label.
- $M$ decides which modality holds the class-informative features.
- $\hat{x}_v$ and $\hat{x}_t$ are class-dependent features drawn from Gaussian distributions centered at $\mu_C^v$ and $\mu_C^t$, respectively.
- $\tilde{x}_v$ and $\tilde{x}_t$ are noise features drawn from Gaussian distributions centered at $\tilde{\mu}^v$ and $\tilde{\mu}^t$, respectively.
- The input features $x_v$ and $x_t$ are composites of either informative or noise features based on the value of $M$.
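A sketch of this generator in NumPy, mirroring the symbols above; the means, variances, and dimensionalities are illustrative choices rather than the exact experimental values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 10
C = rng.integers(0, 2, size=n)        # binary class label
M = rng.integers(0, 2, size=n)        # which modality is informative
mu_class = np.array([-1.0, 1.0])      # class-dependent Gaussian centres
mu_noise = 0.0                        # centre of the noise distribution

# Class-informative features (centre depends on C) and pure-noise features.
informative = rng.normal(loc=mu_class[C, None], scale=1.0, size=(n, dim))
noise_v = rng.normal(loc=mu_noise, scale=1.0, size=(n, dim))
noise_t = rng.normal(loc=mu_noise, scale=1.0, size=(n, dim))

# Modality inputs: informative where M selects that modality, noise otherwise.
x_v = np.where(M[:, None] == 0, informative, noise_v)
x_t = np.where(M[:, None] == 1, informative, noise_t)
print(x_v.shape, x_t.shape)
```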
5.2. Genre Classification Performance
5.3. Detailed Genre-wise Analysis
5.4. Analysis of Modality Influence
5.5. Qualitative Analysis of Predictions
6. Conclusions and Future Directions
References
- Anson Bastos, Abhishek Nadgeri, Kuldeep Singh, Isaiah Onando Mulang, Saeedeh Shekarpour, Johannes Hoffart, and Manohar Kaul. 2021. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network. In Proceedings of the Web Conference 2021. 1673–1685.
- Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. 2019. Look before You Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management CIKM. 729–738.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171–4186.
- Endri Kacupaj, Kuldeep Singh, Maria Maleshkova, and Jens Lehmann. 2022. An Answer Verbalization Dataset for Conversational Question Answerings over Knowledge Graphs. arXiv preprint, 2022; arXiv:2208.06734.
- Magdalena Kaiser, Rishiraj Saha Roy, and Gerhard Weikum. 2021. Reinforcement Learning from Reformulations In Conversational Question Answering over Knowledge Graphs. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 459–469.
- Yunshi Lan, Gaole He, Jinhao Jiang, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. 2021. A Survey on Complex Knowledge Base Question Answering: Methods, Challenges and Solutions. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. International Joint Conferences on Artificial Intelligence Organization, 4483–4491. Survey Track.
- Yunshi Lan and Jing Jiang. 2021. Modeling transitions of focal entities for conversational knowledge base question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7871–7880.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
- Pierre Marion, Paweł Krzysztof Nowak, and Francesco Piccinno. 2021. Structured Context and High-Coverage Grammar for Conversational Question Answering over Knowledge Graphs. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
- Zeynep Akata, Honglak Lee, and Bernt Schiele. Zero-Shot Learning with Structured Embeddings. CoRR, abs/1409.8403, 2014. Available online: http://arxiv.org/abs/1409.8403.
- Deepa Anand. Evaluating folksonomy information sources for genre prediction. In Advance Computing Conference (IACC), 2014 IEEE International, pp. 887–892, feb 2014. [CrossRef]
- Galen Andrew, Raman Arora, Jeff A Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML (3), pp. 1247–1255, 2013.
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
- Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16:345–379, 2010. [CrossRef]
- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research 2003, 3, 1137–1155.
- James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
- Chidansh Bhatt and Mohan Kankanhalli. Multimedia data mining: state of the art and challenges. Multimedia Tools and Applications, 51(1):35–76, 2011. ISSN 1380-7501. [CrossRef]
- Adam Coates and Andrew Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 921–928, 2011.
- Li Deng. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing, 3, 2014. Available online: http://journals.cambridge.org/article_S2048770313000097. ISSN 2048-7703. [CrossRef]
- Fangxiang Feng, Ruifan Li, and Xiaojie Wang. Constructing hierarchical image-tags bimodal representations for word tags alternative choice. arXiv preprint arXiv:1307.1275, 2013.
- Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 26, pp. 2121–2129. Curran Associates, Inc., 2013. Available online: http://papers.nips.cc/paper/5204-devise-a-deep-visual-semantic-embedding-model.pdf.
- Zhikang Fu, Bing Li, Jun Li, and Shuhua Wei. Fast Film Genres Classification Combining Poster and Synopsis. In Xiaofei He, Xinbo Gao, Yanning Zhang, Zhi-Hua Zhou, Zhi-Yong Liu, Baochuan Fu, Fuyuan Hu, and Zhancheng Zhang (eds.), Lecture Notes in Computer Science, volume 9242, pp. 72–81. Springer International Publishing, Cham, 2015. 10.1007/978-3-319-23989-7_8. Available online: http://link.springer.com/10.1007/978-3-319-23862-3.
- Ian Goodfellow, David Warde-farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Sanjoy Dasgupta and David Mcallester (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pp. 1319–1327. JMLR Workshop and Conference Proceedings, May 2013.
- Hao-Zhi Hong and Jen-Ing G Hwang. Multimodal PLSA for Movie Genre Classification. In Friedhelm Schwenker, Fabio Roli, and Josef Kittler (eds.), Multiple Classifier Systems: 12th International Workshop, MCS 2015, Günzburg, Germany, June 29 – July 1, 2015, Proceedings, pp. 159–167. Springer International Publishing, Cham, 2015. ISBN 978-3-319-20248-8. [CrossRef]
- Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873–882. Association for Computational Linguistics, 2012.
- Hui-Yu Huang, Weir-Sheng Shih, and Wen-Hsing Hsu. A Film Classifier Based on Low-level Visual Features. In 2007 IEEE 9th Workshop on Multimedia Signal Processing, volume 3, pp. 465–468. IEEE, 2007. ISBN 978-1-4244-1273-0. 10.1109/MMSP.2007.4412917. Available online: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4412917.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pp. 448–456, 2015.
- Marina Ivasic-Kos, Miran Pobar, and Luka Mikec. Movie posters classification into genres based on low-level features. In 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1198–1203. IEEE, May 2014. ISBN 978-953-233-077-9. 10.1109/MIPRO.2014.6859750. Available online: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6859750.
- Marina Ivasic-Kos, Miran Pobar, and Ivo Ipsic. Automatic Movie Posters Classification into Genres. In Ana Madevska Bogdanova and Dejan Gjorgjevikj (eds.), ICT Innovations 2014: World of Data, pp. 319–328. Springer International Publishing, Cham, 2015. Available online: http://dx.doi.org/10.1007/978-3-319-09879-1_32. ISBN 978-3-319-09879-1.
- Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation 1991, 3, 79–87.
- Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. arXiv preprint, 2015; arXiv:1511.07571.
- Ioannis Kanaris and Efstathios Stamatatos. Learning to recognize webpage genres. Information Processing and Management, 45(5):499–512, 2009. Available online: http://dx.doi.org/10.1016/j.ipm.2009.05.003. ISSN 0306-4573. [CrossRef]
- Yoonseop Kang, Saehoon Kim, and Seungjin Choi. Deep learning to hash with multiple representations. In 2012 IEEE 12th International Conference on Data Mining, pp. 930–935. IEEE, 2012.
- Douwe Kiela and Léon Bottou. Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-14), 2014.
- Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014; arXiv:1412.6980.
- Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Multimodal neural language models. In ICML, volume 14, pp. 595–603, 2014a.
- Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv preprint, 2014b; arXiv:1411.2539.
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, may 2015. [CrossRef]
- Li Deng and Dong Yu. Deep Learning: Methods and Applications. NOW Publishers, May 2014. Available online: https://www.microsoft.com/en-us/research/publication/deep-learning-methods-and-applications/.
- Xinyan Lu, Fei Wu, Xi Li, Yin Zhang, Weiming Lu, Donghui Wang, and Yueting Zhuang. Learning multimodal neural network with ranking examples. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 985–988. ACM, 2014.
- Gjorgji Madjarov, Dragi Kocev, Dejan Gjorgjevikj, and Sašo Džeroski. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9):3084–3104, 2012. Available online: http://dx.doi.org/10.1016/j.patcog.2012.03.004. URL: http://www.sciencedirect.com/science/article/pii/S0031320312001203. ISSN 0031-3203. [CrossRef]
- Eric Makita and Artem Lenskiy. A movie genre prediction based on Multivariate Bernoulli model and genre correlations. arXiv preprint, 2016a; arXiv:1604.08608. Available online: http://arxiv.org/abs/1604.08608.
- Eric Makita and Artem Lenskiy. A multinomial probabilistic model for movie genre predictions. arXiv preprint, 2016b; arXiv:1603.07849. Available online: http://arxiv.org/abs/1603.07849.
- Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint, 2014; arXiv:1410.1090.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint, 2013; arXiv:1301.3781.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013b.
- J Ngiam, A Khosla, and M Kim. Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011. Available online: http://ai.stanford.edu/~ang/papers/icml11-MultimodalDeepLearning.pdf.
- Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeff Dean. Zero-Shot Learning by Convex Combination of Semantic Embeddings. CoRR, abs/1312.5650, December 2014. Available online: http://arxiv.org/abs/1312.5650.
- Gregory Pais, Patrick Lambert, Daniel Beauchene, Francoise Deloule, and Bogdan Ionescu. Animated movie genre detection using symbolic fusion of text and image descriptors. In 2012 10th International Workshop on Content-Based Multimedia Indexing (CBMI), pp. 1–6. IEEE, June 2012. Available online: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6269813. ISBN 978-1-4673-2369-7. [CrossRef]
- Deli Pei, Huaping Liu, Yulong Liu, and Fuchun Sun. Unsupervised multimodal feature learning for semantic image segmentation. In The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE, August 2013. Available online: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6706748. ISBN 978-1-4673-6129-3. [CrossRef]
- Dharak Shah, Saheb Motiani, and Vishrut Patel. Movie Classification Using k-Means and Hierarchical Clustering. Technical report, 2013.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014; arXiv:1409.1556.
- Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-Shot Learning Through Cross-Modal Transfer. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 26, pp. 935–943. Curran Associates, Inc., 2013. Available online: http://papers.nips.cc/paper/5027-zero-shot-learning-through-cross-modal-transfer.pdf.
- Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. Grounded Compositional Semantics for Finding and Describing Images with Sentences. Transactions of the Association for Computational Linguistics (TACL), 2(April):207–218, 2014. Available online: http://nlp.stanford.edu/~socherr/SocherLeManningNg_nipsDeepWorkshop2013.pdf.
- Nitish Srivastava and Ruslan Salakhutdinov. Multimodal Learning with Deep Boltzmann Machines. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 2222–2230. Curran Associates, Inc., 2012. Available online: http://papers.nips.cc/paper/4683-multimodal-learning-with-deep-boltzmann-machines.pdf.
- Heung Il Suk and Dinggang Shen. Deep learning-based feature representation for AD/MCI classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 8150 LNCS, pp. 583–590, 2013. ISBN 9783642407628.
- Jian Tu, Zuxuan Wu, Qi Dai, Yu-Gang Jiang, and Xiangyang Xue. Challenge Huawei challenge: Fusing multimodal features with deep neural networks for Mobile Video Annotation. In Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on, pp. 1–6, 2014. [CrossRef]
- Bart Van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and fuel: Frameworks for deep learning. arXiv preprint, 2015; arXiv:1506.00619.
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2015.
- Pengcheng Wu, Steven C.H. Hoi, Hao Xia, Peilin Zhao, Dayong Wang, and Chunyan Miao. Online multimodal deep similarity learning with application to image retrieval. In Proceedings of the 21st ACM international conference on Multimedia - MM '13, pp. 153–162, New York, NY, USA, 2013. ACM Press. Available online: http://doi.acm.org/10.1145/2502081.2502112. ISBN 9781450324045. [CrossRef]
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2(3):5, 2015.
- Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177–1193, 2012.
- Yin Zheng, YJ Zhang, and Hugo Larochelle. Topic Modeling of Multimodal Data: an Autoregressive Approach. In IEEE Conference on Computer Vision and Pattern Recognition, 2014. Available online: http://www.dmi.usherb.ca/~larocheh/publications/ZhengY2014.pdf.
- Matthew J Smith. Getting value from artificial intelligence in agriculture. Animal Production Science, 2018.
- Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, and Shuicheng Yan. Enhancing video-language representations with structural spatio-temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024a.
- Hao Fei, Yafeng Ren, and Donghong Ji. Retrofitting structure-aware transformer language model for end tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2151–2161, 2020a.
- Shengqiong Wu, Hao Fei, Fei Li, Meishan Zhang, Yijiang Liu, Chong Teng, and Donghong Ji. Mastering the explicit opinion-role interaction: Syntax-aided neural transition system for unified opinion role labeling. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 11513–11521, 2022.
- Wenxuan Shi, Fei Li, Jingye Li, Hao Fei, and Donghong Ji. Effective token graph modeling using a novel labeling strategy for structured sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4232–4241, 2022.
- Hao Fei, Yue Zhang, Yafeng Ren, and Donghong Ji. Latent emotion memory for multi-label emotion classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7692–7699, 2020b.
- Fengqi Wang, Fei Li, Hao Fei, Jingye Li, Shengqiong Wu, Fangfang Su, Wenxuan Shi, Donghong Ji, and Bo Cai. Entity-centered cross-document relation extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9871–9881, 2022.
- Ling Zhuang, Hao Fei, and Po Hu. Knowledge-enhanced event relation extraction via event ontology prompt. Inf. Fusion, 100:101919, 2023.
- Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint, 2018; arXiv:1804.09541.
- Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. arXiv preprint, 2023; arXiv:2305.11719.
- Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint, 2024; arXiv:2405.18357.
- Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint, 2017; arXiv:1704.05179.
- Hao Fei, Shengqiong Wu, Jingye Li, Bobo Li, Fei Li, Libo Qin, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Lasuie: Unifying information extraction with latent adaptive structure-aware generative language model. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022, pages 15460–15475, 2022a.
- Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 37(1):9–27, 2011.
- Hao Fei, Yafeng Ren, Yue Zhang, Donghong Ji, and Xiaohui Liang. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in Bioinformatics 2021, 22.
- Shengqiong Wu, Hao Fei, Wei Ji, and Tat-Seng Chua. Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2593–2608, 2023b.
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint, 2016; arXiv:1606.05250.
- Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. Reasoning implicit sentiment with chain-of-thought prompting. arXiv preprint, 2023; arXiv:2305.11255.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. Available online: https://aclanthology.org/N19-1423. [CrossRef]
- Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. CoRR, abs/2309.05519, 2023c.
- Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the International Conference on Machine Learning, 2024b.
- Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari, et al. Agribot: agriculture-specific question answer system. IndiaRxiv.
- Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024c.
- Mihir Momaya, Anjnya Khanna, Jessica Sadavarte, and Manoj Sankhe. Krushi–the farmer chatbot. In 2021 International Conference on Communication information and Computing Technology (ICCICT), pages 1–6. IEEE, 2021.
- Hao Fei, Fei Li, Chenliang Li, Shengqiong Wu, Jingye Li, and Donghong Ji. Inheriting the wisdom of predecessors: A multiplex cascade framework for unified aspect-based sentiment analysis. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, pages 4096–4103, 2022b.
- Shengqiong Wu, Hao Fei, Yafeng Ren, Donghong Ji, and Jingye Li. Learn from syntax: Improving pair-wise aspect and opinion terms extraction with rich syntactic knowledge. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 3957–3963, 2021.
- Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua, Donghong Ji, and Fei Li. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5923–5934, 2023.
- Hao Fei, Qian Liu, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5980–5994, 2023b.
- Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. 2024d.
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.
- Abbott Chen and Chai Liu. Intelligent commerce facilitates education technology: The platform and chatbot for the taiwan agriculture service. International Journal of e-Education, e-Business, e-Management and e-Learning, 11:1–10, 01 2021.
- Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards semantic equivalence of tokenization in multimodal llm. arXiv preprint, 2024; arXiv:2406.05127.
- Jingye Li, Kang Xu, Fei Li, Hao Fei, Yafeng Ren, and Donghong Ji. MRN: A locally and globally mention-based reasoning network for document-level relation extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1359–1370, 2021.
- Hao Fei, Shengqiong Wu, Yafeng Ren, and Meishan Zhang. Matching structure for dual learning. In Proceedings of the International Conference on Machine Learning, ICML, pages 6373–6391, 2022c.
- Hu Cao, Jingye Li, Fangfang Su, Fei Li, Hao Fei, Shengqiong Wu, Bobo Li, Liang Zhao, and Donghong Ji. OneEE: A one-stage framework for fast overlapping and nested event extraction. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1953–1964, 2022.
- Isakwisa Gaddy Tende, Kentaro Aburada, Hisaaki Yamaba, Tetsuro Katayama, and Naonobu Okazaki. Proposal for a crop protection information system for rural farmers in tanzania. Agronomy, 11(12):2411, 2021.
- Hao Fei, Yafeng Ren, and Donghong Ji. Boundaries and edges rethinking: An end-to-end neural model for overlapping entity relation extraction. Information Processing & Management, 57(6):102311, 2020c.
- Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10965–10973, 2022.
- Mohit Jain, Pratyush Kumar, Ishita Bhansali, Q Vera Liao, Khai Truong, and Shwetak Patel. Farmchat: a conversational agent to answer farmer queries. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(4):1–22, 2018b.
- Shengqiong Wu, Hao Fei, Hanwang Zhang, and Tat-Seng Chua. Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 79240–79259, 2023d.
- Hao Fei, Tat-Seng Chua, Chenliang Li, Donghong Ji, Meishan Zhang, and Yafeng Ren. On the robustness of aspect-based sentiment analysis: Rethinking model, data, and training. ACM Transactions on Information Systems, 41(2):50:1–50:32, 2023c.
- Yu Zhao, Hao Fei, Yixin Cao, Bobo Li, Meishan Zhang, Jianguo Wei, Min Zhang, and Tat-Seng Chua. Constructing holistic spatio-temporal scene graph for video semantic role labeling. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5281–5291, 2023a.
- Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14734–14751, 2023e.
- Hao Fei, Yafeng Ren, Yue Zhang, and Donghong Ji. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Transactions on Neural Networks and Learning Systems, 34(9):5544–5556, 2023d.
- Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Generating visual spatial description via holistic 3D scene understanding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7960–7977, 2023b.
| Genre | Train | Dev | Test | Genre | Train | Dev | Test |
|---|---|---|---|---|---|---|---|
| Drama | 8424 | 1401 | 4142 | Family | 978 | 172 | 518 |
| Comedy | 5108 | 873 | 2611 | Biography | 788 | 144 | 411 |
| Romance | 3226 | 548 | 1590 | War | 806 | 128 | 401 |
| Thriller | 3113 | 512 | 1567 | History | 680 | 118 | 345 |
| Crime | 2293 | 382 | 1163 | Music | 634 | 100 | 311 |
| Action | 2155 | 351 | 1044 | Animation | 586 | 105 | 306 |
| Adventure | 1611 | 278 | 821 | Musical | 503 | 85 | 253 |
| Horror | 1603 | 275 | 825 | Western | 423 | 72 | 210 |
| Documentary | 1234 | 219 | 629 | Sport | 379 | 64 | 191 |
| Mystery | 1231 | 209 | 617 | Short | 281 | 48 | 142 |
| Sci-Fi | 1212 | 193 | 586 | Film-Noir | 202 | 34 | 102 |
| Fantasy | 1162 | 186 | 585 | | | | |
| Modality | Representation | F (weighted) | F (samples) | F (micro) | F (macro) |
|---|---|---|---|---|---|
| Multimodal | FGU | 0.617 | 0.630 | 0.630 | 0.541 |
| | Linear_sum | 0.600 | 0.607 | 0.607 | 0.530 |
| | Concatenate | 0.597 | 0.605 | 0.606 | 0.521 |
| | AVG_probs | 0.604 | 0.616 | 0.615 | 0.491 |
| | MoE_MaxoutMLP | 0.592 | 0.593 | 0.601 | 0.516 |
| | MoE_MaxoutMLP (tied) | 0.579 | 0.579 | 0.587 | 0.489 |
| | MoE_Logistic | 0.541 | 0.557 | 0.565 | 0.456 |
| | MoE_Logistic (tied) | 0.483 | 0.507 | 0.518 | 0.358 |
| Text | MaxoutMLP_w2v | 0.588 | 0.592 | 0.595 | 0.488 |
| | RNN_transfer | 0.570 | 0.580 | 0.580 | 0.480 |
| | MaxoutMLP_w2v_1_hidden | 0.540 | 0.540 | 0.550 | 0.440 |
| | Logistic_w2v | 0.530 | 0.540 | 0.550 | 0.420 |
| | MaxoutMLP_3grams | 0.510 | 0.510 | 0.520 | 0.420 |
| | Logistic_3grams | 0.510 | 0.520 | 0.530 | 0.400 |
| | RNN_end2end | 0.490 | 0.490 | 0.490 | 0.370 |
| Visual | VGG_Transfer | 0.410 | 0.429 | 0.437 | 0.284 |
| | CNN_end2end | 0.370 | 0.350 | 0.340 | 0.210 |
| Genre | Textual | Visual | FGU | Genre | Textual | Visual | FGU |
|---|---|---|---|---|---|---|---|
| Drama | 0.74 | 0.67 | 0.77 | Fantasy | 0.42 | 0.25 | 0.46 |
| Comedy | 0.65 | 0.59 | 0.68 | Family | 0.50 | 0.46 | 0.58 |
| Romance | 0.53 | 0.33 | 0.51 | Biography | 0.40 | 0.02 | 0.25 |
| Thriller | 0.57 | 0.39 | 0.62 | War | 0.57 | 0.19 | 0.64 |
| Crime | 0.61 | 0.25 | 0.59 | History | 0.35 | 0.06 | 0.29 |
| Action | 0.58 | 0.37 | 0.60 | Animation | 0.43 | 0.61 | 0.68 |
| Adventure | 0.51 | 0.32 | 0.51 | Musical | 0.14 | 0.18 | 0.28 |
| Horror | 0.65 | 0.41 | 0.69 | Western | 0.52 | 0.37 | 0.65 |
| Documentary | 0.67 | 0.18 | 0.76 | Sport | 0.64 | 0.11 | 0.70 |
| Mystery | 0.38 | 0.11 | 0.39 | Short | 0.20 | 0.24 | 0.27 |
| Sci-Fi | 0.63 | 0.30 | 0.66 | Film-Noir | 0.02 | 0.11 | 0.37 |
| Music | 0.51 | 0.01 | 0.48 | | | | |
| Movie | Ground Truth | Textual | Visual | FGU |
|---|---|---|---|---|
| The World According to Sesame Street | Documentary | Documentary, History | Comedy, Adventure, Family, Animation | Documentary |
| Babar: The Movie | Adventure, Fantasy, Family, Animation, Musical | Adventure, Documentary, War, Music | Comedy, Adventure, Family, Animation | Adventure, Family, Animation |
| Letters from Iwo Jima | Drama, War, History | Drama, Action, War, History | Thriller, Action, Adventure, Sci-Fi | Drama, War, History |
| The Last Elvis | Drama | Comedy, Documentary, Family, Biography, Music | Drama, Romance | Drama |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).