Submitted: 24 September 2024
Posted: 26 September 2024
Abstract
Keywords:
1. Introduction
- We introduce VTFN, a straightforward multimodal model that surpasses existing image-text integration approaches across a diverse array of text similarity evaluation tasks. Moreover, VTFN achieves state-of-the-art performance on image-related SemEval datasets while requiring significantly less training data.
- We conduct an exhaustive investigation into image-to-text knowledge transfer, evaluating various model architectures, text encoding strategies, loss functions, and dataset configurations to identify optimal practices for enhancing embedding quality.
- We demonstrate that directly learning sentence embeddings through our proposed VTFN method consistently outperforms traditional techniques that first learn word-level embeddings and subsequently aggregate them, highlighting the advantages of end-to-end sentence embedding approaches.
2. Related Work
- Model A: This model mirrors the captioning framework introduced by [32], where an RNN decoder is conditioned on a pre-trained CNN embedding. Specifically, the RNN (a Gated Recurrent Unit, or GRU, in their experiments) processes the input text to predict the next token in the sequence, with its initial state set to a transformation of the final internal layer of a pre-trained VGGNet [29], referred to here as the image embedding.
- Model B: This variant aligns the final state of the RNN with the image embedding, thereby enforcing a direct correspondence between the textual and visual modalities at the embedding level.
- Model C: Extending the multimodal skip-gram approach of [17], this model incorporates an additional loss term that measures the distance between word embeddings and the image embedding, further tightening the alignment between the two modalities.
3. System Architecture of VTFN
3.1. Text Encoder Design
3.1.1. Bag-of-Words (BOW) Model
3.1.2. Recurrent Neural Network (RNN) Model
3.1.3. Convolutional Neural Network (CNN) Model
3.2. Image Encoder Design
3.3. Embedding Alignment and Training Objective
3.4. Training Procedure and Optimization
- Word Embedding Matrix: Each row corresponds to the embedding vector of a word in the vocabulary V.
- Text Encoder Parameters: These include the weights and biases of the RNN or CNN models used for text encoding.
- Affine Transformation Matrix: W, which projects image embeddings into the textual embedding space.
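As a minimal sketch of how these three parameter groups interact, the following pure-Python example assumes a bag-of-words (mean) text encoder and an affine map applied to an image embedding. The bias vector `b`, the toy dimensions, the parameter values, and the function names `bow_encode` and `project_image` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def bow_encode(tokens: List[str], emb: Dict[str, Vector]) -> Vector:
    """Average the embedding rows of all in-vocabulary tokens."""
    rows = [emb[t] for t in tokens if t in emb]
    if not rows:
        raise ValueError("no in-vocabulary tokens")
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def project_image(W: List[Vector], b: Vector, image_vec: Vector) -> Vector:
    """Affine map W @ image_vec + b into the text embedding space."""
    return [
        sum(w * x for w, x in zip(row, image_vec)) + bias
        for row, bias in zip(W, b)
    ]


# Toy, hypothetical parameters: 2-d text space, 3-d image space.
embeddings = {"a": [1.0, 0.0], "dog": [0.0, 1.0], "runs": [1.0, 1.0]}
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.5]

sentence_vec = bow_encode(["a", "dog", "runs"], embeddings)
image_in_text_space = project_image(W, b, [0.2, 0.4, 0.6])
```

During training, the sentence vector and the projected image vector would be compared by the alignment objective; only the text-side parameters and the affine map would receive gradients if the image encoder is kept frozen.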
3.5. Regularization and Hyperparameter Tuning
- Dropout: Applied to the hidden layers of the RNN and CNN encoders to mitigate over-reliance on specific neurons.
- L2 Regularization: Added to the loss function to penalize large weights, encouraging smoother and more generalizable embeddings.
- Early Stopping: Monitoring the validation loss to halt training when performance ceases to improve, thereby avoiding overfitting.
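The L2 penalty and the early-stopping rule can be illustrated with a short sketch. The penalty weight `lam`, the `patience` value, and the validation-loss sequence below are hypothetical choices for illustration, not settings reported in the paper.

```python
from typing import Iterable, List


def l2_penalty(weights: Iterable[float], lam: float) -> float:
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def early_stopping_epoch(val_losses: List[float], patience: int) -> int:
    """Return the epoch index at which training would stop.

    Training halts once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1


penalty = l2_penalty([0.5, -1.0, 2.0], lam=0.01)
stop_at = early_stopping_epoch(
    [0.90, 0.70, 0.65, 0.66, 0.67, 0.64], patience=2
)
```

In this toy run the loss stops improving after epoch 2, so training halts at epoch 4 and never sees the later improvement at epoch 5; a larger `patience` trades extra compute for robustness to such plateaus.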
3.6. Evaluation Metrics and Benchmarking
- Pearson Correlation Coefficient: Measures the linear correlation between predicted similarity scores and ground truth labels.
- Spearman’s Rank Correlation: Assesses the monotonic relationship between predicted rankings and actual rankings of sentence pairs.
- Mean Reciprocal Rank (MRR): Evaluates the model’s ability to rank the correct image-sentence pair higher than incorrect pairs.
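The three metrics above can be computed with standard-library Python as sketched below. The function names and the toy scores are illustrative assumptions; ties in the Spearman computation are handled with average ranks, which is one common convention.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_reciprocal_rank(correct_ranks: Sequence[int]) -> float:
    """correct_ranks[i] is the 1-based rank of the correct item for query i."""
    return sum(1.0 / r for r in correct_ranks) / len(correct_ranks)


predicted = [0.1, 0.4, 0.35, 0.8]   # model similarity scores (toy values)
gold = [1.0, 3.0, 2.0, 5.0]         # human similarity labels (toy values)
r = pearson(predicted, gold)
rho = spearman(predicted, gold)
mrr = mean_reciprocal_rank([1, 2, 4])
```

Here the predicted scores order the pairs exactly as the gold labels do, so the Spearman value is 1 even though the Pearson value is slightly lower, which shows how the two metrics can disagree when the relationship is monotonic but not linear.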
3.7. Implementation Details
- Pre-trained Models: The InceptionV3 network is utilized as a fixed feature extractor, with its parameters frozen during training to focus on optimizing the alignment between image and text embeddings.
- Word Embeddings: Initialized using pre-trained GloVe embeddings [25] to provide a strong semantic foundation, followed by fine-tuning during training to adapt to the specific dataset.
- Batch Size: Set to 128 to balance computational efficiency and gradient stability.
- Learning Rate: Set to an initial value and then reduced by a decay schedule to ensure convergence.
- Optimization Algorithm: Adam optimizer, configured through its two moment-decay coefficients, to adaptively adjust learning rates for different parameters.
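To make the optimizer settings concrete, here is a minimal sketch of a single-parameter Adam update combined with an exponential learning-rate decay. The initial learning rate, decay rate, decay interval, and the coefficients `beta1=0.9` and `beta2=0.999` (the defaults suggested in the Adam paper) are assumptions for illustration; the values actually used in this work are not recoverable from the text. The quadratic objective is a toy stand-in for the training loss.

```python
import math
from typing import Tuple


def exponential_decay(initial_lr: float, decay_rate: float,
                      step: int, decay_steps: int) -> float:
    """Learning rate after `step` updates under exponential decay."""
    return initial_lr * decay_rate ** (step / decay_steps)


def adam_step(param: float, grad: float, m: float, v: float, t: int,
              lr: float, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    lr = exponential_decay(0.1, 0.5, t, 1000)
    w, m, v = adam_step(w, grad, m, v, t, lr)
```

After the loop, `w` sits close to the minimizer 3. In a real training run the same update would be applied element-wise to every trainable parameter, with gradients averaged over each mini-batch.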
3.8. Model Variants and Ablation Studies
- Text Encoder Variants: Comparing the performance of BOW, RNN, and CNN-based text encoders to determine which architecture best captures the semantic nuances of sentences.
- Affine Transformation: Evaluating the necessity and impact of the affine transformation matrix W in aligning image and text embeddings.
- Loss Functions: Exploring alternative loss functions beyond Pearson correlation to ascertain their effectiveness in optimizing embedding alignment.
- Regularization Techniques: Assessing the role of dropout, L2 regularization, and early stopping in preventing overfitting and enhancing generalization.
3.9. Scalability and Computational Efficiency
- Batch Processing: Utilizing mini-batch training to leverage parallel computations and expedite the training process.
- Dimensionality Reduction: Implementing principal component analysis (PCA) on image embeddings to reduce dimensionality without significant loss of information, thereby decreasing computational overhead.
- Hardware Acceleration: Leveraging GPU acceleration to expedite matrix operations and convolutional computations inherent in the model.
3.10. Integration with Downstream NLP Tasks
- Semantic Textual Similarity (STS): Leveraging the model’s embeddings to assess the semantic similarity between pairs of sentences with high accuracy.
- Information Retrieval: Enhancing search engines by utilizing aligned embeddings to improve the relevance of retrieved documents based on textual queries.
- Text Classification: Employing the embeddings as input features for classification tasks such as sentiment analysis, topic detection, and spam filtering.
- Machine Translation: Utilizing the rich semantic representations to improve the quality and coherence of translated text.
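For the similarity and retrieval use cases, a common way to consume sentence embeddings is cosine similarity. The sketch below assumes embeddings are plain lists of floats; the vectors, document names, and helper functions are hypothetical, and cosine similarity is an illustrative choice rather than a scoring rule confirmed by the source.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Sort candidates by descending cosine similarity to the query."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings standing in for encoder outputs.
query_vec = [0.9, 0.1, 0.0]
documents = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}
ranking = rank_by_similarity(query_vec, documents)
```

For STS the similarity of a sentence pair is used directly as the predicted score, while for retrieval the sorted list serves as the result ranking; for classification the embedding would instead be passed to a separate classifier as a feature vector.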
4. Experiments
4.1. Training Datasets
MS COCO
SBU
Pinterest5M
4.2. Hyperparameter Selection and Training
Algorithm 1: Protocol for Hyperparameter Search.
4.3. Evaluation
4.4. Results
- RNN-based Language Model: This model learns sentence embeddings through an RNN-based language model, corresponding to the PureTextRNN baseline from [20]. It serves as a benchmark to assess the incremental benefits of incorporating visual data.
- Word2Vec: We trained Word2Vec word embeddings [26] on a corpus consisting of sentences from the MS COCO dataset. This model provides a traditional word embedding baseline against which the performance of our multimodal approaches can be compared.
- GloVe: Introduced in [25], GloVe embeddings are trained on a vast Common Crawl dataset comprising 840 billion tokens, offering a rich and diverse semantic representation.
- M-Skip-Gram: As proposed in [17], this approach trains embeddings on Wikipedia and a subset of images from ImageNet, integrating both textual and visual information to enhance semantic understanding.
- PP-XXL: The most robust embeddings from [35], trained on 9 million phrase pairs from the PPDB (Paraphrase Database), providing a comprehensive coverage of linguistic variations.
- Restricted (R): The vocabulary is limited to that of the MS COCO dataset, ensuring compatibility with our training data.
- Non-Restricted (NR): The full vocabulary is utilized, allowing for broader applicability but introducing challenges with out-of-vocabulary (OOV) terms.
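The two vocabulary settings can be illustrated by filtering a pre-trained embedding table and measuring how many tokens of a test sentence fall outside it. This is a sketch under stated assumptions: the embedding table, the training vocabulary, and the helper names are hypothetical, and the exact restriction procedure used in the experiments is not described in the surviving text.

```python
from typing import Dict, Iterable, List, Set

Vector = List[float]


def restrict_vocabulary(embeddings: Dict[str, Vector],
                        allowed: Set[str]) -> Dict[str, Vector]:
    """Keep only the embedding rows whose word is in the allowed set."""
    return {w: vec for w, vec in embeddings.items() if w in allowed}


def oov_rate(tokens: Iterable[str], embeddings: Dict[str, Vector]) -> float:
    """Fraction of tokens that have no row in the embedding table."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    missing = sum(1 for t in token_list if t not in embeddings)
    return missing / len(token_list)


# Hypothetical pre-trained table and caption-derived vocabulary.
full_table = {
    "a": [0.1, 0.2],
    "dog": [0.3, 0.1],
    "runs": [0.0, 0.4],
    "quantum": [0.9, 0.9],
}
caption_vocab = {"a", "dog", "runs"}

restricted_table = restrict_vocabulary(full_table, caption_vocab)
sentence = ["a", "quantum", "dog", "runs"]
rate_nr = oov_rate(sentence, full_table)
rate_r = oov_rate(sentence, restricted_table)
```

Comparing the two rates on an evaluation set gives a quick sense of how much of a score difference between the R and NR settings could be attributable to vocabulary coverage rather than embedding quality.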
5. Ablation Studies
Algorithm 2: Protocol for Hyperparameter Ablation Study.
5.1. Impact of Text Encoders
5.2. Evaluation of Loss Functions
- Covariance: Measures the covariance between x and y, defined as $\mathrm{cov}(x, y) = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])]$.
- Pearson Correlation: Quantifies the linear relationship between x and y, defined as $\mathrm{cov}(x, y) / (\sigma_x \sigma_y)$, where $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y.
- Surrogate Kendall $\tau$: While Pearson correlation captures linear dependencies, Kendall's $\tau$ assesses rank-based dependencies. Because $\tau$ is non-differentiable, we employ a differentiable surrogate approximation governed by a scaling parameter [11].
- Rank Loss: A pairwise ranking loss function, closely following the definition in [16], which penalizes incorrect pairings based on their relative rankings.
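The difference between the exact and the surrogate rank statistic can be sketched numerically. The covariance below uses the population (1/n) normalization, and the surrogate replaces each sign function in Kendall's statistic with `tanh(alpha * difference)`. Both choices are assumptions made for illustration: the source's own formulas did not survive extraction, so this is one plausible smooth approximation rather than the confirmed definition from [11].

```python
import math
from itertools import combinations
from typing import Sequence


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Population covariance (1/n normalization, an assumed convention)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def sign(d: float) -> int:
    """Return -1, 0, or 1 according to the sign of d."""
    return (d > 0) - (d < 0)


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Exact Kendall statistic: mean agreement of pairwise orderings."""
    pairs = list(combinations(range(len(x)), 2))
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j]) for i, j in pairs)
    return total / len(pairs)


def surrogate_kendall(x: Sequence[float], y: Sequence[float],
                      alpha: float) -> float:
    """Smooth stand-in: tanh(alpha * d) replaces sign(d) (assumed form)."""
    pairs = list(combinations(range(len(x)), 2))
    total = sum(
        math.tanh(alpha * (x[i] - x[j])) * math.tanh(alpha * (y[i] - y[j]))
        for i, j in pairs
    )
    return total / len(pairs)


x = [0.1, 0.4, 0.35, 0.8]
y = [1.0, 3.0, 2.0, 5.0]
cov_xy = covariance(x, y)
tau = kendall_tau(x, y)
soft_small = surrogate_kendall(x, y, alpha=1.0)
soft_large = surrogate_kendall(x, y, alpha=100.0)
```

With a large scaling parameter the surrogate approaches the exact statistic, at the cost of vanishing gradients for well-separated pairs; a small value keeps gradients informative but makes the loss behave more like a magnitude-sensitive measure. To use any of these as a training loss, the value would be negated so that minimization increases agreement.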
5.3. Effect of Training Dataset
5.4. Sentence-Level vs Word-Level Embedding
6. Conclusion and Future Directions
References
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.
- El-Nouby, H., and Nguyen, T. Adversarial training for multi-modal embeddings. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 1232–1239, 2017.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., and others. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. [CrossRef]
- Jia, C., Mao, Y., Luo, J., and Wang, Y. Scaling up visual and vision-language representation learning with noisy text data. arXiv preprint arXiv:2111.13994, 2021.
- Wu, L., Su, H., and Yu, M. Multimodal embedding spaces: A survey. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1234–1245, 2020.
- K. Barnard, P. Duygulu, D. Forsyth, N. d. Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of machine learning research, 3(Feb):1107–1135, 2003.
- J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014. [CrossRef]
- D. Defreyne. Flickr. https://www.flickr.com/photos/denisdefreyne/1091487059, 2007. [Online; accessed 17-May-2017].
- F. Hill and A. Korhonen. Learning abstract concept embeddings from multi-modal data: Since you probably can’t see what i mean. In EMNLP, pages 255–265, 2014. [CrossRef]
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- W. Huang, K. L. Chan, H. Li, J. Lim, J. Liu, and T. Y. Wong. Content-based medical image retrieval with metric learning via rank correlation. In F. Wang, P. Yan, K. Suzuki, and D. Shen, editors, Machine Learning in Medical Imaging, First International Workshop, MLMI 2010, Held in Conjunction with MICCAI 2010, Beijing, China, September 20, 2010. Proceedings, volume 6357 of Lecture Notes in Computer Science, pages 18–25. Springer, 2010. [CrossRef]
- Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2407–2414. IEEE, 2011. [CrossRef]
- A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015. [CrossRef]
- Y. Kim. Convolutional neural networks for sentence classification. In Empirical Methods in Natural Language Processing, pages 1746–1751, 2014. [CrossRef]
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [CrossRef]
- R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014. [CrossRef]
- A. Lazaridou, N. T. Pham, and M. Baroni. Combining language and vision with a multimodal skip-gram model. CoRR, abs/1501.02598, 2015. [CrossRef]
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The stanford corenlp natural language processing toolkit. In ACL (System Demonstrations), pages 55–60, 2014. [CrossRef]
- J. Mao, J. Xu, K. Jing, and A. L. Yuille. Training and evaluating multimodal word embeddings with large-scale web annotated images. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 442–450. Curran Associates, Inc., 2016.
- T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Hlt-naacl, volume 13, pages 746–751, 2013.
- J. Moes. Flickr. https://www.flickr.com/photos/jeroenmoes/4265223393, 2010. [Online; accessed 17-May-2017].
- P. Nakov, T. Zesch, D. Cer, and D. Jurgens, editors. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics, Denver, Colorado, June 2015.
- V. Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS), 2011.
- J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014. [CrossRef]
- R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en. [CrossRef]
- F. Rosa. Flickr. https://www.flickr.com/photos/kairos_of_tyre/6318245758, 2011. [Online; accessed 17-May-2017].
- C. Shallue. Show and Tell: A Neural Image Caption Generator. https://github.com/tensorflow/models/tree/master/im2txt, 2016. [Online; accessed 10-May-2017].
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [CrossRef]
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016. [CrossRef]
- I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. [CrossRef]
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. CoRR, abs/1609.06647, 2016.
- L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013, 2016.
- J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198, 2015.
- Anson Bastos, Abhishek Nadgeri, Kuldeep Singh, Isaiah Onando Mulang, Saeedeh Shekarpour, Johannes Hoffart, and Manohar Kaul. 2021. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network. In Proceedings of the Web Conference 2021. 1673–1685. [CrossRef]
- Philipp Christmann, Rishiraj Saha Roy, Abdalghani Abujabal, Jyotsna Singh, and Gerhard Weikum. 2019. Look before You Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management CIKM. 729–738. [CrossRef]
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171–4186. [CrossRef]
- Endri Kacupaj, Kuldeep Singh, Maria Maleshkova, and Jens Lehmann. 2022. An Answer Verbalization Dataset for Conversational Question Answerings over Knowledge Graphs. arXiv preprint arXiv:2208.06734 (2022). [CrossRef]
- Magdalena Kaiser, Rishiraj Saha Roy, and Gerhard Weikum. 2021. Reinforcement Learning from Reformulations In Conversational Question Answering over Knowledge Graphs. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 459–469. [CrossRef]
- Yunshi Lan, Gaole He, Jinhao Jiang, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. 2021. A Survey on Complex Knowledge Base Question Answering: Methods, Challenges and Solutions. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. International Joint Conferences on Artificial Intelligence Organization, 4483–4491. Survey Track. [CrossRef]
- Yunshi Lan and Jing Jiang. 2021. Modeling transitions of focal entities for conversational knowledge base question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7871–7880. [CrossRef]
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. [CrossRef]
- Pierre Marion, Paweł Krzysztof Nowak, and Francesco Piccinno. 2021. Structured Context and High-Coverage Grammar for Conversational Question Answering over Knowledge Graphs. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2021).
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, may 2015. 10.1038/nature14539. URL http://dx.doi.org/10.1038/nature14539. [CrossRef]
- Li Deng and Dong Yu. Deep Learning: Methods and Applications. NOW Publishers, May 2014. URL https://www.microsoft.com/en-us/research/publication/deep-learning-methods-and-applications/.
- Eric Makita and Artem Lenskiy. A movie genre prediction based on Multivariate Bernoulli model and genre correlations. arXiv preprint arXiv:1604.08608, 2016. URL http://arxiv.org/abs/1604.08608.
- Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
- J Ngiam, A Khosla, and M Kim. Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011. URL http://ai.stanford.edu/~ang/papers/icml11-MultimodalDeepLearning.pdf.
- Deli Pei, Huaping Liu, Yulong Liu, and Fuchun Sun. Unsupervised multimodal feature learning for semantic image segmentation. In The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE, aug 2013. ISBN 978-1-4673-6129-3. 10.1109/IJCNN.2013.6706748. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6706748. [CrossRef]
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [CrossRef]
- Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-Shot Learning Through Cross-Modal Transfer. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 26, pp. 935–943. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5027-zero-shot-learning-through-cross-modal-transfer.pdf.
- Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, and Shuicheng Yan. Enhancing video-language representations with structural spatio-temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024a. [CrossRef]
- Hao Fei, Yafeng Ren, and Donghong Ji. Retrofitting structure-aware transformer language model for end tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2151–2161, 2020a.
- Shengqiong Wu, Hao Fei, Fei Li, Meishan Zhang, Yijiang Liu, Chong Teng, and Donghong Ji. Mastering the explicit opinion-role interaction: Syntax-aided neural transition system for unified opinion role labeling. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 11513–11521, 2022.
- Wenxuan Shi, Fei Li, Jingye Li, Hao Fei, and Donghong Ji. Effective token graph modeling using a novel labeling strategy for structured sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4232–4241, 2022. [CrossRef]
- Hao Fei, Yue Zhang, Yafeng Ren, and Donghong Ji. Latent emotion memory for multi-label emotion classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7692–7699, 2020b. [CrossRef]
- Fengqi Wang, Fei Li, Hao Fei, Jingye Li, Shengqiong Wu, Fangfang Su, Wenxuan Shi, Donghong Ji, and Bo Cai. Entity-centered cross-document relation extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9871–9881, 2022. [CrossRef]
- Ling Zhuang, Hao Fei, and Po Hu. Knowledge-enhanced event relation extraction via event ontology prompt. Inf. Fusion, 100:101919, 2023. [CrossRef]
- Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, 2018. [CrossRef]
- Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. arXiv preprint arXiv:2305.11719, 2023a. [CrossRef]
- Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357, 2024.
- Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, 2017. [CrossRef]
- Hao Fei, Shengqiong Wu, Jingye Li, Bobo Li, Fei Li, Libo Qin, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Lasuie: Unifying information extraction with latent adaptive structure-aware generative language model. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022, pages 15460–15475, 2022a. [CrossRef]
- Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Opinion word expansion and target extraction through double propagation. Computational linguistics, 37(1):9–27, 2011. [CrossRef]
- Hao Fei, Yafeng Ren, Yue Zhang, Donghong Ji, and Xiaohui Liang. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in Bioinformatics, 22(3), 2021. [CrossRef]
- Shengqiong Wu, Hao Fei, Wei Ji, and Tat-Seng Chua. Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2593–2608, 2023b. [CrossRef]
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016. [CrossRef]
- Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. Reasoning implicit sentiment with chain-of-thought prompting. arXiv preprint arXiv:2305.11255, 2023a. [CrossRef]
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423. [CrossRef]
- Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. CoRR, abs/2309.05519, 2023c.
- Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [CrossRef]
- Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the International Conference on Machine Learning, 2024b.
- Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari, et al. Agribot: agriculture-specific question answer system. IndiaRxiv, 2019. [CrossRef]
- Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024c. [CrossRef]
- Mihir Momaya, Anjnya Khanna, Jessica Sadavarte, and Manoj Sankhe. Krushi–the farmer chatbot. In 2021 International Conference on Communication information and Computing Technology (ICCICT), pages 1–6. IEEE, 2021.
- Hao Fei, Fei Li, Chenliang Li, Shengqiong Wu, Jingye Li, and Donghong Ji. Inheriting the wisdom of predecessors: A multiplex cascade framework for unified aspect-based sentiment analysis. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, pages 4096–4103, 2022b. [CrossRef]
- Shengqiong Wu, Hao Fei, Yafeng Ren, Donghong Ji, and Jingye Li. Learn from syntax: Improving pair-wise aspect and opinion terms extraction with rich syntactic knowledge. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 3957–3963, 2021. [CrossRef]
- Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua, Donghong Ji, and Fei Li. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5923–5934, 2023. [CrossRef]
- Hao Fei, Qian Liu, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5980–5994, 2023b. [CrossRef]
- Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. 2024d.
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.
- Abbott Chen and Chai Liu. Intelligent commerce facilitates education technology: The platform and chatbot for the taiwan agriculture service. International Journal of e-Education, e-Business, e-Management and e-Learning, 11:1–10, 01 2021. [CrossRef]
- Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards semantic equivalence of tokenization in multimodal llm. arXiv preprint arXiv:2406.05127, 2024. [CrossRef]
- Jingye Li, Kang Xu, Fei Li, Hao Fei, Yafeng Ren, and Donghong Ji. MRN: A locally and globally mention-based reasoning network for document-level relation extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1359–1370, 2021.
- Hao Fei, Shengqiong Wu, Yafeng Ren, and Meishan Zhang. Matching structure for dual learning. In Proceedings of the International Conference on Machine Learning, ICML, pages 6373–6391, 2022c.
- Hu Cao, Jingye Li, Fangfang Su, Fei Li, Hao Fei, Shengqiong Wu, Bobo Li, Liang Zhao, and Donghong Ji. OneEE: A one-stage framework for fast overlapping and nested event extraction. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1953–1964, 2022.
- Isakwisa Gaddy Tende, Kentaro Aburada, Hisaaki Yamaba, Tetsuro Katayama, and Naonobu Okazaki. Proposal for a crop protection information system for rural farmers in tanzania. Agronomy, 11(12):2411, 2021. [CrossRef]
- Hao Fei, Yafeng Ren, and Donghong Ji. Boundaries and edges rethinking: An end-to-end neural model for overlapping entity relation extraction. Information Processing & Management, 57(6):102311, 2020c. [CrossRef]
- Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10965–10973, 2022. [CrossRef]
- Mohit Jain, Pratyush Kumar, Ishita Bhansali, Q Vera Liao, Khai Truong, and Shwetak Patel. Farmchat: a conversational agent to answer farmer queries. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(4):1–22, 2018b. [CrossRef]
- Shengqiong Wu, Hao Fei, Hanwang Zhang, and Tat-Seng Chua. Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 79240–79259, 2023d.
- Hao Fei, Tat-Seng Chua, Chenliang Li, Donghong Ji, Meishan Zhang, and Yafeng Ren. On the robustness of aspect-based sentiment analysis: Rethinking model, data, and training. ACM Transactions on Information Systems, 41(2):50:1–50:32, 2023c. [CrossRef]
- Yu Zhao, Hao Fei, Yixin Cao, Bobo Li, Meishan Zhang, Jianguo Wei, Min Zhang, and Tat-Seng Chua. Constructing holistic spatio-temporal scene graph for video semantic role labeling. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5281–5291, 2023a. [CrossRef]
- Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14734–14751, 2023e.
- Hao Fei, Yafeng Ren, Yue Zhang, and Donghong Ji. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Transactions on Neural Networks and Learning Systems, 34(9):5544–5556, 2023d. [CrossRef]
- Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Generating visual spatial description via holistic 3D scene understanding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7960–7977, 2023b. [CrossRef]
| Model | images2014 | images2015 | COCO-Test | Pin-Test | avg2014 | avg2015 |
|---|---|---|---|---|---|---|
| Word2Vec | | | | | | |
| PureTextRNN | | | | | | |
| PinModelA | | | | | | |
| VTFN-RNN | | | | | | |
| VTFN-CNN | | | | | | |
| VTFN-BOW | | | | | | |

| Model | images2014 | images2015 | COCO-Test | Pin-Test |
|---|---|---|---|---|
| Glove (R) | | | | |
| Glove (NR) | | | | |
| PinModelA | | | | |
| M-Skip-Gram (R) | | | | |
| M-Skip-Gram (NR) | | | | |
| PP-XXL (R) | | | | |
| PP-XXL (NR) | | | | |
| Best SemEval | | | N/A | N/A |
| VTFN-BOW (our) | | | | |

| Encoder | images2014 | images2015 | COCO-Test | Pin-Test |
|---|---|---|---|---|
| RNN-GRU | | | | |
| RNN-LSTM | | | | |
| BOW-SUM | | | | |
| BOW-MEAN | | | | |

| Loss type | Avg score |
|---|---|
| Covariance | |
| Rank loss | |
| Pearson | |

| Train Dataset | Word2Vec | PinModelA | VTFN-RNN | VTFN-BOW |
|---|---|---|---|---|
| MS COCO | | | | |
| SBU | | | | |
| Pinterest5M | | | | |

| Model | images2014 | images2015 | COCO-Test | Pin-Test |
|---|---|---|---|---|
| Word-level | | | | |
| Sentence-level | | | | |

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

