Submitted: 18 February 2025
Posted: 20 February 2025
Abstract
Keywords:
1. Introduction
- Heterogeneous Modality Gap: The intrinsic differences between visual and linguistic representations make it challenging to establish direct correspondence between images and textual descriptions. Traditional approaches that enforce global embedding similarity between images and captions often fail to capture fine-grained semantic nuances.
- Noisy Pseudo-Labels: Existing pseudo-labeling techniques suffer from error propagation, where incorrect captions generated by an initially weak model degrade learning performance. A more reliable supervisory signal is needed to guide caption refinement without amplifying model biases.
- Lack of Structural Awareness: Current methods primarily focus on instance-level captioning without considering the relational structure of generated captions. Since real-world captions inherently encode relational knowledge (e.g., spatial relationships, object interactions), it is crucial to ensure that generated captions preserve these semantic relationships across different augmentations of the same image.
- Prediction Consistency as Soft Label Supervision: Instead of relying on rigid pseudo-labeling, we introduce a soft-labeling mechanism where semantic predictions extracted from raw image representations serve as supervision signals for caption generation. This technique enhances the reliability of self-supervision and mitigates the issue of noisy pseudo-labels.
- Relational Consistency for Structural Alignment: We introduce a novel relational consistency loss that ensures the semantic relationships between objects and concepts in generated captions align with those present in the visual domain. This approach improves the coherence and contextual relevance of generated captions.
- A Flexible and Scalable Framework: SCPRF can be easily integrated into existing captioning models, such as Transformer-based or CNN-RNN-based architectures, without requiring major modifications. Furthermore, our approach scales effectively to scenarios with varying degrees of supervision, making it suitable for practical deployment.
- Superior Performance on Benchmark Datasets: Our extensive experiments on the MS-COCO dataset demonstrate that SCPRF significantly outperforms state-of-the-art semi-supervised image captioning methods, achieving improvements of at least 12% in CIDEr-D score and showcasing its efficacy in non-parallel and weakly supervised scenarios.
2. Related Work
2.1. Advancements in Image Captioning
2.2. Innovations in Semi-Supervised Learning
3. Our Methodology
3.1. Preliminaries
3.2. The SCPRF Framework
- Encoder-Decoder Model: The image is first passed through an encoder E, typically a deep convolutional neural network, which extracts a dense representation of the visual content. The decoder D, often an LSTM or Transformer model, then translates this representation into a coherent sequence of words, forming the caption (a minimal code sketch of this pipeline is given after this list).
- Attention Mechanism: Within both E and D, attention mechanisms dynamically focus on different regions of the image and different segments of the generated text, respectively, to improve the relevance and accuracy of the caption generation.
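The sketch below illustrates the generic encoder-decoder pipeline described above: a ResNet-style CNN encoder producing region features, an additive attention module, and an LSTM decoder. It is a minimal illustration rather than the exact SCPRF implementation; the class name `SimpleCaptioner`, the feature dimensions, and the use of teacher forcing are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleCaptioner(nn.Module):
    """Minimal CNN encoder + attention-LSTM decoder (illustrative only)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        # Encoder E: ResNet-101 backbone, keeping the spatial feature map (no pooling/fc).
        backbone = models.resnet101(weights=None)  # no pretrained weights for brevity
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # Additive (Bahdanau-style) attention over spatial regions.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        # Decoder D: word embedding + LSTM cell + output projection.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.logits = nn.Linear(hidden_dim, vocab_size)

    def attend(self, feats, h):
        # feats: (B, R, feat_dim) region features; h: (B, hidden_dim) decoder state.
        scores = self.att_out(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)           # (B, R, 1) attention weights
        return (alpha * feats).sum(dim=1)              # context vector (B, feat_dim)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids used for teacher forcing.
        fmap = self.encoder(images)                    # (B, feat_dim, h, w)
        feats = fmap.flatten(2).transpose(1, 2)        # (B, R, feat_dim)
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        outputs = []
        for t in range(T - 1):
            ctx = self.attend(feats, h)
            x = torch.cat([self.embed(captions[:, t]), ctx], dim=1)
            h, c = self.lstm(x, (h, c))
            outputs.append(self.logits(h))
        return torch.stack(outputs, dim=1)             # (B, T-1, vocab_size)
```

Training this model with teacher forcing and a cross-entropy objective on paired image-caption data corresponds to the supervised generation loss discussed in Section 3.3.1.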
3.3. Supervised and Unsupervised Loss Components
3.3.1. Generation Loss
3.3.2. Prediction Loss
3.3.3. Unsupervised Learning via Consistency
- Predictive Consistency: Encourages the model to maintain consistent predictions across different augmentations or perturbations of the same image, enhancing robustness and reliability.
- Relational Consistency: Goes beyond individual consistency by ensuring that relational dynamics—such as the relative positions and interactions between objects within images—are preserved in the transition from visual to textual representation.
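To make these two consistency terms concrete, the following sketch shows one plausible realization: predictive consistency as a confidence-filtered KL divergence between softened semantic predictions from two augmented views of the same image, and relational consistency as matching of the pairwise similarity structure of visual and textual embeddings, in the spirit of relational knowledge distillation [16]. The exact loss forms used by SCPRF may differ; the temperature and threshold values here are assumptions.

```python
import torch
import torch.nn.functional as F

def predictive_consistency_loss(logits_a, logits_b, threshold=0.8, temperature=1.0):
    """Soft-label consistency between two augmented views of the same image.

    The softened prediction of view A acts as a soft label for view B;
    low-confidence samples (max probability below `threshold`) are masked out.
    """
    soft_a = F.softmax(logits_a.detach() / temperature, dim=-1)    # soft labels, no gradient
    mask = (soft_a.max(dim=-1).values >= threshold).float()         # confidence filter
    log_b = F.log_softmax(logits_b / temperature, dim=-1)
    per_sample = F.kl_div(log_b, soft_a, reduction="none").sum(dim=-1)
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)

def relational_consistency_loss(visual_emb, text_emb):
    """Match the pairwise similarity structure of visual and textual embeddings.

    visual_emb, text_emb: (B, D) embeddings of the same batch in the two modalities.
    """
    def pairwise_sim(x):
        x = F.normalize(x, dim=-1)
        return x @ x.t()                                             # (B, B) cosine similarities
    return F.mse_loss(pairwise_sim(text_emb), pairwise_sim(visual_emb).detach())
```

In a typical semi-supervised setup, these terms would be evaluated on images without paired captions and combined with the supervised losses using the weights introduced in Section 3.4.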
3.4. Comprehensive Loss Function
3.5. Implementation Details
4. Experiments
4.1. Dataset and Experimental Setup
- BLEU@N (B@N) [44]: Measures n-gram precision between generated captions and ground truth.
- METEOR (M) [45]: Considers synonym matching and stemming for improved alignment.
- ROUGE-L (R) [39]: Evaluates sequence overlap recall.
- CIDEr-D (C) [34]: Measures consensus-based n-gram similarity with human-written captions.
- SPICE (S) [46]: Uses scene graphs to assess semantic correctness.
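The metrics above are typically computed with the standard COCO caption evaluation toolkit. The snippet below is a minimal sketch assuming the `pycocoevalcap` package (and its Java dependencies for METEOR and SPICE) is installed; captions are assumed to be already lower-cased and tokenized, e.g., with the toolkit's `PTBTokenizer`.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

def evaluate_captions(ground_truth, generated):
    """ground_truth: {img_id: [reference captions]}, generated: {img_id: [one caption]}."""
    scorers = [
        (Bleu(4), ["B@1", "B@2", "B@3", "B@4"]),
        (Meteor(), "M"),
        (Rouge(), "R"),
        (Cider(), "C"),
        (Spice(), "S"),
    ]
    results = {}
    for scorer, names in scorers:
        score, _ = scorer.compute_score(ground_truth, generated)
        if isinstance(names, list):                 # BLEU returns one score per n-gram order
            results.update(dict(zip(names, score)))
        else:
            results[names] = score
    return results

if __name__ == "__main__":
    # Toy example with a single image id; real evaluation uses the full test split.
    gts = {1: ["a dog runs on the grass", "a brown dog running outside"]}
    res = {1: ["a dog running on grass"]}
    print(evaluate_captions(gts, res))
```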
4.2. Implementation Details
- Encoder: A ResNet-101 [31] CNN extracts image features, followed by an attention mechanism to focus on important regions.
- Decoder: A recurrent LSTM-based sequence generator converts the visual representation into textual descriptions.
- Shared Classifier: A three-layer fully connected network with hidden layers of size 1024, used for prediction consistency learning.
- Data Augmentation: Each image undergoes three augmentations using a random occlusion strategy.
- Optimizer: Adam [43], with the learning rate decayed by a factor of 0.8 every 3 epochs.
- Batch Size: 16 images per batch.
- Training Duration: The model is trained for 40 epochs on an NVIDIA TITAN X GPU.
- Hyperparameters: Loss function weights and the confidence threshold (a minimal training-setup sketch is given after this list).
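The sketch below assembles the training configuration listed above, reusing the `SimpleCaptioner` sketch from Section 3.2. The initial learning rate value, the dummy one-batch loader, and the use of torchvision's `RandomErasing` to realize the random occlusion strategy are assumptions; the schedule, batch size, and epoch count follow the settings above.

```python
import torch
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# Random occlusion augmentation; RandomErasing is one plausible realization (assumption).
# In practice this would be applied inside the dataset pipeline; it is only defined here.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
    transforms.RandomErasing(p=1.0, scale=(0.02, 0.2)),
])

# Dummy one-batch loader standing in for an MS-COCO DataLoader with batch size 16.
loader = [(torch.randn(16, 3, 224, 224), torch.randint(0, 10000, (16, 20)))]

model = SimpleCaptioner(vocab_size=10000)              # sketched in Section 3.2
optimizer = Adam(model.parameters(), lr=5e-4)          # initial LR value is a placeholder
scheduler = StepLR(optimizer, step_size=3, gamma=0.8)  # decay by 0.8 every 3 epochs

for epoch in range(40):                                # 40 training epochs
    for images, captions in loader:
        optimizer.zero_grad()
        logits = model(images, captions)               # (B, T-1, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               captions[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
    scheduler.step()
```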
4.3. Comparison with State-of-the-Art Methods
- Semi-Supervised Model: A3VSE [15].
4.4. Quantitative Performance Analysis
4.5. Ablation Study
4.6. Effect of Augmentations and Threshold
4.7. Generalization Across Captioning Models
Cross-Entropy Loss:

| Methods | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCST | 56.3 | 38.1 | 25.0 | 16.0 | 15.8 | 42.1 | 38.6 | 9.1 |
| GIC | 62.8 | 46.5 | 32.9 | 19.7 | 19.0 | 50.1 | 50.2 | 12.2 |
| SCST+SCPRF | 63.0 | 45.5 | 31.5 | 21.2 | 19.2 | 45.6 | 47.8 | 10.0 |
| GIC+SCPRF | 66.3 | 47.2 | 34.2 | 21.1 | 19.1 | 50.6 | 57.2 | 13.2 |

CIDEr-D Score Optimization:

| Methods | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCST | 58.9 | 39.2 | 25.1 | 16.1 | 16.8 | 42.5 | 43.2 | 9.6 |
| GIC | 64.4 | 46.6 | 31.8 | 20.5 | 18.8 | 47.6 | 55.4 | 12.3 |
| SCST+SCPRF | 66.1 | 47.7 | 33.5 | 22.5 | 20.2 | 47.6 | 48.3 | 10.5 |
| GIC+SCPRF | 66.6 | 47.5 | 34.5 | 21.6 | 19.7 | 47.9 | 58.4 | 13.4 |
Cross-Entropy Loss:

| Paired data | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10% | 67.9 | 49.2 | 34.5 | 23.0 | 21.0 | 49.3 | 71.2 | 14.4 |
| 40% | 66.4 | 48.4 | 33.8 | 23.1 | 22.5 | 49.2 | 72.3 | 15.3 |
| 70% | 68.0 | 50.1 | 35.2 | 24.1 | 22.6 | 50.0 | 73.8 | 15.7 |
| 100% | 68.5 | 50.7 | 35.5 | 24.6 | 22.7 | 50.1 | 77.1 | 16.0 |

CIDEr-D Score Optimization:

| Paired data | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10% | 68.2 | 50.5 | 25.1 | 23.5 | 22.1 | 50.3 | 73.5 | 14.7 |
| 40% | 68.9 | 49.9 | 35.1 | 23.8 | 22.5 | 50.5 | 75.1 | 15.7 |
| 70% | 69.1 | 51.0 | 36.1 | 24.5 | 22.6 | 50.6 | 76.0 | 16.0 |
| 100% | 69.6 | 51.5 | 36.4 | 25.2 | 23.0 | 50.6 | 78.2 | 16.5 |
4.8. Sensitivity to Parameters
5. Conclusions and Future Directions
5.1. Conclusions
5.2. Future Research Directions
References
- Baltrusaitis, T.; Ahuja, C.; Morency, L. Multimodal machine learning: A survey and taxonomy. IEEE TPAMI 2019, 41, 423–443. [Google Scholar] [CrossRef]
- Debie, E.S.; Rojas, R.F.; Fidock, J.; Barlow, M.; Kasmarik, K.; Anavatti, S.G.; Garratt, M.; Abbass, H.A. Multimodal fusion for objective assessment of cognitive workload: A review. IEEE Trans. Cybern. 2021, 51, 1542–1555. [Google Scholar] [CrossRef] [PubMed]
- Karpathy, A.; Li, F. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the CVPR; 2015; pp. 3128–3137. [Google Scholar] [CrossRef]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the ICML; 2015; pp. 2048–2057. [Google Scholar] [CrossRef]
- Sammani, F.; Elsayed, M. Look and modify: Modification networks for image captioning. In Proceedings of the BMVC; 2019; p. 75. [Google Scholar] [CrossRef]
- Bin, Y.; Yang, Y.; Shen, F.; Xie, N.; Shen, H.T.; Li, X. Describing video with attention-based bidirectional LSTM. IEEE Trans. Cybern. 2019, 49, 2631–2641. [Google Scholar] [CrossRef] [PubMed]
- Yang, Z.; Yuan, Y.; Wu, Y.; Cohen, W.W.; Salakhutdinov, R. Review networks for caption generation. In Proceedings of the NeurIPS; 2016; pp. 2361–2369. [Google Scholar] [CrossRef]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the CVPR; 2017; pp. 3242–3250. [Google Scholar] [CrossRef]
- Huang, L.; Wang, W.; Chen, J.; Wei, X. Attention on attention for image captioning. In Proceedings of the ICCV; 2019; pp. 4633–4642. [Google Scholar] [CrossRef]
- Hashimoto, T.B.; Guu, K.; Oren, Y.; Liang, P. A retrieve-and-edit framework for predicting structured outputs. In Proceedings of the NeurIPS; 2018; pp. 10073–10083. [Google Scholar] [CrossRef]
- Feng, Y.; Ma, L.; Liu, W.; Luo, J. Unsupervised image captioning. In Proceedings of the CVPR, Long Beach, CA; 2019; pp. 4125–4134. [Google Scholar] [CrossRef]
- Gu, J.; Joty, S.R.; Cai, J.; Zhao, H.; Yang, X.; Wang, G. Unpaired image captioning via scene graph alignments. In Proceedings of the ICCV, Seoul, Korea; 2019; pp. 10322–10331. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative adversarial nets. In Proceedings of the NeurIPS, Montreal, Canada; 2014; pp. 2672–2680. [Google Scholar]
- Mithun, N.C.; Panda, R.; Papalexakis, E.E.; Roy-Chowdhury, A.K. Webly supervised joint embedding for cross-modal image-text retrieval. In Proceedings of the ACMMM; 2018; pp. 1856–1864. [Google Scholar] [CrossRef]
- Huang, P.; Kang, G.; Liu, W.; Chang, X.; Hauptmann, A.G. Annotation efficient cross-modal retrieval with adversarial attentive alignment. In Proceedings of the ACMMM; 2019; pp. 1758–1767. [Google Scholar] [CrossRef]
- Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the CVPR; 2019; pp. 3967–3976. [Google Scholar] [CrossRef]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the CVPR; 2017; pp. 1179–1195. [Google Scholar] [CrossRef]
- Zhou, Y.; Wang, M.; Liu, D.; Hu, Z.; Zhang, H. More grounded image captioning by distilling image-text matching model. In Proceedings of the CVPR; 2020; pp. 4776–4785. [Google Scholar] [CrossRef]
- Yao, B.Z.; Yang, X.; Lin, L.; Lee, M.W.; Zhu, S.C. I2T: Image parsing to text description. Proceedings of the IEEE 2010, 98, 1485–1508. [Google Scholar] [CrossRef]
- Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the EMNLP; 2014; pp. 1724–1734. [Google Scholar] [CrossRef]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the CVPR; 2015; pp. 3156–3164. [Google Scholar] [CrossRef]
- Grandvalet, Y.; Bengio, Y. Semi-supervised learning by entropy minimization. In Proceedings of the NeurIPS; 2004; pp. 529–536. [Google Scholar]
- Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.E.; McGuinness, K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In Proceedings of the IJCNN; 2020; pp. 1–8. [Google Scholar] [CrossRef]
- Bachman, P.; Alsharif, O.; Precup, D. Learning with pseudo-ensembles. In Proceedings of the NeurIPS; 2014; pp. 3365–3373. [Google Scholar] [CrossRef]
- Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the NeurIPS, Long Beach, CA; 2017; pp. 1195–1204. [Google Scholar] [CrossRef]
- Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. In Proceedings of the ICLR, Toulon, France; 2017. [Google Scholar] [CrossRef]
- Xie, Q.; Dai, Z.; Hovy, E.H.; Luong, T.; Le, Q. Unsupervised data augmentation for consistency training. In Proceedings of the NeurIPS; 2020. [Google Scholar]
- Berthelot, D.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Sohn, K.; Zhang, H.; Raffel, C. ReMixMatch: Semi-supervised learning with distribution matching and augmentation anchoring. In Proceedings of the ICLR; 2020. [Google Scholar] [CrossRef]
- French, G.; Mackiewicz, M.; Fisher, M.H. Self-ensembling for visual domain adaptation. In Proceedings of the ICLR; 2018. [Google Scholar] [CrossRef]
- Sohn, K.; Berthelot, D.; Li, C.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv 2020, arXiv:2001.07685. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV; 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR, San Diego, CA; 2015. [Google Scholar] [CrossRef]
- Vedantam, R.; Zitnick, C.L.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the CVPR; 2015; pp. 4566–4575. [Google Scholar] [CrossRef]
- Ranzato, M.; Chopra, S.; Auli, M.; Zaremba, W. Sequence level training with recurrent neural networks. In Proceedings of the ICLR, San Juan, Puerto Rico; Bengio, Y., LeCun, Y., Eds.; 2016. [Google Scholar] [CrossRef]
- Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning augmentation policies from data. arXiv 2018, arXiv:1805.09501. [Google Scholar] [CrossRef]
- Lin, Y.; Wang, C.; Chang, C.; Sun, H. An efficient framework for counting pedestrians crossing a line using low-cost devices: the benefits of distilling the knowledge in a neural network. Multim. Tools Appl. 2021, 80, 4037–4051. [Google Scholar] [CrossRef]
- Matthews, P. A Short History of Structural Linguistics; Cambridge University Press: Cambridge, UK, 2001. [Google Scholar] [CrossRef]
- Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the ECCV; 2014; pp. 740–755. [Google Scholar] [CrossRef]
- Huang, L.; Wang, W.; Xia, Y.; Chen, J. Adaptively aligned image captioning via adaptive attention time. In Proceedings of the NeurIPS; 2019; pp. 8940–8949. [Google Scholar] [CrossRef]
- Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image captioning: Transforming objects into words. In Proceedings of the NeurIPS; 2019; pp. 11135–11145. [Google Scholar]
- Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. TPAMI 2017, 39, 664–676. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the ICLR; 2015. [Google Scholar] [CrossRef]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the ACL; 2002; pp. 311–318. [Google Scholar] [CrossRef]
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; 2005; pp. 65–72. [Google Scholar]
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: semantic propositional image caption evaluation. In Proceedings of the ECCV; 2016; pp. 382–398. [Google Scholar] [CrossRef]
- Bastos, A.; Nadgeri, A.; Singh, K.; Mulang, I.O.; Shekarpour, S.; Hoffart, J.; Kaul, M. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network. In Proceedings of the Web Conference 2021; pp. 1673–1685. [CrossRef]
- Christmann, P.; Saha Roy, R.; Abujabal, A.; Singh, J.; Weikum, G. Look before You Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management CIKM; 2019; pp. 729–738. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Kacupaj, E.; Singh, K.; Maleshkova, M.; Lehmann, J. An Answer Verbalization Dataset for Conversational Question Answerings over Knowledge Graphs. arXiv 2022, arXiv:2208.06734. [Google Scholar] [CrossRef]
- Kaiser, M.; Saha Roy, R.; Weikum, G. Reinforcement Learning from Reformulations In Conversational Question Answering over Knowledge Graphs. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; 2021; pp. 459–469. [Google Scholar] [CrossRef]
- Lan, Y.; He, G.; Jiang, J.; Jiang, J.; Zhao, W.X.; Wen, J.R. A Survey on Complex Knowledge Base Question Answering: Methods, Challenges and Solutions. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21; International Joint Conferences on Artificial Intelligence Organization, 2021; pp. 4483–4491, Survey Track. [Google Scholar] [CrossRef]
- Lan, Y.; Jiang, J. Modeling transitions of focal entities for conversational knowledge base question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); 2021. [Google Scholar] [CrossRef]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations 2019. [Google Scholar] [CrossRef]
- Marion, P.; Nowak, P.K.; Piccinno, F. Structured Context and High-Coverage Grammar for Conversational Question Answering over Knowledge Graphs. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2021. [Google Scholar] [CrossRef]
- Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 2010, 16, 345–379. [Google Scholar] [CrossRef]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Deng, L.; Yu, D. Deep Learning: Methods and Applications. NOW Publishers, May 2014. Available online: https://www.microsoft.com/en-us/research/publication/deep-learning-methods-and-applications/.
- Makita, E.; Lenskiy, A. A movie genre prediction based on Multivariate Bernoulli model and genre correlations. 2016. [Google Scholar]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Explain images with multimodal recurrent neural networks. arXiv 2014, arXiv:1410.1090. [Google Scholar] [CrossRef]
- Pei, D.; Liu, H.; Liu, Y.; Sun, F. Unsupervised multimodal feature learning for semantic image segmentation. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN); IEEE, 2013; pp. 1–6. ISBN 978-1-4673-6129-3. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
- Socher, R.; Ganjoo, M.; Manning, C.D.; Ng, A. Zero-Shot Learning Through Cross-Modal Transfer. In Advances in Neural Information Processing Systems; Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q., Eds.; Curran Associates, Inc., 2013; Volume 26, pp. 935–943. Available online: http://papers.nips.cc/paper/5027-zero-shot-learning-through-cross-modal-transfer.pdf.
- Fei, H.; Wu, S.; Zhang, M.; Zhang, M.; Chua, T.S.; Yan, S. Enhancing video-language representations with structural spatio-temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024. [Google Scholar] [CrossRef]
- Fei, H.; Ren, Y.; Ji, D. Retrofitting structure-aware transformer language model for end tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing; 2020; pp. 2151–2161. [Google Scholar] [CrossRef]
- Wu, S.; Fei, H.; Li, F.; Zhang, M.; Liu, Y.; Teng, C.; Ji, D. Mastering the explicit opinion-role interaction: Syntax-aided neural transition system for unified opinion role labeling. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence; 2022; pp. 11513–11521. [Google Scholar] [CrossRef]
- Shi, W.; Li, F.; Li, J.; Fei, H.; Ji, D. Effective token graph modeling using a novel labeling strategy for structured sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2022; pp. 4232–4241. [Google Scholar] [CrossRef]
- Fei, H.; Zhang, Y.; Ren, Y.; Ji, D. Latent emotion memory for multi-label emotion classification. In Proceedings of the AAAI Conference on Artificial Intelligence; 2020; pp. 7692–7699. [Google Scholar] [CrossRef]
- Wang, F.; Li, F.; Fei, H.; Li, J.; Wu, S.; Su, F.; Shi, W.; Ji, D.; Cai, B. Entity-centered cross-document relation extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; 2022; pp. 9871–9881. [Google Scholar] [CrossRef]
- Zhuang, L.; Fei, H.; Hu, P. Knowledge-enhanced event relation extraction via event ontology prompt. Inf. Fusion 2023, 100, 101919. [Google Scholar] [CrossRef]
- Yu, A.W.; Dohan, D.; Luong, M.T.; Zhao, R.; Chen, K.; Norouzi, M.; Le, Q.V. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv 2018, arXiv:1804.09541. [Google Scholar] [CrossRef]
- Wu, S.; Fei, H.; Cao, Y.; Bing, L.; Chua, T.-S. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. arXiv 2023, arXiv:2305.11719. [Google Scholar] [CrossRef]
- Xu, J.; Fei, H.; Pan, L.; Liu, Q.; Lee, M.-L.; Hsu, W. Faithful logical reasoning via symbolic chain-of-thought. arXiv 2024, arXiv:2405.18357. [Google Scholar] [CrossRef]
- Dunn, M.; Sagun, L.; Higgins, M.; Guney, V.U.; Cirik, V.; Cho, K. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv 2017, arXiv:1704.05179. [Google Scholar] [CrossRef]
- Fei, H.; Wu, S.; Li, J.; Li, B.; Li, F.; Qin, L.; Zhang, M.; Zhang, M.; Chua, T.S. Lasuie: Unifying information extraction with latent adaptive structure-aware generative language model. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022; 2022; pp. 15460–15475. [Google Scholar] [CrossRef]
- Qiu, G.; Liu, B.; Bu, J.; Chen, C. Opinion word expansion and target extraction through double propagation. Computational Linguistics 2011, 37, 9–27. [Google Scholar] [CrossRef]
- Fei, H.; Ren, Y.; Zhang, Y.; Ji, D.; Liang, X. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in Bioinformatics 2020, 22, 2021. [Google Scholar] [CrossRef] [PubMed]
- Wu, S.; Fei, H.; Ji, W.; Chua, T.-S. Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023; pp. 2593–2608. [Google Scholar] [CrossRef]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv 2016, arXiv:1606.05250. [Google Scholar] [CrossRef]
- Fei, H.; Li, F.; Li, B.; Ji, D. Encoder-decoder based unified semantic role labeling with label-aware syntax. In Proceedings of the AAAI conference on artificial intelligence; 2021; pp. 12794–12802. [Google Scholar] [CrossRef]
- Fei, H.; Wu, S.; Ren, Y.; Li, F.; Ji, D. Better combine them together! integrating syntactic constituency and dependency representations for semantic role labeling. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; 2021; pp. 549–559. [Google Scholar] [CrossRef]
- Fei, H.; Li, B.; Liu, Q.; Bing, L.; Li, F.; Chua, T.-S. Reasoning implicit sentiment with chain-of-thought prompting. arXiv 2023, arXiv:2305.11255. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019; Association for Computational Linguistics; pp. 4171–4186. [Google Scholar] [CrossRef]
- Wu, S.; Fei, H.; Qu, L.; Ji, W.; Chua, T.-S. NExT-GPT: Any-to-any multimodal LLM. arXiv 2023, arXiv:2309.05519. [Google Scholar] [CrossRef]
- Li, Q.; Han, Z.; Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence; 2018. [Google Scholar]
- Fei, H.; Wu, S.; Ji, W.; Zhang, H.; Zhang, M.; Lee, M.-L.; Hsu, W. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the International Conference on Machine Learning; 2024. [Google Scholar] [CrossRef]
- Jain, N.; Jain, P.; Kayal, P.; Sahit, J.; Pachpande, S.; Choudhari, J.; et al. Agribot: agriculture-specific question answer system. IndiaRxiv 2019. [Google Scholar] [CrossRef]
- Fei, H.; Wu, S.; Ji, W.; Zhang, H.; Chua, T.-S. Dysen-VDM: Empowering dynamics-aware text-to-video diffusion with LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2024; pp. 7641–7653. [Google Scholar] [CrossRef]
- Momaya, M.; Khanna, A.; Sadavarte, J.; Sankhe, M. Krushi–the farmer chatbot. In Proceedings of the 2021 International Conference on Communication information and Computing Technology (ICCICT); IEEE, 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Fei, H.; Li, F.; Li, C.; Wu, S.; Li, J.; Ji, D. Inheriting the wisdom of predecessors: A multiplex cascade framework for unified aspect-based sentiment analysis. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI; 2022; pp. 4096–4103. [Google Scholar]
- Wu, S.; Fei, H.; Ren, Y.; Ji, D.; Li, J. Learn from syntax: Improving pair-wise aspect and opinion terms extraction with rich syntactic knowledge. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence; 2021; pp. 3957–3963. [Google Scholar] [CrossRef]
- Li, B.; Fei, H.; Liao, L.; Zhao, Y.; Teng, C.; Chua, T.-S.; Ji, D.; Li, F. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In Proceedings of the 31st ACM International Conference on Multimedia, MM; 2023; pp. 5923–5934. [Google Scholar] [CrossRef]
- Fei, H.; Liu, Q.; Zhang, M.; Zhang, M.; Chua, T.-S. Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023; pp. 5980–5994. [Google Scholar] [CrossRef]
- Fei, H.; Wu, S.; Zhang, H.; Chua, T.-S.; Yan, S. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2024; 2024. [Google Scholar] [CrossRef]
- Arora, S.; Liang, Y.; Ma, T. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the ICLR; 2017. [Google Scholar]
- Chen, A.; Liu, C. Intelligent commerce facilitates education technology: The platform and chatbot for the taiwan agriculture service. International Journal of e-Education, e-Business, e-Management and e-Learning 2021, 11, 1–10. [Google Scholar] [CrossRef]
- Wu, S.; Fei, H.; Li, X.; Ji, J.; Zhang, H.; Chua, T.-S.; Yan, S. Towards semantic equivalence of tokenization in multimodal llm. arXiv 2024, arXiv:2406.05127. [Google Scholar] [CrossRef]
- Li, J.; Xu, K.; Li, F.; Fei, H.; Ren, Y.; Ji, D. MRN: A locally and globally mention-based reasoning network for document-level relation extraction. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; 2021; pp. 1359–1370. [Google Scholar] [CrossRef]
- Fei, H.; Wu, S.; Ren, Y.; Zhang, M. Matching structure for dual learning. In Proceedings of the International Conference on Machine Learning, ICML; 2022; pp. 6373–6391. [Google Scholar]
- Cao, H.; Li, J.; Su, F.; Li, F.; Fei, H.; Wu, S.; Li, B.; Zhao, L.; Ji, D. OneEE: A one-stage framework for fast overlapping and nested event extraction. In Proceedings of the 29th International Conference on Computational Linguistics; 2022; pp. 1953–1964. [Google Scholar]
- Tende, I.G.; Aburada, K.; Yamaba, H.; Katayama, T.; Okazaki, N. Proposal for a crop protection information system for rural farmers in tanzania. Agronomy 2021, 11, 2411. [Google Scholar] [CrossRef]
- Fei, H.; Ren, Y.; Ji, D. Boundaries and edges rethinking: An end-to-end neural model for overlapping entity relation extraction. Information Processing & Management 2020, 57, 102311. [Google Scholar] [CrossRef]
- Li, J.; Fei, H.; Liu, J.; Wu, S.; Zhang, M.; Teng, C.; Ji, D.; Li, F. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence; 2022; pp. 10965–10973. [Google Scholar] [CrossRef]
- Jain, M.; Kumar, P.; Bhansali, I.; Liao, Q.V.; Truong, K.; Patel, S. Farmchat: a conversational agent to answer farmer queries. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies; 2018; Volume 2, pp. 1–22. [Google Scholar] [CrossRef]
- Wu, S.; Fei, H.; Zhang, H.; Chua, T.-S. Imagine that! Abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. In Proceedings of the 37th International Conference on Neural Information Processing Systems; 2023; pp. 79240–79259. [Google Scholar]
- Fei, H.; Chua, T.-S.; Li, C.; Ji, D.; Zhang, M.; Ren, Y. On the robustness of aspect-based sentiment analysis: Rethinking model, data, and training. ACM Transactions on Information Systems 2023, 41, 50:1–50:32. [Google Scholar] [CrossRef]
- Zhao, Y.; Fei, H.; Cao, Y.; Li, B.; Zhang, M.; Wei, J.; Zhang, M.; Chua, T.-S. Constructing holistic spatio-temporal scene graph for video semantic role labeling. In Proceedings of the 31st ACM International Conference on Multimedia, MM; 2023; pp. 5291–5291. [Google Scholar] [CrossRef]
- Wu, S.; Fei, H.; Cao, Y.; Bing, L.; Chua, T.-S. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023; pp. 14734–14734. [Google Scholar] [CrossRef]
- Fei, H.; Ren, Y.; Zhang, Y.; Ji, D. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Transactions on Neural Networks and Learning Systems 2023, 34, 5544–5556. [Google Scholar] [CrossRef] [PubMed]
- Zhao, Y.; Fei, H.; Wei, J.; Zhang, M.; Zhang, M.; Chua, T.-S. Generating visual spatial description via holistic 3D scene understanding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2023; pp. 7960–7977. [Google Scholar] [CrossRef]
Cross-Entropy Loss:

| Methods | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCST | 55.9 | 37.3 | 24.5 | 15.6 | 15.4 | 41.2 | 37.2 | 8.8 |
| AoANet | 66.2 | 48.5 | 33.9 | 22.5 | 20.1 | 48.1 | 67.3 | 13.9 |
| AAT | 62.5 | 45.1 | 30.9 | 20.7 | 18.5 | 46.4 | 56.9 | 11.9 |
| ORT | 63.0 | 45.0 | 31.1 | 20.8 | 18.9 | 46.0 | 59.7 | 12.2 |
| GIC | 62.1 | 45.9 | 32.5 | 19.3 | 18.5 | 49.2 | 49.4 | 11.8 |
| Graph-align | - | - | - | - | - | - | - | - |
| UIC | - | - | - | - | - | - | - | - |
| A3VSE | 66.9 | 49.0 | 34.1 | 22.8 | 20.0 | 48.4 | 67.9 | 14.2 |
| AoANet+P | 66.6 | 49.1 | 34.6 | 23.6 | 21.8 | 48.5 | 70.2 | 14.6 |
| AoANet+C | 66.3 | 48.8 | 34.5 | 23.8 | 22.3 | 49.1 | 69.9 | 14.6 |
| SCPRF | 68.0 | 50.6 | 35.1 | 24.4 | 22.6 | 49.8 | 76.0 | 15.8 |

CIDEr-D Score Optimization:

| Methods | B@1 | B@2 | B@3 | B@4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCST | 58.7 | 38.2 | 24.3 | 15.6 | 16.5 | 42.0 | 42.1 | 9.3 |
| AoANet | 65.5 | 47.5 | 33.2 | 22.7 | 21.0 | 47.6 | 68.8 | 14.6 |
| AAT | 65.6 | 47.1 | 32.5 | 21.6 | 19.8 | 46.9 | 62.0 | 12.7 |
| ORT | 64.6 | 45.7 | 31.2 | 20.9 | 19.6 | 46.3 | 60.7 | 12.9 |
| GIC | 63.9 | 46.0 | 31.3 | 20.1 | 18.3 | 46.6 | 54.2 | 12.0 |
| Graph-align | 66.4 | 47.0 | 31.5 | 21.0 | 20.2 | 46.5 | 67.9 | 14.3 |
| UIC | 40.2 | 21.7 | 10.5 | 5.1 | 11.7 | 27.9 | 27.5 | 7.7 |
| A3VSE | 66.2 | 48.5 | 34.3 | 23.7 | 21.3 | 48.5 | 70.8 | 15.0 |
| AoANet+P | 66.5 | 49.0 | 35.2 | 23.7 | 21.2 | 49.4 | 72.7 | 15.4 |
| AoANet+C | 67.0 | 48.9 | 34.8 | 24.0 | 21.6 | 49.2 | 72.4 | 15.3 |
| SCPRF | 69.1 | 51.2 | 36.1 | 25.0 | 23.0 | 50.3 | 76.8 | 16.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).