Submitted: 24 October 2025
Posted: 27 October 2025
Abstract
Keywords:
1. Introduction

2. Related Work
2.1. Text-Based Visual Question Answering
2.2. Pointer Networks and Generative Decoding
2.3. Dynamic Neural Networks and Mixture-of-Experts Routing
2.4. Cross-Modal Fusion and Contextual Alignment

3. Methodology
3.1. Enhanced Visual and Textual Representations
3.1.1. Object-Level Feature Embedding
3.1.2. OCR Feature Representation
3.1.3. Question Encoding and Semantic Conditioning
3.2. Cross-Modal Fusion Transformer

3.3. Dual-Expert Answer Reasoning Module
3.3.1. Expert I: Classification-Based Prediction
3.3.2. Expert II: Dynamic Pointer Decoding
3.3.3. Expert Coordination via Gating Network
3.3.4. Joint Optimization Objective
3.4. Inference Strategy and Answer Decoding
4. Experiments
4.1. Benchmarks and Data Protocols
4.2. Evaluation Criteria
4.3. Training Setup and Hyperparameters
4.4. Main Results vs. State-of-the-Art
4.5. Ablation on OCR Feature Integration
4.6. Analysis of Dual-Branch Routing
4.7. Gating Behaviour and Calibration
4.8. Sensitivity to OCR Quality
4.9. Decoding Strategy and Answer Lengths
4.10. Computational Efficiency
| Model | #FLOPs (G) | Latency (ms) |
| VinVL-Base | 78.2 | 35.1 |
| Classifier-only (w/ OCR) | 80.6 | 36.4 |
| CMFN-Base (ours) | 82.1 | 38.2 |
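
To put the cost figures in the table above in perspective, the relative overhead over the VinVL-Base row can be worked out directly from the reported FLOPs and latency. The following is a minimal sketch of that arithmetic only; the names `COSTS`, `relative_overhead`, and `overhead_vs_baseline` are illustrative and are not taken from the paper's code.

```python
# Reported cost figures from the table above: (GFLOPs, latency in ms).
COSTS = {
    "VinVL-Base": (78.2, 35.1),
    "Classifier-only (w/ OCR)": (80.6, 36.4),
    "CMFN-Base (ours)": (82.1, 38.2),
}


def relative_overhead(value, baseline):
    """Return the increase of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


def overhead_vs_baseline(costs, baseline_name="VinVL-Base"):
    """Map each model to its (FLOPs %, latency %) overhead over the baseline."""
    base_flops, base_latency = costs[baseline_name]
    return {
        name: (
            round(relative_overhead(flops, base_flops), 1),
            round(relative_overhead(latency, base_latency), 1),
        )
        for name, (flops, latency) in costs.items()
    }


if __name__ == "__main__":
    for name, (d_flops, d_latency) in overhead_vs_baseline(COSTS).items():
        print(f"{name}: +{d_flops}% FLOPs, +{d_latency}% latency")
```

Under this calculation, CMFN-Base costs about 5.0% more FLOPs and 8.8% more latency than VinVL-Base, while the classifier-only variant costs about 3.1% and 3.7% more, respectively.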
4.11. Robustness to Noisy Text and Perturbations
4.12. Ablation of Architectural Components

4.13. Human Study and Qualitative Inspection
4.14. Limitations and Reproducibility Notes
4.15. Effect of OCR Feature
4.16. Effect of Dual Routing Prediction Module
4.17. Qualitative Analysis
5. Conclusion and Future Work

| Model (VQA v2.0) | Test-dev, Base | Test-dev, Large | Test-std, Base | Test-std, Large |
| UNITER | 72.27 | 73.24 | 72.46 | 73.40 |
| VILLA | 73.59 | 73.69 | 73.67 | 74.87 |
| ERNIE-ViL | 72.62 | 74.75 | 72.85 | 74.93 |
| Oscar | 73.16 | 73.61 | 73.44 | 73.82 |
| VinVL | 74.78 | 76.04 | 74.87 | 76.06 |
| CMFN (ours) | 75.62 | 76.88 | 75.71 | 76.95 |
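
The scores in the VQA v2.0 table above are presumably the benchmark's standard accuracy, in which a predicted answer earns full credit when at least three of the ten human annotators gave it. The sketch below shows the commonly quoted simplified form, min(matches / 3, 1). It is an assumption that this is the metric behind the table. The official evaluation script additionally normalizes answer strings and averages over annotator subsets, which this sketch omits. The function names are illustrative.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy for one question: min(#matching humans / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(predictions, annotations):
    """Average the per-question score over a dataset and return it in percent."""
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must have the same length")
    if not predictions:
        raise ValueError("need at least one question")
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, annotations)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical annotations: ten human answers for a single question.
    humans = ["red"] * 2 + ["blue"] * 8
    print(vqa_accuracy("blue", humans))   # full credit
    print(vqa_accuracy("red", humans))    # partial credit (two matches)
    print(vqa_accuracy("green", humans))  # no credit
```

Partial credit is what allows an answer given by only one or two annotators to contribute to the score without counting as fully correct.
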

| Method | OCR Feature | Dual Routing | Yes/No (Dev) | Yes/No (Std) | Number (Dev) | Number (Std) | Other (Dev) | Other (Std) | All (Dev) | All (Std) |
| VinVL | — | — | 90.62 | 90.73 | 56.28 | 55.63 | 65.45 | 65.47 | 74.78 | 74.87 |
| CMFN | | | 90.98 | 90.86 | 57.91 | 57.72 | 65.73 | 65.82 | 75.13 | 75.29 |
| CMFN | | | 90.86 | 91.02 | 60.62 | 59.51 | 66.27 | 66.18 | 75.62 | 75.71 |

| Model | Answer source: OCR token | Answer source: candidate set |
| VinVL | 22.08 | 76.28 |
| CMFN (ours) | 58.43 | 75.86 |

| Model | Gate Acc. ↑ | Gate F1 ↑ | ECE (%) ↓ |
| Classifier-only (w/ OCR) | – | – | 5.9 |
| CMFN (Dual) | 98.6 | 0.82 | 5.1 |
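
The ECE column above reports expected calibration error, which measures how far a model's stated confidence is from its empirical accuracy. A minimal sketch of the usual equal-width-bin estimator follows. The bin count of 10 and the half-open binning convention are assumptions, since the table does not state which variant was used, and `expected_calibration_error` is an illustrative name.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE, returned as a fraction in [0, 1].

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct: 1/0 (or True/False) flags saying whether each prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Per-bin accumulators: sample count, summed confidence, summed correctness.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0.0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Bin k covers [k / n_bins, (k + 1) / n_bins); 1.0 goes to the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += conf
        hit_sums[b] += float(hit)

    ece = 0.0
    for count, conf_sum, hit_sum in zip(counts, conf_sums, hit_sums):
        if count == 0:
            continue
        # |accuracy - mean confidence| in this bin, weighted by the bin's share.
        gap = abs(hit_sum / count - conf_sum / count)
        ece += (count / n) * gap
    return ece


if __name__ == "__main__":
    # Hypothetical predictions, not data from the paper.
    confs = [0.9, 0.9, 0.6, 0.6]
    hits = [1, 1, 1, 0]
    print(f"ECE = {100.0 * expected_calibration_error(confs, hits):.1f}%")
```

The function returns a fraction, so it has to be multiplied by 100 to be compared with a percentage column such as the one in the table.
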

| Model | Low | Med. | High |
| VinVL | 18.7 | 33.9 | 54.1 |
| CMFN (ours) | 41.6 | 52.7 | 63.9 |

| Method | ≤2 tokens | 3–4 tokens | ≥5 tokens |
| Greedy | 61.2 | 57.8 | 49.5 |
| Beam-5 | 61.9 | 59.6 | 52.3 |

| Model | Clean | CharFlip | Affine |
| VinVL | 74.8 | 70.9 | 69.4 |
| CMFN (ours) | 75.6 | 73.1 | 72.2 |

| Variant | Yes/No | Number | All |
| Full CMFN | 91.0 | 60.6 | 75.6 |
| w/o PHOC | 90.7 | 59.2 | 75.0 |
| w/o Appearance | 90.8 | 59.7 | 75.2 |
| w/o Spatial (bbox) | 90.6 | 58.8 | 74.9 |
| w/o | 90.9 | 59.6 | 75.1 |
| w/o | 90.7 | 59.4 | 75.0 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).