Submitted: 27 January 2025
Posted: 29 January 2025
Abstract
Multimodal reasoning tasks, which require integrating and processing diverse modalities such as vision and language, are critical for developing intelligent systems. In this paper, we propose AMCI-MLLM (Adaptive Multimodal Context Integration for Multimodal Large Language Models), a novel generative model that dynamically adjusts the contributions of different modalities based on task-specific queries. The core innovation of our method lies in a context-aware gating mechanism integrated within cross-modal attention layers, enabling fine-grained multimodal reasoning. To optimize learning, we introduce a two-stage training strategy: task-specific pretraining and adaptive fine-tuning with curriculum learning. Our experiments show that AMCI-MLLM achieves state-of-the-art performance on benchmarks such as VQAv2, TextVQA, and COCO Captions, outperforming existing models in accuracy, relevance, and fluency. Extensive analyses further highlight its scalability, robustness to noisy inputs, and enhanced interpretability. These findings showcase the potential of AMCI-MLLM to address key challenges in multimodal reasoning tasks and provide a robust framework for future research in this domain.
Keywords:
1. Introduction
- We identify key challenges in existing multimodal large language models, particularly their limitations in dynamically integrating task-relevant multimodal information, and propose a novel adaptive mechanism to address these challenges.
- We introduce AMCI-MLLM, a context-aware training framework with a dynamic gating mechanism that enhances cross-modal reasoning and enables efficient and robust learning from diverse multimodal datasets.
- We demonstrate the effectiveness of AMCI-MLLM through comprehensive experiments on multiple benchmarks, achieving state-of-the-art performance and highlighting its versatility in handling complex multimodal reasoning tasks.
2. Related Work
2.1. Multimodal Reasoning
2.2. Multimodal Large Language Models
3. Method
3.1. Model Architecture
1. Vision Encoder ($E_v$): A pretrained vision transformer (ViT) extracts high-level visual features from input images. The output features are represented as $\mathbf{V} = E_v(I) \in \mathbb{R}^{N_v \times d_v}$, where $I$ is the input image, $N_v$ is the number of visual tokens, and $d_v$ is the feature dimension.
2. Text Encoder ($E_t$): A large pretrained language model encodes input text sequences into token embeddings $\mathbf{T} = E_t(X) \in \mathbb{R}^{N_t \times d_t}$, where $X$ is the input text, $N_t$ is the number of textual tokens, and $d_t$ is the text embedding dimension.
3. Context-Aware Gating Mechanism: A dynamic mechanism that computes attention weights for visual and textual features, enabling task-dependent fusion.
3.2. Context-Aware Gating Mechanism
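The gating mechanism computes attention weights for visual and textual features so that their fusion depends on the task query. A minimal PyTorch sketch of one plausible realization is given below; the module name `ContextAwareGate`, the mean-pooled query context, the two-way sigmoid gate, and all dimensions are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class ContextAwareGate(nn.Module):
    """Illustrative context-aware gating sketch (assumed design, not the paper's exact module).

    A pooled query (text) context produces one gate value per modality; the gated
    visual and textual tokens are then fused with cross-modal attention.
    """

    def __init__(self, d_model: int = 768, num_heads: int = 8):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, 2)  # one gate value per modality
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, text_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, N_t, d_model); vis_tokens: (B, N_v, d_model)
        context = text_tokens.mean(dim=1)               # (B, d_model) pooled query context
        gates = torch.sigmoid(self.gate_proj(context))  # (B, 2), values in [0, 1]
        g_text = gates[:, 0].view(-1, 1, 1)
        g_vis = gates[:, 1].view(-1, 1, 1)

        gated_text = g_text * text_tokens               # task-dependent re-weighting
        gated_vis = g_vis * vis_tokens

        # Cross-modal attention: gated text queries attend to gated visual tokens.
        fused, _ = self.cross_attn(gated_text, gated_vis, gated_vis)
        return fused

# Toy usage with random features
gate = ContextAwareGate()
out = gate(torch.randn(2, 16, 768), torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```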
3.3. Training Strategy
3.3.1. Task-Specific Pretraining
Contrastive Loss
Reconstruction Loss
Classification Loss
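The three objectives named above are presumably combined into a single pretraining loss. One conventional form, written here only as an illustration (the weighting coefficients $\lambda_i$ and the exact definitions of the individual terms are assumptions, not the authors' formulation), is:

```latex
% Illustrative combined objective; \lambda_1..\lambda_3 are assumed weighting hyperparameters.
\mathcal{L}_{\text{pretrain}}
  = \lambda_1 \,\mathcal{L}_{\text{contrastive}}
  + \lambda_2 \,\mathcal{L}_{\text{reconstruction}}
  + \lambda_3 \,\mathcal{L}_{\text{classification}}
```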
3.3.2. Adaptive Fine-Tuning
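The abstract describes this stage as adaptive fine-tuning with curriculum learning. A minimal sketch of a common curriculum schedule follows, ordering examples by an assumed difficulty score and widening the training pool each epoch; the difficulty function and the linear schedule are illustrative choices, not the authors' procedure.

```python
from typing import Callable, List, Sequence

def curriculum_schedule(
    dataset: Sequence,
    difficulty: Callable[[object], float],
    num_epochs: int,
) -> List[List[object]]:
    """Sort examples from easy to hard and expose a growing fraction each epoch.
    This linear easy-to-hard schedule is an assumption for illustration only."""
    ordered = sorted(dataset, key=difficulty)
    pools = []
    for epoch in range(1, num_epochs + 1):
        cutoff = max(1, int(len(ordered) * epoch / num_epochs))
        pools.append(ordered[:cutoff])
    return pools

# Toy usage: difficulty approximated by question length, purely for illustration.
data = [
    {"question": "What color is the bus"},
    {"question": "What does the sign next to the leftmost door say"},
]
for epoch, pool in enumerate(curriculum_schedule(data, lambda ex: len(ex["question"]), 3), 1):
    print(f"epoch {epoch}: {len(pool)} examples")
```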
3.4. Inference
4. Experiments
4.1. Experimental Setup
- VQAv2: A visual question answering dataset requiring models to infer correct answers based on images and textual questions.
- TextVQA: A visual question answering dataset whose questions concern text appearing in images, requiring models to read and reason jointly over scene text and visual content.
- COCO Captions: An image captioning dataset aimed at generating descriptive and coherent captions for images.
- BLIP2: A vision-language model that connects a frozen vision encoder to a large language model through a lightweight querying transformer.
- InstructBLIP: An instruction-tuned variant of BLIP2 designed for multimodal understanding.
- LLaVA-1.5: A multimodal large language model that connects a vision encoder to the Vicuna language model and is fine-tuned on visual instruction data.
4.2. Comparison with Baselines
4.3. Ablation Study
4.4. Human Evaluation
4.5. Analysis and Insights
4.6. Multifaceted Analysis of AMCI-MLLM
4.6.1. Generalization to Unseen Tasks
4.6.2. Efficiency and Scalability
- Trainable Parameters: AMCI-MLLM fine-tunes only 5% of the total parameters, significantly less than the 10% required by LLaVA-1.5; a minimal parameter-freezing sketch follows this list.
- Training Time: Due to the parameter-efficient design, AMCI-MLLM requires 25% less time to fine-tune on the VQAv2 dataset compared to LLaVA-1.5.
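The 5% figure above corresponds to freezing the pretrained backbones and updating only a small set of modules. The sketch below shows the standard PyTorch pattern for this kind of selective freezing; the keyword filter ("gate") is a placeholder assumption, since the exact set of trainable modules is not specified above.

```python
import torch.nn as nn

def freeze_all_but(model: nn.Module, trainable_keywords=("gate",)) -> float:
    """Freeze every parameter except those whose name contains one of the given
    keywords, and return the trainable fraction. The keyword filter is a
    placeholder; the exact trainable modules of AMCI-MLLM are not specified here."""
    total, trainable = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / max(total, 1)

# Toy usage on a stand-in model with a frozen "backbone" and a trainable "gate".
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(512, 512)
        self.gate = nn.Linear(512, 2)

print(f"trainable fraction: {freeze_all_but(ToyModel()):.2%}")
```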
4.6.3. Robustness to Noisy Inputs
- Noisy Images: We downsample images to simulate low-quality inputs.
- Ambiguous Questions: We replace key question tokens with synonyms or less specific terms. Both perturbations are illustrated in the sketch following this list.
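Both perturbations can be approximated with simple preprocessing. The sketch below degrades images by downsampling and re-upsampling with Pillow and swaps question tokens using a hand-written synonym map; the downsampling factor and the synonym table are illustrative assumptions, not the perturbation parameters used in the experiments.

```python
from PIL import Image

def degrade_image(img: Image.Image, factor: int = 4) -> Image.Image:
    """Simulate a low-quality input by downsampling then upsampling.
    The factor of 4 is an assumed value, not the paper's setting."""
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)), Image.BILINEAR)
    return small.resize((w, h), Image.BILINEAR)

def make_ambiguous(question: str, synonyms: dict) -> str:
    """Replace key tokens with vaguer terms; the mapping is a toy example."""
    return " ".join(synonyms.get(tok.lower(), tok) for tok in question.split())

# Toy usage for the question perturbation
print(make_ambiguous("What color is the car", {"car": "vehicle", "color": "shade"}))
# -> "What shade is the vehicle"
```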
4.6.4. Alignment Across Modalities
5. Conclusions
| Model | VQAv2 (Accuracy) | TextVQA (ANLS) | COCO Captions (CIDEr) |
|---|---|---|---|
| BLIP2 | 73.5 | 36.0 | 121.4 |
| InstructBLIP | 75.7 | 38.1 | 125.6 |
| LLaVA-1.5 | 79.7 | 57.4 | 126.4 |
| AMCI-MLLM | 82.0 | 59.1 | 131.5 |
| Variant | VQAv2 (Accuracy) |
|---|---|
| AMCI-MLLM (Full Model) | 82.0 |
| w/o Context-Aware Gating | 78.5 |
| w/o Curriculum Learning | 79.1 |
| Baseline (No Pretraining) | 73.8 |
| Model | Relevance | Accuracy | Fluency |
|---|---|---|---|
| BLIP2 | 78.3 | 76.5 | 80.2 |
| InstructBLIP | 81.4 | 79.7 | 82.1 |
| LLaVA-1.5 | 85.2 | 83.9 | 85.7 |
| AMCI-MLLM | 89.5 | 87.6 | 90.3 |
| Model | Noisy Images (Accuracy) | Ambiguous Questions (Accuracy) |
|---|---|---|
| BLIP2 | 65.1 | 68.3 |
| InstructBLIP | 67.5 | 71.2 |
| LLaVA-1.5 | 70.8 | 74.1 |
| AMCI-MLLM | 74.6 | 77.5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).