Submitted:
17 August 2025
Posted:
19 August 2025
Abstract
Keywords:
1. Introduction
- The Structured Reasoning Generation (SRG) module uses a powerful MLLM (e.g., InternVL 1.5 [4]) as the reasoning generator. Rather than producing a single block of coherent reasoning text, the SRG module employs a multi-stage prompting strategy. For a given video and question, it guides the MLLM to progressively generate reasoning cues along four dimensions: event recognition and temporal localization, entity interaction and attributes, causal relationship inference, and intent and prediction. Each cue is output as a separate text snippet, providing the basis for subsequent fine-grained integration.
- The Dynamic Reasoning Integration (DRI) module is the core innovation of SRIN. It takes three inputs: video features from a video encoder, text features from a question encoder, and the structured reasoning snippets generated by the SRG module. Using a cross-modal attention mechanism and a gating network, the DRI module (i) encodes each reasoning snippet into an independent semantic representation, (ii) dynamically computes the relevance weight of each snippet with respect to the current question (e.g., a "causal" question prioritizes causal reasoning snippets, while a "temporal" question focuses on temporal ones), and (iii) adaptively fuses the weighted snippet representations into a compact, question-relevant "integrated reasoning representation." This representation is then combined with the original video and question features and fed into the answer prediction head of the main VideoQA model (e.g., BLIP-FlanT5 [5]) to enhance its reasoning capabilities.
- We propose SRIN, a novel framework that robustly integrates MLLM-generated reasoning into VideoQA models by guiding the generation of structured reasoning and dynamically fusing it.
- We introduce the Structured Reasoning Generation (SRG) module, which employs a multi-stage prompting strategy to elicit multi-dimensional, fine-grained reasoning cues from MLLMs.
- We develop the Dynamic Reasoning Integration (DRI) module, a key innovation that adaptively weights and fuses structured reasoning components based on question semantics, enhancing the model’s ability to utilize imperfect MLLM outputs effectively.
2. Related Work
2.1. Video Question Answering
2.2. Multimodal Large Language Models for Reasoning
3. Method
3.1. Structured Reasoning Generation (SRG) Module
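As a rough illustration of the multi-stage prompting strategy described for the SRG module, the sketch below issues one prompt per reasoning dimension and collects the answers as separate snippets. The prompt wording, the `mllm` callable interface, the `fake_mllm` stand-in, and the choice to feed earlier snippets into later prompts are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Callable, Dict

# The four reasoning dimensions named in the text.
# The instruction wording below is hypothetical.
STAGES = {
    "R1_event_temporal": "Describe the key events in the video and when they occur.",
    "R2_entity_attributes": "List the entities, their attributes, and their interactions.",
    "R3_causal": "Explain the causal relationships between the events.",
    "R4_intent_prediction": "State the likely intent of the actors and what may happen next.",
}


def generate_structured_reasoning(
    video_ref: str,
    question: str,
    mllm: Callable[[str, str], str],
) -> Dict[str, str]:
    """Query the MLLM once per stage and return one text snippet per stage."""
    snippets: Dict[str, str] = {}
    for name, instruction in STAGES.items():
        prompt = f"Question: {question}\n"
        if snippets:
            # Assumption: "progressive" generation means later stages
            # can see the snippets produced by earlier stages.
            earlier = "\n".join(f"[{k}] {v}" for k, v in snippets.items())
            prompt += f"Earlier reasoning:\n{earlier}\n"
        prompt += f"Task: {instruction}"
        snippets[name] = mllm(video_ref, prompt)
    return snippets


def fake_mllm(video_ref: str, prompt: str) -> str:
    """Stand-in for a real MLLM call; it only echoes the requested task."""
    task = prompt.rsplit("Task: ", 1)[1]
    return f"<answer to: {task}>"


snippets = generate_structured_reasoning(
    "video.mp4", "Why did the person fall?", fake_mllm
)
for name, text in snippets.items():
    print(name, "->", text)
```

Keeping the snippets in a dictionary keyed by stage preserves the separation between reasoning dimensions, which is what lets a downstream module weight them individually.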
3.2. Dynamic Reasoning Integration (DRI) Module
3.2.1. Encoding Structured Reasoning Snippets
3.2.2. Question-Guided Attention
3.2.3. Adaptive Fusion
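The three DRI steps named in the subsection titles above (snippet encoding, question-guided attention, adaptive fusion) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: snippet embeddings are taken as given (random vectors here), relevance is a scaled dot product with the question embedding, the gate is a single sigmoid layer over the concatenated question and fused vectors, and the final combination is plain concatenation. The names `integrate_reasoning` and `W_gate` are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_reasoning(video, question, snippets, W_gate):
    """Fuse K snippet embeddings into one question-relevant vector.

    video, question: lists of length d; snippets: K lists of length d;
    W_gate: d rows of length 2*d (hypothetical gating weights).
    """
    d = len(question)
    # Question-guided attention: one relevance weight per snippet.
    scores = [dot(question, s) / math.sqrt(d) for s in snippets]
    weights = softmax(scores)
    # Adaptive fusion: weighted sum of the snippet embeddings.
    fused = [sum(w * s[i] for w, s in zip(weights, snippets)) for i in range(d)]
    # Gating (assumed form): a sigmoid gate over [question; fused]
    # scales each dimension of the fused vector.
    gate_in = question + fused
    gate = [1.0 / (1.0 + math.exp(-dot(row, gate_in))) for row in W_gate]
    integrated = [g * f for g, f in zip(gate, fused)]
    # Assumed combination: concatenate video, question, and reasoning features.
    return weights, video + question + integrated


rng = random.Random(0)
d, k = 8, 4  # embedding size and number of snippets (R1..R4)


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


video, question = rand_vec(d), rand_vec(d)
snippets = [rand_vec(d) for _ in range(k)]
W_gate = [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(d)]

weights, combined = integrate_reasoning(video, question, snippets, W_gate)
print("relevance weights:", [round(w, 3) for w in weights])
print("combined feature length:", len(combined))
```

Because the weights come from a softmax, they are non-negative and sum to one, so a snippet that matches the question poorly is down-weighted rather than discarded outright.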
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Model Architectures
4.1.3. Training Details
4.2. Comparison with State-of-the-Art Methods
4.3. Ablation Study
4.4. Human Evaluation
4.5. Qualitative Analysis of Structured Reasoning Snippets
4.6. Impact of MLLM Backbone for Structured Reasoning Generation
4.7. Analysis of Dynamic Reasoning Integration Weights
4.8. Computational Efficiency and Inference Speed
5. Conclusion
References
- Cabalar, P.; Schaub, T. Temporal Logic Programs with Temporal Description Logic Axioms. In Description Logic, Theory Combination, and All That - Essays Dedicated to Franz Baader on the Occasion of His 60th Birthday. Springer, 2019, pp. 174–186.
- Zhang, D.; Yu, Y.; Dong, J.; Li, C.; Su, D.; Chu, C.; Yu, D. MM-LLMs: Recent Advances in MultiModal Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 2024, pp. 12401–12430.
- Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. In Proceedings of the Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 2024, pp. 15890–15902.
- Chen, Z.; Wang, W.; Tian, H.; Ye, S.; Gao, Z.; Cui, E.; Tong, W.; Hu, K.; Luo, J.; Ma, Z.; et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Sci. China Inf. Sci. 2024.
- Albuquerque, I.; Schrouff, J.; Warde-Farley, D.; Cemgil, A.T.; Gowal, S.; Wiles, O. Evaluating Model Bias Requires Characterizing its Mistakes. In Proceedings of the Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
- Lei, J.; Yu, L.; Bansal, M.; Berg, T.L. TVQA: Localized, Compositional Video Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, 2018, pp. 1369–1379.
- Jang, Y.; Song, Y.; Kim, C.D.; Yu, Y.; Kim, Y.; Kim, G. Video Question Answering with Spatio-Temporal Reasoning. Int. J. Comput. Vis. 2019, pp. 1385–1412.
- Yang, Z.; Garcia, N.; Chu, C.; Otani, M.; Nakashima, Y.; Takemura, H. BERT Representations for Video Question Answering. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020. IEEE, 2020, pp. 1545–1554.
- Han, J.; Kim, S.; Park, H. MELA: Multi-Event Localization Answering Framework for Video Question Answering. In Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, SAC 2025, Catania International Airport, Catania, Italy, 31 March 2025 - 4 April 2025. ACM, 2025, pp. 1282–1289.
- Zang, C.; Wang, H.; Pei, M.; Liang, W. Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 2023, pp. 19027–19036.
- Zhou, Y.; Shen, T.; Geng, X.; Long, G.; Jiang, D. ClarET: Pre-training a Correlation-Aware Context-To-Event Transformer for Event-Centric Generation and Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2559–2575.
- Zhou, Y.; Long, G. Multimodal Event Transformer for Image-guided Story Ending Generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 3434–3444.
- Zhou, Y.; Shen, T.; Geng, X.; Tao, C.; Shen, J.; Long, G.; Xu, C.; Jiang, D. Fine-grained distillation for long document retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 19732–19740.
- Wang, Q.; Hu, H.; Zhou, Y. Memorymamba: Memory-augmented state space model for defect recognition. arXiv preprint arXiv:2405.03673 2024.
- Li, J.; Wei, P.; Han, W.; Fan, L. IntentQA: Context-aware Video Intent Reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. IEEE, 2023, pp. 11929–11940.
- Yu, Z.; Xu, D.; Yu, J.; Yu, T.; Zhao, Z.; Zhuang, Y.; Tao, D. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 2019, pp. 9127–9134.
- Zhang, J.; Shao, J.; Cao, R.; Gao, L.; Xu, X.; Shen, H.T. Action-Centric Relation Transformer Network for Video Question Answering. IEEE Trans. Circuits Syst. Video Technol. 2022, pp. 63–74.
- Amini, M.H.; Mia, M.J.; Saadati, Y.; Imteaj, A.; Nabavirazavi, S.; Thakker, U.; Hossain, M.Z.; Fime, A.A.; Iyengar, S.S. Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions. CoRR 2025.
- Wang, Y.; Chen, W.; Han, X.; Lin, X.; Zhao, H.; Liu, Y.; Zhai, B.; Yuan, J.; You, Q.; Yang, H. Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning. CoRR 2024.
- Son, M.; Lee, S. Advancing Multimodal Large Language Models: Optimizing Prompt Engineering Strategies for Enhanced Performance. Applied Sciences 2025.
- Dong, Y.; Liu, Z.; Sun, H.; Yang, J.; Hu, W.; Rao, Y.; Liu, Z. Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 2025, pp. 9062–9072.
- Zhou, Y.; Song, L.; Shen, J. Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback. arXiv preprint arXiv:2501.01377 2025.
- Le, C.C.; Vinh, H.C.T.; Phan, H.N.; Le, D.D.; Nguyen, T.N.; Bui, N.D.Q. VisualCoder: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 - May 4, 2025. Association for Computational Linguistics, 2025, pp. 6628–6645.
- Zhou, Y.; Song, L.; Shen, J. MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration. arXiv preprint arXiv:2506.19835 2025.
- Yan, Q.; Fan, Y.; Li, H.; Jiang, S.; Zhao, Y.; Guan, X.; Kuo, C.; Wang, X.E. Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models. In Proceedings of the Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025. Association for Computational Linguistics, 2025, pp. 18829–18845.
| Model | NExT-QA Temporal | NExT-QA Causal | NExT-QA Descriptive | NExT-QA Total | STAR Interaction | STAR Sequence | STAR Prediction | STAR Feasibility | STAR Total |
|---|---|---|---|---|---|---|---|---|---|
| MotionEpic | 74.6 | 75.8 | 83.3 | 76.0 | 71.5 | 72.6 | 66.6 | 62.7 | 71.0 |
| LLaMA-VQA | 69.2 | 72.7 | 75.8 | 72.0 | 66.2 | 67.9 | 57.2 | 52.7 | 65.4 |
| VidF4 | 69.6 | 74.2 | 83.3 | 74.1 | 68.4 | 70.4 | 60.9 | 59.4 | 68.1 |
| ReasVQA | 75.1 | 76.2 | 83.5 | 76.4 | 71.8 | 72.9 | 66.9 | 63.0 | 71.3 |
| Ours (SRIN) | 75.9 | 76.8 | 83.9 | 77.1 | 72.5 | 73.4 | 67.5 | 63.6 | 72.0 |
| Model | Why | How | Tem. | Total |
|---|---|---|---|---|
| LVNet | 75.2 | 71.6 | 60.8 | 71.1 |
| CaVIR | 58.4 | 65.5 | 50.5 | 57.6 |
| BlindGPT | 52.2 | 61.3 | 43.4 | 51.6 |
| ReasVQA | 75.5 | 71.8 | 61.0 | 71.4 |
| Ours (SRIN) | 76.0 | 72.3 | 61.5 | 71.9 |
| Model Variant | Temporal | Causal | Descriptive | Total |
|---|---|---|---|---|
| BLIP-FlanT5 (Baseline) | 71.2 | 72.5 | 80.1 | 73.5 |
| SRIN (Flat Reasoning) | 73.5 | 74.0 | 81.2 | 74.9 |
| SRIN (Static Fusion) | 74.8 | 75.5 | 82.8 | 75.9 |
| Ours (SRIN) | 75.9 | 76.8 | 83.9 | 77.1 |
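To make the ablation easier to read, the short script below computes the gain in total accuracy of each variant over the baseline, plus the increment between consecutive rows. The numbers are copied from the table above; the script only does the arithmetic and makes no assumption about what each variant changes internally.

```python
# Total accuracy (%) from the ablation table, in row order.
ablation = [
    ("BLIP-FlanT5 (Baseline)", 73.5),
    ("SRIN (Flat Reasoning)", 74.9),
    ("SRIN (Static Fusion)", 75.9),
    ("Ours (SRIN)", 77.1),
]

baseline_acc = ablation[0][1]

# Gain of each variant over the baseline, in percentage points.
gains = {name: round(acc - baseline_acc, 1) for name, acc in ablation}

# Increment from each row to the next, in percentage points.
steps = [round(b[1] - a[1], 1) for a, b in zip(ablation, ablation[1:])]

for name, gain in gains.items():
    print(f"{name}: +{gain} points over baseline")
print("row-to-row increments:", steps)
```

On these figures the full model is 3.6 points above the baseline, with increments of 1.4, 1.0, and 1.2 points across the three successive variants.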
| Model | Correctness | Coherence | Completeness |
|---|---|---|---|
| ReasVQA | 3.85 | 3.70 | 3.60 |
| Ours (SRIN) | 4.10 | 4.05 | 3.95 |
| Video/Question ID | Question | Type | R1: Event Recognition and Temporal Localization | R2: Entity Interaction and Attributes | R3: Causal Relationship Inference | R4: Intent and Prediction |
|---|---|---|---|---|---|---|
| NExT-QA V1234 | Why did the person fall? | Causal | A person is walking, then slips and falls on the ground. The fall occurs suddenly. | A person is present, wearing a red shirt. The ground appears wet or icy. | The person fell because they lost their balance after slipping on the wet ground. | The person intended to walk forward. The fall was an accidental outcome. |
| STAR V5678 | What will the person do after picking up the ball? | Prediction | A person reaches down and picks up a basketball from the floor. | A person, a basketball, a basketball court. The person holds the ball. | Picking up the ball is a prerequisite for dribbling or shooting in basketball. | The person intends to play basketball. They will likely dribble, pass, or shoot the ball next. |
| NExT-QA V9012 | What is the color of the car? | Descriptive | A car is shown driving on a road. No significant events related to color change. | A car is visible. Its primary attribute is its blue color. | The car’s color is a static attribute, not a result of a causal chain within the video. | No direct intent or prediction is implied by the car’s color. |
| MLLM for SRG | NExT-QA (Total Acc.) | STAR (Total Acc.) | IntentQA (Total Acc.) |
|---|---|---|---|
| Generic MLLM B (e.g., MiniGPT4-v2) | 75.2 | 70.1 | 70.0 |
| Generic MLLM A (e.g., LLaVA-1.5) | 76.3 | 71.0 | 70.8 |
| InternVL v1.5 (Ours) | 77.1 | 72.0 | 71.9 |
| Question Type | Example Question | R1 Weight | R2 Weight | R3 Weight | R4 Weight | Key Insight |
|---|---|---|---|---|---|---|
| Causal | Why did the person fall? | 0.20 | 0.15 | 0.45 | 0.20 | High weight on Causal reasoning (R3) for "Why" questions. |
| Prediction | What will the person do next? | 0.25 | 0.10 | 0.15 | 0.50 | Strong emphasis on Intent/Prediction (R4) for future actions. |
| Descriptive | What color is the car? | 0.15 | 0.55 | 0.10 | 0.20 | Dominant weight on Entity/Attribute (R2) for descriptive queries. |
| Temporal | When did the action start? | 0.40 | 0.20 | 0.20 | 0.20 | Higher focus on Event/Temporal (R1) for temporal localization. |
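The weights in this table can be checked mechanically. The sketch below assumes that the four numeric columns correspond to the snippets R1 to R4 in order, as the Key Insight column suggests, and verifies that each row sums to one (consistent with normalized attention weights) and that the dominant snippet matches the question type.

```python
import math

# Rows copied from the table; column order assumed to be R1, R2, R3, R4.
weights = {
    "Causal": [0.20, 0.15, 0.45, 0.20],
    "Prediction": [0.25, 0.10, 0.15, 0.50],
    "Descriptive": [0.15, 0.55, 0.10, 0.20],
    "Temporal": [0.40, 0.20, 0.20, 0.20],
}
labels = ["R1", "R2", "R3", "R4"]

# Each row should behave like a probability distribution over snippets.
normalized = {q: math.isclose(sum(w), 1.0) for q, w in weights.items()}

# The snippet with the largest weight for each question type.
dominant = {q: labels[w.index(max(w))] for q, w in weights.items()}

for q in weights:
    print(f"{q}: sums to 1 = {normalized[q]}, dominant snippet = {dominant[q]}")
```

Every row sums to one, and the dominant snippets are R3 for causal, R4 for prediction, R2 for descriptive, and R1 for temporal questions, in line with the Key Insight column.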
| Model | Inference Time (ms/QA Pair) | Total Parameters (M) | GPU Memory (GB) |
|---|---|---|---|
| BLIP-FlanT5 (Baseline) | 180 | 3000 (FlanT5 3B) | 12 |
| ReasVQA | 210 | ∼3200 | 14 |
| Ours (SRIN) | 195 | ∼3050 | 13 |
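The relative cost of each method is easier to compare as overhead against the baseline. The script below derives the percentage increase in inference time and the extra GPU memory from the figures in the table; it adds no measurements of its own.

```python
# (inference time in ms per QA pair, GPU memory in GB), copied from the table.
costs = {
    "BLIP-FlanT5 (Baseline)": (180, 12),
    "ReasVQA": (210, 14),
    "Ours (SRIN)": (195, 13),
}

base_ms, base_gb = costs["BLIP-FlanT5 (Baseline)"]

# Relative latency overhead in percent, rounded to one decimal place.
latency_overhead = {
    name: round(100.0 * (ms - base_ms) / base_ms, 1)
    for name, (ms, _) in costs.items()
}

# Additional GPU memory in GB compared with the baseline.
extra_memory = {name: gb - base_gb for name, (_, gb) in costs.items()}

for name in costs:
    print(f"{name}: +{latency_overhead[name]}% latency, +{extra_memory[name]} GB")
```

By these figures SRIN adds about 8.3% latency and 1 GB of memory over the baseline, roughly half the overhead reported for ReasVQA (about 16.7% and 2 GB).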
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).