Preprint
Article

This version is not peer-reviewed.

VAMER: Visual-Anchored Multimodal Evidence Reasoning for Knowledge-Based VQA

Submitted: 25 May 2026

Posted: 25 May 2026

Abstract
Knowledge-Based Visual Question Answering (KB-VQA) relies on external knowledge for cross-modal scene understanding and reasoning. Existing methods still suffer from limited reasoning capability due to two major drawbacks: (1) the visual entity anchoring issue, where current methods fail to accurately anchor the visual entities mentioned in questions, leading to irrelevant knowledge retrieval and misleading reasoning; and (2) the visual-aware reasoning issue, where prior approaches rely too heavily on text-only reasoning and ignore visual cues, resulting in unreliable reasoning chains. To address these drawbacks, we propose VAMER, a Visual-Anchored Multimodal Evidence Reasoning framework with two components. First, for the visual entity anchoring issue, we introduce a Visual Entity Linking (VEL) module that uses the reasoning capability of a Visual-Language Model (VLM) to extract semantic and spatial information from questions; this information guides semantic-spatial contrastive learning for entity localization. Second, for the visual-aware reasoning issue, we propose a Multimodal Evidence Chain Reasoning (MECR) module that adopts a hierarchical two-phase approach, handling evidence chain construction and answer generation separately, which enables iterative integration of visual and textual information and improves reasoning reliability. Extensive experiments on the OK-VQA, A-OKVQA, and F-VQA datasets demonstrate the effectiveness of the proposed method for KB-VQA.
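The abstract does not specify the contrastive objective used in the VEL module. As a rough illustration only, the sketch below shows a generic InfoNCE-style loss that pulls a question-derived query embedding toward the embedding of the correct image region and pushes it away from the others. The function names, the toy two-dimensional embeddings, and the temperature value are all assumptions made for this example and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce_loss(query, regions, positive_index, temperature=0.07):
    """Generic InfoNCE loss for one query against candidate region embeddings.

    Hypothetical helper: the paper's actual loss may differ.
    """
    logits = [cosine(query, r) / temperature for r in regions]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of the positive region under a softmax.
    return log_denominator - logits[positive_index]

def localize(query, regions):
    """Return the index of the region most similar to the query."""
    scores = [cosine(query, r) for r in regions]
    return max(range(len(regions)), key=scores.__getitem__)

# Toy example: region 0 points in the same direction as the query.
query = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
best = localize(query, regions)
loss_correct = info_nce_loss(query, regions, positive_index=0)
loss_wrong = info_nce_loss(query, regions, positive_index=2)
```

In this toy setting the loss is small when the labelled positive is the region aligned with the query and large when the label points at the opposite region, which is the behaviour a contrastive localization objective of this kind relies on.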
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated