Preprint
Article

This version is not peer-reviewed.

Causal Dual-Interventional MedVQA via Textual Perturbation and Counterfactual Visual Verification

Submitted: 25 May 2026

Posted: 27 May 2026

Abstract
Medical Visual Question Answering (MedVQA) aims to answer medical questions about clinical images. However, current models often rely on spurious language shortcuts rather than visual evidence, which compromises clinical reliability. To address this, we propose a causal dual-interventional framework that mitigates language shortcuts in MedVQA. The method has two components: a textual de-confounding module and a counterfactual visual verifier. The textual de-confounding module applies concept-agnostic perturbations to disrupt linguistic shortcut biases and block backdoor pathways. It also aligns clinical terms with anatomical regions, compelling the model to establish genuine visual dependencies. The counterfactual visual verifier evaluates visual reliance by masking key regions and measuring the drop in prediction confidence under occlusion, thereby reducing language-driven artifacts. Extensive experiments on two public datasets demonstrate that our method significantly outperforms existing baselines.
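
The occlusion check described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `mask_region`, `confidence_drop`, `grounded_model`, and `shortcut_model` are hypothetical, the "models" are toy functions over a tiny grayscale grid, and the key region is assumed to be given as a box. The idea is only that a model which truly uses the image should lose confidence when the key region is masked, while a model answering from the question alone should not.

```python
import math
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]          # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), end-exclusive
Model = Callable[[Image, str], Sequence[float]]  # returns answer probabilities


def mask_region(image: Image, box: Box, fill: float = 0.0) -> Image:
    """Return a copy of `image` with the pixels inside `box` set to `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def confidence_drop(model: Model, image: Image, question: str, box: Box) -> float:
    """Confidence in the originally predicted answer, before minus after occlusion."""
    p_orig = model(image, question)
    answer = max(range(len(p_orig)), key=p_orig.__getitem__)
    p_masked = model(mask_region(image, box), question)
    return p_orig[answer] - p_masked[answer]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grounded_model(image: Image, question: str) -> List[float]:
    """Toy stand-in: the "yes" logit grows with mean image brightness."""
    mean = sum(map(sum, image)) / (len(image) * len(image[0]))
    return softmax([8.0 * mean, 1.0])   # [p(yes), p(no)]


def shortcut_model(image: Image, question: str) -> List[float]:
    """Toy stand-in that ignores the image and answers from the question alone."""
    return softmax([2.0, 0.0] if "abnormal" in question else [0.0, 2.0])


# Hypothetical 4x4 image with a bright 2x2 "lesion" in the centre.
image = [[0.0] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        image[r][c] = 1.0
box = (1, 1, 3, 3)
question = "Is the lung abnormal?"

print("grounded drop:", confidence_drop(grounded_model, image, question, box))
print("shortcut drop:", confidence_drop(shortcut_model, image, question, box))
```

A drop near zero under occlusion of the key region suggests the prediction is driven by the question text rather than the image. How the framework selects key regions and uses the measured drop is not specified in the abstract, so those parts are left out of the sketch.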
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated