Causal Dual-Interventional MedVQA via Textual Perturbation and Counterfactual Visual Verification

Jiuxiang You; Yi Yu; Zhenguo Yang

doi:10.20944/preprints202605.1802.v1

Submitted: 25 May 2026

Posted: 27 May 2026

Abstract

Medical Visual Question Answering (MedVQA) aims to answer medical questions from clinical images. However, current models often rely on spurious language shortcuts rather than visual evidence, which compromises clinical reliability. To address this, we propose a causal dual-interventional framework that mitigates language shortcuts in MedVQA. Our method comprises two components: a textual de-confounding module and a counterfactual visual verifier. The textual de-confounding module disrupts linguistic shortcut biases via concept-agnostic perturbations that block backdoor pathways. It also aligns clinical terms with anatomical regions, compelling the model to establish genuine visual dependencies. The counterfactual visual verifier evaluates visual reliance by masking key regions and measuring the drop in prediction confidence under occlusion, thereby reducing language-driven artifacts. Extensive experiments on two public datasets demonstrate that our method significantly outperforms existing baselines.
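The occlusion test behind the counterfactual visual verifier can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mask_region`, `confidence_drop`), the box format, the zero fill value, and both toy predictors are assumptions made purely for illustration. The idea is to mask a key region, re-run the model, and compare the confidence in the originally predicted answer; a drop near zero suggests the answer came from the question wording rather than from the image.

```python
import math
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive
Predictor = Callable[[Image, str], List[float]]  # hypothetical model interface


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_region(image: Image, box: Box, fill: float = 0.0) -> Image:
    """Return a copy of `image` with the pixels inside `box` set to `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else v
         for c, v in enumerate(row)]
        for r, row in enumerate(image)
    ]


def confidence_drop(predict: Predictor, image: Image, question: str,
                    box: Box, fill: float = 0.0) -> float:
    """Confidence in the originally predicted answer, before minus after masking."""
    probs = predict(image, question)
    answer = max(range(len(probs)), key=probs.__getitem__)
    masked_probs = predict(mask_region(image, box, fill), question)
    return probs[answer] - masked_probs[answer]


# Two toy predictors over the answers ["yes", "no"] (assumed for illustration).
def visual_model(image: Image, question: str) -> List[float]:
    # The "yes" logit grows with total pixel intensity, so it depends on the image.
    return softmax([sum(sum(row) for row in image), 1.0])


def shortcut_model(image: Image, question: str) -> List[float]:
    # Ignores the image entirely and answers from a keyword in the question.
    return softmax([2.0, 0.0]) if "opacity" in question else softmax([0.0, 2.0])


# A 4x4 image whose central 2x2 patch stands in for the key region.
image = [[1.0 if 1 <= r < 3 and 1 <= c < 3 else 0.0 for c in range(4)]
         for r in range(4)]
key_region = (1, 1, 3, 3)
question = "Is there an opacity in the left lung?"

visual_drop = confidence_drop(visual_model, image, question, key_region)
shortcut_drop = confidence_drop(shortcut_model, image, question, key_region)
print(f"visual model drop:   {visual_drop:.3f}")
print(f"shortcut model drop: {shortcut_drop:.3f}")
```

In this toy setting the image-dependent predictor loses most of its confidence once the key region is occluded, while the keyword-driven predictor is unaffected. How the key regions are chosen and how the drop is used during training are not specified in the abstract and are left out of the sketch.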

Keywords: medical visual question answering; concept-agnostic perturbations; clinical term grounding; counterfactual visual verifier

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
