Lu, Q.; Chen, S.; Zhu, X. Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering. J. Imaging 2024, 10, 56.
Abstract
Language bias is a significant concern in Visual Question Answering (VQA): models tend to rely on spurious correlations between questions and answers when making predictions. This reliance prevents the models from generalizing effectively and degrades their performance. To address this bias, we propose a novel collaborative modality fusion de-biasing algorithm (CoD). In our approach, bias is treated as the model’s neglect of information from a particular modality during prediction. We employ a collaborative training approach to facilitate mutual modeling between different modalities, achieving efficient feature fusion and enabling the model to fully leverage multi-modal knowledge for prediction. Experiments on several datasets, including VQA-CP v2, VQA v2, and VQA-VS, under different validation strategies demonstrate the effectiveness of our approach. Notably, even with a basic baseline model, our approach achieves an accuracy of 60.14% on VQA-CP v2.
Keywords
visual question answering; collaborative learning; language bias
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.