Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

Cross-Modal Multistep Fusion Network with Co-Attention for Visual Question Answering

Version 1 : Received: 21 April 2018 / Approved: 24 April 2018 / Online: 24 April 2018 (09:09:45 CEST)

How to cite: Lao, M.; Guo, Y.; Wang, H.; Zhang, X. Cross-Modal Multistep Fusion Network with Co-Attention for Visual Question Answering. Preprints 2018, 2018040313. https://doi.org/10.20944/preprints201804.0313.v1

Abstract

Visual question answering (VQA) is receiving increasing attention from researchers in both the computer vision and natural language processing communities. There are two key components in the VQA task: feature extraction and multi-modal fusion. For feature extraction, we introduce a novel co-attention scheme that combines Sentence-guide Word Attention (SWA) and Question-guide Image Attention (QIA) in a unified framework. Specifically, the textual attention module SWA relies on the semantics of the whole question sentence to weight the contribution of each question word to the text representation. For multi-modal fusion, we propose a “Cross-modal Multistep Fusion (CMF)” network that generates multistep features and realizes multiple interactions between the two modalities, rather than modeling a single complex interaction between them as most current feature-fusion methods do. To avoid a linear increase in computational cost, the parameters are shared across all steps of the CMF. Extensive experiments demonstrate that the proposed method achieves performance competitive with or better than the state of the art.
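The following PyTorch sketch illustrates the two ideas the abstract describes: a sentence-guided attention over word features, and a multistep fusion cell whose parameters are reused at every step. It is a minimal illustrative assumption, not the authors' released code; the class names `SentenceGuidedWordAttention` and `CrossModalMultistepFusion`, the linear scoring/fusion layers, and the summation over step outputs are all hypothetical choices made only to show the shared-parameter multistep pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SentenceGuidedWordAttention(nn.Module):
    """Hypothetical SWA sketch: the whole-sentence embedding scores how much
    each question word contributes to the final text representation."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, word_feats, sent_feat):
        # word_feats: (B, T, D), sent_feat: (B, D)
        sent_exp = sent_feat.unsqueeze(1).expand_as(word_feats)
        logits = self.score(torch.cat([word_feats, sent_exp], dim=-1)).squeeze(-1)  # (B, T)
        alpha = F.softmax(logits, dim=-1)                                           # word weights
        return torch.bmm(alpha.unsqueeze(1), word_feats).squeeze(1)                 # (B, D)


class CrossModalMultistepFusion(nn.Module):
    """Hypothetical CMF sketch: one fusion cell with shared parameters is
    applied for several steps, and the per-step fused features are combined."""

    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.cell = nn.Linear(2 * dim, dim)  # same weights reused at every step

    def forward(self, q, v):
        # q, v: (B, D) question / image features
        fused_steps, h = [], q
        for _ in range(self.steps):
            h = torch.tanh(self.cell(torch.cat([h, v], dim=-1)))  # one more interaction
            fused_steps.append(h)
        return torch.stack(fused_steps, dim=1).sum(dim=1)         # multistep feature


if __name__ == "__main__":
    B, T, D = 2, 14, 512
    words, sent, img = torch.randn(B, T, D), torch.randn(B, D), torch.randn(B, D)
    q = SentenceGuidedWordAttention(D)(words, sent)
    fused = CrossModalMultistepFusion(D)(q, img)
    print(fused.shape)  # torch.Size([2, 512])
```

Because the fusion cell's weights are shared, running more fusion steps adds interactions between the two modalities without growing the parameter count, which is the cost argument made in the abstract.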

Keywords

visual question answering; cross-modal multistep fusion network; attention mechanism

Subject

Computer Science and Mathematics, Computer Networks and Communications
