Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

Version 1 : Received: 27 June 2023 / Approved: 28 June 2023 / Online: 29 June 2023 (08:11:38 CEST)

A peer-reviewed article of this Preprint also exists.

Zhang, X.; Li, W.; Wang, X.; Wang, L.; Zheng, F.; Wang, L.; Zhang, H. A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing. Remote Sens. 2023, 15, 4637.

Abstract

Remote sensing image-text cross-modal retrieval has attracted growing interest in recent years, driven by rapid advances in space information technology and the sharp increase in the volume of remote sensing image data. Multimodal fusion encoding has shown promising results for cross-modal retrieval of natural images. However, remote sensing images have characteristics that make retrieval more challenging. Their semantic features are fine-grained, meaning an image can be divided into multiple basic units of semantic expression, and the images vary widely in resolution, color, and perspective; different combinations of these basic units can give rise to diverse text descriptions. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) built on the multimodal fusion encoding method. The model is jointly trained with three tasks: image-text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC) task, which together strengthen its ability to capture fine-grained correlations between remote sensing images and texts. In particular, the MVJRC task improves the consistency of joint feature representations and fine-grained correlation for remote sensing images with large differences in resolution, color, and angle. Furthermore, to reduce the computational cost of large-scale fusion models and improve retrieval efficiency, this paper proposes a retrieval filtering method that achieves higher retrieval efficiency with minimal accuracy loss. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
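
The abstract describes joint training with three objectives (ITM, MLM, MVJRC). The sketch below is a minimal illustration of how such a multi-task loss could be combined in PyTorch, assuming a standard binary ITM head, a standard masked-token MLM head, and a contrastive (InfoNCE-style) formulation for MVJRC over two augmented views of the same image-text pair. The class name, loss weights, and temperature are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: combines the three training objectives named in the abstract.
# The loss heads, weights, and the contrastive form assumed for MVJRC are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskFusionLoss(nn.Module):
    def __init__(self, w_itm=1.0, w_mlm=1.0, w_mvjrc=1.0, temperature=0.07):
        super().__init__()
        self.w_itm, self.w_mlm, self.w_mvjrc = w_itm, w_mlm, w_mvjrc
        self.temperature = temperature

    def forward(self, itm_logits, itm_labels, mlm_logits, mlm_labels, z_view1, z_view2):
        # ITM: binary classification over fused image-text pairs (matched vs. mismatched).
        loss_itm = F.cross_entropy(itm_logits, itm_labels)

        # MLM: predict masked tokens from the fused representation; ignore unmasked positions.
        loss_mlm = F.cross_entropy(
            mlm_logits.view(-1, mlm_logits.size(-1)),
            mlm_labels.view(-1),
            ignore_index=-100,
        )

        # MVJRC (assumed contrastive form): pull joint representations of two views of the
        # same image-text pair together, push apart other pairs in the batch.
        z1 = F.normalize(z_view1, dim=-1)
        z2 = F.normalize(z_view2, dim=-1)
        logits = z1 @ z2.t() / self.temperature
        targets = torch.arange(z1.size(0), device=z1.device)
        loss_mvjrc = F.cross_entropy(logits, targets)

        return self.w_itm * loss_itm + self.w_mlm * loss_mlm + self.w_mvjrc * loss_mvjrc
```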

Keywords

cross-modal retrieval; remote sensing images; fusion encoding method; joint representation; contrastive learning

Subject

Environmental and Earth Sciences, Remote Sensing
