Preprint Article, Version 1 (preserved in Portico). This version is not peer-reviewed.

Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective

Version 1: Received: 12 April 2024 / Approved: 12 April 2024 / Online: 12 April 2024 (10:42:49 CEST)

How to cite: Tao, R.; Zhu, M.; Cao, H.; Ren, H. Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective. Preprints 2024, 2024040847. https://doi.org/10.20944/preprints202404.0847.v1

Abstract

The essence of cross-modal generation and retrieval modeling for image-text interactions lies in semantic consistency, for which cross-modal representation alignment is the foundational requirement. Our proposed method centers on the collaborative training of an image captioning model (referred to as the 'captioner') and a pair of contrastive encoders for image-text matching (referred to as the 'concoder'). This synergistic approach yields concurrent improvements in the concoder's retrieval performance, the captioner's generation quality, and cross-modal semantic coherence. We name the proposed method 'ReCap'. First, we initialize the concoder through knowledge distillation from a pre-trained model. During training, the output of the initialized concoder is fed into a residual attention network, which is trained jointly with the captioner to achieve semantic consistency. The concoder is then iteratively updated during the image-text momentum encoding phase using the residual attention network, creating a closed loop for resolving semantic consistency. After training, the concoder and captioner modules can be used independently for cross-modal retrieval and cross-modal text-image generation tasks. To substantiate the efficacy of the proposed method, we first conducted experiments on image-text retrieval and image captioning using the widely recognized MSCOCO benchmark. We then assessed the fine-grained cross-modal alignment performance of the concoder through an image classification task on the iNaturalist 2018 dataset; the achieved performance notably surpassed that of several state-of-the-art models, validating the proficiency of our method. Finally, motivated by the practical requirements of handling nature conservation image data, we annotated the iNaturalist 2018 dataset with captions and trained ReCap on this annotated data. Experimental results demonstrate that the proposed method maintains semantic consistency between cross-modal retrieval and image caption generation for species with similar visual features but distinct semantics. This constitutes a meaningful exploration of cross-modal semantic consistency representation and holds significant practical value for biological research.
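To make the training procedure outlined in the abstract more concrete, the following is a minimal, illustrative PyTorch-style sketch of one joint training step, assuming a contrastive concoder, a residual attention fusion block that conditions the captioner, and a momentum-encoder (EMA) update that closes the loop. All names (ToyConcoder, ResidualAttention, momentum_update), the placeholder captioner loss, and the hyperparameters are hypothetical and do not reflect the authors' actual implementation.

    # Hedged sketch of a ReCap-style joint training step; assumes PyTorch.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyConcoder(nn.Module):
        """Pair of contrastive projections producing aligned image/text embeddings."""
        def __init__(self, img_dim=512, txt_dim=512, emb_dim=256):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, emb_dim)
            self.txt_proj = nn.Linear(txt_dim, emb_dim)

        def forward(self, img_feat, txt_feat):
            z_i = F.normalize(self.img_proj(img_feat), dim=-1)
            z_t = F.normalize(self.txt_proj(txt_feat), dim=-1)
            return z_i, z_t

    class ResidualAttention(nn.Module):
        """Fuses concoder embeddings through attention with a residual connection."""
        def __init__(self, emb_dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(emb_dim, heads, batch_first=True)

        def forward(self, query, context):
            fused, _ = self.attn(query, context, context)
            return query + fused  # residual path

    def contrastive_loss(z_i, z_t, temperature=0.07):
        """Symmetric InfoNCE loss over matched image/text embeddings."""
        logits = z_i @ z_t.t() / temperature
        targets = torch.arange(z_i.size(0), device=z_i.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    @torch.no_grad()
    def momentum_update(online, momentum, m=0.995):
        """EMA update of the momentum concoder from the online concoder."""
        for p_o, p_m in zip(online.parameters(), momentum.parameters()):
            p_m.data.mul_(m).add_(p_o.data, alpha=1.0 - m)

    def train_step(concoder, momentum_concoder, res_attn, caption_loss_fn,
                   img_feat, txt_feat, optimizer):
        # Alignment loss from the contrastive concoder.
        z_i, z_t = concoder(img_feat, txt_feat)
        align_loss = contrastive_loss(z_i, z_t)
        # Concoder output conditions the captioner via residual attention;
        # caption_loss_fn stands in for the captioner's generation loss.
        fused = res_attn(z_i.unsqueeze(1), z_t.unsqueeze(1))
        loss = align_loss + caption_loss_fn(fused)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # In the full method the momentum concoder would also supply contrastive keys.
        momentum_update(concoder, momentum_concoder)
        return loss.item()

    # Hypothetical usage with random features standing in for real encoders.
    concoder, momentum_concoder = ToyConcoder(), ToyConcoder()
    momentum_concoder.load_state_dict(concoder.state_dict())
    res_attn = ResidualAttention()
    optimizer = torch.optim.AdamW(
        list(concoder.parameters()) + list(res_attn.parameters()), lr=1e-4)
    dummy_caption_loss = lambda fused: fused.pow(2).mean()  # placeholder only
    img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
    loss = train_step(concoder, momentum_concoder, res_attn, dummy_caption_loss,
                      img_feat, txt_feat, optimizer)

This sketch only shows how the alignment and captioning objectives can be optimized jointly while a momentum copy of the concoder is maintained; the paper's actual architecture, distillation initialization, and update schedule are described in the full text.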

Keywords

cross-modal; multi-task; image captioning; cross-modal retrieval; cross-modal alignment

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
