Submitted: 12 April 2024
Posted: 12 April 2024
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Cross-Modal Alignment
2.2. Cross-Modal Fusion
3. Materials and Methods
3.1. Contrastive Encoder Initialization
3.2. Residual Attention Neural Network
3.3. Captioner Training Objectives
3.4. Offline Cross-Module Information Propagation Based on Momentum Encoder
3.4.1. Momentum Encoder
3.4.2. Image-Text Contrastive Loss Function
3.4.3. Offline Cross-Module Information Propagation
3.4.4. Why Contrastive Learning and Momentum Encoding
4. Experiments
4.1. Dataset Settings
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Evaluation on the MSCOCO Dataset
4.5. Evaluation on the iNaturalist Dataset
4.6. Qualitative Evaluation
4.7. Ablation Study
5. Conclusion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Matin, M.; Shrestha, T.; Chitale, V.; Thomas, S. Exploring the potential of deep learning for classifying camera trap data of wildlife: a case study from Nepal. In Proceedings of the AGU Fall Meeting Abstracts, 2021, pp. GC45I–0923.
- Norouzzadeh, M.S.; Nguyen, A.; Kosmala, M.; Swanson, A.; Palmer, M.S.; Packer, C.; Clune, J. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences 2018, 115, E5716–E5725. [CrossRef]
- Zett, T.; Stratford, K.J.; Weise, F. Inter-observer variance and agreement of wildlife information extracted from camera trap images. Biodiversity and Conservation 2022, 31, 3019–3037. [CrossRef]
- Swanson, A.; Kosmala, M.; Lintott, C.; Simpson, R.; Smith, A.; Packer, C. Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Scientific data 2015, 2, 1–14. [CrossRef]
- McShea, W.J.; Forrester, T.; Costello, R.; He, Z.; Kays, R. Volunteer-run cameras as distributed sensors for macrosystem mammal research. Landscape Ecology 2016, 31, 55–66. [CrossRef]
- Edwards, S.; Portas, R.; Hanssen, L.; Beytell, P.; Melzheimer, J.; Stratford, K. The spotted ghost: density and distribution of serval Leptailurus serval in Namibia. African Journal of Ecology 2018, 56, 831–840. [CrossRef]
- Stratford, K.; Stratford, S.; Périquet, S. Dyadic associations reveal clan size and social network structure in the fission–fusion society of spotted hyaenas. African Journal of Ecology 2020, 58, 182–192. [CrossRef]
- Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747 2020. [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning. Proceedings of Machine Learning Research, 2021, pp. 8748–8763.
- Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; Li, T. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing 2022, 508, 293–304. [CrossRef]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International conference on machine learning. Proceedings of Machine Learning Research, 2021, pp. 4904–4916.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. CVPR 2020, 2020, pp. 9729–9738.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- Li, J.; Zhou, P.; Xiong, C.; Hoi, S.C. Prototypical Contrastive Learning of Unsupervised Representation. In Proceedings of the International Conference on Learning Representations. ICLR2021, 2021.
- Li, J.; Xiong, C.; Hoi, S. MoPro: Webly Supervised Learning with Momentum Prototypes. In Proceedings of the International Conference on Learning Representations. ICLR2021, 2021.
- Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, 2020, pp. 104–120.
- Xu, X.; Wang, T.; Yang, Y.; Zuo, L.; Shen, F.; Shen, H.T. Cross-modal attention with semantic consistence for image–text matching. IEEE transactions on neural networks and learning systems 2020, 31, 5412–5425. [CrossRef]
- Diao, H.; Zhang, Y.; Ma, L.; Lu, H. Similarity reasoning and filtration for image-text matching. In Proceedings of the AAAI conference on artificial intelligence. AAAI, 2021, pp. 1218–1226.
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 2020, pp. 121–137.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 1137–1149. [CrossRef]
- Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv:2104.13921 2021. [CrossRef]
- Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 2022. [CrossRef]
- Xu, J.; De Mello, S.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR 2022, 2022, pp. 18134–18144.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I.; Luxburg, U.V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R., Eds. Curran Associates, Inc., 2017.
- Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 2018.
- Bao, H.; Wang, W.; Dong, L.; Wei, F. Vl-beit: Generative vision-language pretraining. arXiv preprint arXiv:2206.01127 2022.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020.
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR 2022, 2022, pp. 16000–16009.
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.
- Wang, W.; Bao, H.; Dong, L.; Bjorck, J.; Peng, Z.; Liu, Q.; Aggarwal, K.; Mohammed, O.K.; Singhal, S.; Som, S.; et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442 2022.
- Li, Y.; Fan, H.; Hu, R.; Feichtenhofer, C.; He, K. Scaling Language-Image Pre-training via Masking. arXiv preprint arXiv:2212.00794 2022.
- Bao, H.; Wang, W.; Dong, L.; Liu, Q.; Mohammed, O.K.; Aggarwal, K.; Som, S.; Piao, S.; Wei, F. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds. Curran Associates, Inc., 2022, pp. 32897–32912.
- Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 2022.
- Wang, Z.; Yu, J.; Yu, A.W.; Dai, Z.; Tsvetkov, Y.; Cao, Y. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904 2021.
- Li, C.; Xu, H.; Tian, J.; Wang, W.; Yan, M.; Bi, B.; Ye, J.; Chen, H.; Xu, G.; Cao, Z.; et al. mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. arXiv preprint arXiv:2205.12005 2022.
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 2018.
- Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision – ECCV 2014; Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T., Eds. Springer International Publishing, 2014, pp. 740–755.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- Denkowski, M.; Lavie, A. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, 2014, pp. 376–380.
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition. CVPR 2015, 2015, pp. 4566–4575.
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, 2016, pp. 382–398.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, 2020, pp. 13041–13049.
- Mokady, R.; Hertz, A.; Bermano, A.H. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734 2021.
- Dou, Z.Y.; Xu, Y.; Gan, Z.; Wang, J.; Wang, S.; Wang, L.; Zhu, C.; Zhang, P.; Yuan, L.; Peng, N.; et al. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18166–18176.
- Cheng, M.; Sun, Y.; Wang, L.; Zhu, X.; Yao, K.; Chen, J.; Song, G.; Han, J.; Liu, J.; Ding, E.; et al. ViSTA: vision and scene text aggregation for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5184–5193.
- Messina, N.; Stefanini, M.; Cornia, M.; Baraldi, L.; Falchi, F.; Amato, G.; Cucchiara, R. ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing, 2022, pp. 64–70.
- Diao, Q.; Jiang, Y.; Wen, B.; Sun, J.; Yuan, Z. Metaformer: A unified meta framework for fine-grained recognition. arXiv preprint arXiv:2203.02751 2022.
- Girdhar, R.; Singh, M.; Ravi, N.; van der Maaten, L.; Joulin, A.; Misra, I. Omnivore: A single model for many visual modalities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16102–16112.
- Touvron, H.; Sablayrolles, A.; Douze, M.; Cord, M.; Jégou, H. Grafit: Learning fine-grained image representations with coarse labels. In Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 874–884.
- Tian, C.; Wang, W.; Zhu, X.; Dai, J.; Qiao, Y. Vl-ltr: Learning class-wise visual-linguistic representation for long-tailed visual recognition. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV. Springer, 2022, pp. 73–91.
- Gesmundo, A. A Continual Development Methodology for Large-scale Multitask Dynamic ML Systems. arXiv preprint arXiv:2209.07326 2022.
- Liu, J.; Huang, X.; Liu, Y.; Li, H. Mixmim: Mixed and masked image modeling for efficient visual representation learning. arXiv preprint arXiv:2205.13137 2022.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 10347–10357.
- Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. ICCV, 2021, pp. 579–588.
- Cui, J.; Zhong, Z.; Tian, Z.; Liu, S.; Yu, B.; Jia, J. Generalized Parametric Contrastive Learning. arXiv preprint arXiv:2209.12400 2022.









| Images | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
| Captions | Two geese are walking on the shore of a pond. | A bunch of yellow flowers are sitting in a field. | A Catasticta nimbice is sitting on an Ageratum houstonianum in the sun. | An Aepyceros melampus grazing in a field. |

| Method | BLEU-4 (B4) | CIDEr (C) | METEOR (M) | SPICE (S) |
|---|---|---|---|---|
| Oscar[19] | 36.6 | 124.1 | 30.4 | 23.2 |
| BUTD[44] | 36.2 | 113.5 | 27.0 | 20.3 |
| UnifiedVLP[45] | 33.53 | 113.1 | 27.5 | 21.1 |
| ClipCap[46] | 33.5 | 113.1 | 27.5 | 21.1 |
| ReCap | 39.8 | 126.7 | 31.6 | 24.4 |

| Method | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 |
|---|---|---|---|---|---|---|
| Oscar[19] | 57.5 | 82.8 | 89.8 | 73.5 | 92.2 | 96.0 |
| METER[47] | 57.1 | 82.7 | 90.1 | 76.2 | 93.2 | 96.8 |
| ViSTA[48] | 52.6 | 79.6 | 87.6 | 68.9 | 90.1 | 95.4 |
| ALADIN[49] | 51.3 | 79.2 | 87.5 | 64.9 | 88.6 | 94.5 |
| ReCap | 65.5 | 89.2 | 92.9 | 77.1 | 92.6 | 96.3 |
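
The R@K columns above report recall at rank K: the percentage of queries whose correct match appears among the K highest-scoring candidates. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes a square similarity matrix in which the correct match for query `i` is candidate `i`, which simplifies MSCOCO's several-captions-per-image pairing. The function name `recall_at_k` and the toy scores are illustrative only.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    Assumption (not from the paper): similarity[i][j] is the score of
    query i against candidate j, and candidate i is the only correct
    match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of candidates scoring strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 example: queries 0 and 2 rank their match first, query 1 ranks it third.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_scores, k):.1f}")
```

The same function would serve both retrieval directions by transposing the matrix, so that rows index text queries instead of image queries.
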

| Method | Top-1 Accuracy (%) |
|---|---|
| MetaFormer[50] | 84.3 |
| OMNIVORE[51] | 84.1 |
| RegNet-8GF[52] | 81.2 |
| VL-LTR[53] | 81.0 |
| µ2Net+[54] | 81.0 |
| MixMIM-L[55] | 80.3 |
| DeiT-B[56] | 79.5 |
| CeiT-s[57] | 79.4 |
| GPaCo[58] | 78.1 |
| ReCap | 85.1 |

| Images | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
| Captions | A few Abudefduf saxatilis swim in the stony water. | There are some red Castilleja indivisa in the grass. | A Libellula quadrimaculata is flying over the water. | A Ursus arctos horribilis and her cubs on a green field. |

| Query | A photo of Leopardus pardalis. | A photo of Phoenicopterus ruber. | A photo of Aglais io. |
|---|---|---|---|
| Dataset | Wildlife Conservation Society | Birds 510 Species-Image Classification | Animals Detection Images Dataset |
| Result | ![]() | ![]() | ![]() |
| Caption | A small Leopardus pardalis walking through a forest at night. | A pink Phoenicopterus ruber standing in the water. | A close-up of an Aglais io is sitting on top of a flower. |

| Module Composition | MSCOCO I2T-R@1 | MSCOCO T2I-R@1 | MSCOCO Cap-B4 | iNaturalist2018 I2T-R@1 | iNaturalist2018 T2I-R@1 | iNaturalist2018 Cap-B4 |
|---|---|---|---|---|---|---|
| C+C | 51.5 | 75.2 | 31.9 | 54.1 | 68.9 | 32.3 |
| C+R+C | 51.3 | 75.7 | 35.3 | 53.7 | 69.5 | 36.1 |
| ReCap | 65.5 | 77.1 | 39.8 | 63.6 | 72.2 | 41.0 |
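
One way to read the ablation table is as absolute gains of the full ReCap row over the C+C row. The short sketch below does that subtraction using only the values printed in the table above. The column labels and the variable names (`COLUMNS`, `C_PLUS_C`, `RECAP`, `gains`) are illustrative choices, not part of the paper.

```python
# Values copied from the ablation table above; labels are illustrative.
COLUMNS = [
    "MSCOCO I2T-R@1", "MSCOCO T2I-R@1", "MSCOCO Cap-B4",
    "iNaturalist2018 I2T-R@1", "iNaturalist2018 T2I-R@1", "iNaturalist2018 Cap-B4",
]
C_PLUS_C = [51.5, 75.2, 31.9, 54.1, 68.9, 32.3]
RECAP = [65.5, 77.1, 39.8, 63.6, 72.2, 41.0]

# Absolute gain in points, rounded to one decimal to hide float noise.
gains = {
    name: round(full - base, 1)
    for name, base, full in zip(COLUMNS, C_PLUS_C, RECAP)
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```
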

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).










